[ZBX-9939] wrong events (ok|problem) for item(trapper) sending via zabbix_sender in the same time Created: 2015 Oct 06  Updated: 2017 May 30  Resolved: 2016 Sep 30

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Server (S)
Affects Version/s: 2.4.5
Fix Version/s: None

Type: Incident report Priority: Blocker
Reporter: Natalia Kagan Assignee: Unassigned
Resolution: Duplicate Votes: 0
Labels: events, triggers
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

zabbix 2.4.4 on centos 6


Attachments: XML File Template app trap test.xml     JPEG File events.jpg     PNG File events.png     PNG File latest data.png     JPEG File test_results.jpg     PNG File zbx-9939-events.png    
Issue Links:
Duplicate
duplicates ZBX-9640 Incorrect timestamp using zabbix_send... Closed

 Description   

Hi,

We have a problem: when two text messages (the first "problem" and the second "ok") are sent via zabbix_sender from the same host to a trapper item at the same time, Zabbix creates events in the wrong order (the last one is "problem"), although in Latest data the values appear in the correct order.
It happens from time to time.



 Comments   
Comment by Oleksii Zagorskyi [ 2015 Oct 07 ]

Custom trigger severity colors, a bit different style of frontend, supposedly customized.

Taking into account that the values are multiline ones, it's not possible to send them from a single input file, but still, to be sure:
How do you send those 2 values? Are the values on the command line or do you use input file(s)?
Do you call zabbix_sender twice, or send a file of values with a single zabbix_sender call?
If from a file - do you have <timestamp> specified (and --with-timestamps for zabbix_sender)?

We don't see other possibly existing triggers, and we don't see the full trigger name when it is expanded on the Events page.
Is this events filter set to only one trigger?
(It looks like it is, taking into account the values in the Duration column.)

Does this trigger have "multiple problem events generation" enabled?
Did you check the "All nodes are reachable ..." value for possibly matching "severity:Minor" text in other lines (which you have hidden)?
Posting full text values as text instead of screenshots is welcome.

Please do not hide too much information next time, so as not to force us to guess
(issue reports like that will most likely be ignored/skipped).

Could be related to ZBX-9432.

Comment by Natalia Kagan [ 2015 Oct 07 ]

Sorry, will try to provide more details next time

I am able to reproduce the issue:

On host "server141" (CentOS 6.4) I ran 2 zabbix_sender commands at the same time:

%/usr/bin/zabbix_sender -c /etc/zabbix/zabbix_agentd.conf -k 'key[LPTRAP_60_]' -o severity:Warning; /usr/bin/zabbix_sender -c /etc/zabbix/zabbix_agentd.conf -k 'key[LPTRAP_60_]' -o severity:normal 
info from server: "processed: 1; failed: 0; total: 1; seconds spent: 0.000040"
sent: 1; skipped: 0; total: 1
info from server: "processed: 1; failed: 0; total: 1; seconds spent: 0.000036"
sent: 1; skipped: 0; total: 1

key[LPTRAP_60_] is part of an LLD rule; the item type is "Zabbix trapper".

trigger (with Multiple PROBLEM events generation):
    Warning {server141:key[LPTRAP_60_].regexp(severity:Warning)}=1
    Minor {server141:key[LPTRAP_60_].regexp(severity:Minor)}=1
    Major {server141:key[LPTRAP_60_].regexp(severity:Major)}=1
    Critical {server141:key[LPTRAP_60_].regexp(severity:Critical)}=1

in latest data I have :

Timestamp Value
2015-10-07 14:05:01 severity:normal
2015-10-07 14:05:01 severity:Warning

in Monitoring-> triggers :

Severity Status Info Last change Age Acknowledged Host Name Description
Warning PROBLEM   2015-10-07 14:05:01 16m 17s Acknowledge (7) server141   severity:normal:LPTRAP_60_:Wa

in Monitoring -> events (as you can see, the trigger status is "PROBLEM" but should be "OK"):

Time Host Description Status Severity Duration Ack Actions
2015-10-07 14:05:01 server141 severity:normal:LPTRAP_60_:Wa PROBLEM Warning 0 No -
2015-10-07 14:05:01 server141 severity:Warning:LPTRAP_60_:Wa PROBLEM Warning 0 No -

If I run zabbix_sender separately (one after the other):

%/usr/bin/zabbix_sender -c /etc/zabbix/zabbix_agentd.conf -k 'key[LPTRAP_60_]' -o severity:Warning
info from server: "processed: 1; failed: 0; total: 1; seconds spent: 0.000044"
sent: 1; skipped: 0; total: 1

%/usr/bin/zabbix_sender -c /etc/zabbix/zabbix_agentd.conf -k 'key[LPTRAP_60_]' -o severity:normal 
info from server: "processed: 1; failed: 0; total: 1; seconds spent: 0.000030"
sent: 1; skipped: 0; total: 1

in Monitoring -> events : there is no problem (trigger status is "OK"):

Time Host Description Status Severity Duration Ack Actions
2015-10-07 14:23:34 server141 severity:normal:LPTRAP_60_:Wa OK Warning 56s No -
2015-10-07 14:23:27 server141 severity:Warning:LPTRAP_60_:Wa PROBLEM Warning 7s No -

Let me know if you need more details

Many Thanks for the help!

Comment by Oleksii Zagorskyi [ 2015 Oct 07 ]

How many DB syncers do you have?
What is the real NVPS?
DB type?
Any errors in zabbix_server.log?

Comment by Natalia Kagan [ 2015 Oct 07 ]

20 DB syncers
14 proxy servers in 7 different locations (2 proxy servers in each location)
DB type : mysql 5.6 on ssd
zabbix server, DB,WEB installed on separate servers (centos 6)
NVPS 6242.67
on proxy that I send the example : vps 295.11
db on proxy also mysql 5.6

Comment by Oleksii Zagorskyi [ 2015 Oct 07 ]

Do you mean that you sent those 2 values to a zabbix_proxy?
This is a very important detail and should have been mentioned from the very beginning!
What is the version of the proxy?

Not sure it can be related to your case, just FYI ZBX-7825

Comment by Natalia Kagan [ 2015 Oct 07 ]

Yes, correct. I send it to the proxy.
The version is 2.4.4 for proxy, server and agent.

Not sure that it is relevant, since in Latest data I see the values in the correct order.

Comment by Oleksii Zagorskyi [ 2015 Oct 07 ]

Could you try to reproduce it by creating only one regular item and trigger, not LLD ones?
Maybe by creating an additional host for testing, pointed to the proxy of course.

Comment by Marc [ 2015 Oct 07 ]

I'm wondering whether it could be related to ZBX-6942.

Comment by Natalia Kagan [ 2015 Oct 08 ]

It doesn't happen with a regular item and trigger; the problem occurs only for the LLD one. It doesn't happen every time, but it happens very often.

I created a regular item and trigger on the same host and did the test:

sending to LLD one :

% /usr/bin/zabbix_sender -c /etc/zabbix/zabbix_agentd.conf -k 'key[LPTRAP_60_]' -o severity:Warning;/usr/bin/zabbix_sender -c /etc/zabbix/zabbix_agentd.conf -k 'key[LPTRAP_60_]' -o severity:normal;/usr/bin/zabbix_sender -c /etc/zabbix/zabbix_agentd.conf -k 'key[LPTRAP_60_]' -o severity:Warning;/usr/bin/zabbix_sender -c /etc/zabbix/zabbix_agentd.conf -k 'key[LPTRAP_60_]' -o severity:normal
info from server: "processed: 1; failed: 0; total: 1; seconds spent: 0.000049"
sent: 1; skipped: 0; total: 1
info from server: "processed: 1; failed: 0; total: 1; seconds spent: 0.000040"
sent: 1; skipped: 0; total: 1
info from server: "processed: 1; failed: 0; total: 1; seconds spent: 0.000057"
sent: 1; skipped: 0; total: 1
info from server: "processed: 1; failed: 0; total: 1; seconds spent: 0.000023"
sent: 1; skipped: 0; total: 1

sending to regular item:

%/usr/bin/zabbix_sender -c /etc/zabbix/zabbix_agentd.conf -k 'key[LPTRAP_60_test]' -o severity:Warning;/usr/bin/zabbix_sender -c /etc/zabbix/zabbix_agentd.conf -k 'key[LPTRAP_60_test]' -o severity:normal;/usr/bin/zabbix_sender -c /etc/zabbix/zabbix_agentd.conf -k 'key[LPTRAP_60_test]' -o severity:Warning;/usr/bin/zabbix_sender -c /etc/zabbix/zabbix_agentd.conf -k 'key[LPTRAP_60_test]' -o severity:normal
info from server: "processed: 1; failed: 0; total: 1; seconds spent: 0.000053"
sent: 1; skipped: 0; total: 1
info from server: "processed: 1; failed: 0; total: 1; seconds spent: 0.000146"
sent: 1; skipped: 0; total: 1
info from server: "processed: 1; failed: 0; total: 1; seconds spent: 0.000036"
sent: 1; skipped: 0; total: 1
info from server: "processed: 1; failed: 0; total: 1; seconds spent: 0.000040"
sent: 1; skipped: 0; total: 1

in Monitoring -> events :
for LLD it creates incorrect events ("severity:normal" should be in status "OK"):

Time Host Description Status Severity Duration Ack Actions
2015-10-08 07:51:01 server141 severity:normal:LPTRAP_60_:Wa PROBLEM Warning 0 No -
2015-10-08 07:51:01 server141 severity:Warning:LPTRAP_60_:Wa PROBLEM Warning 0 No -
2015-10-08 07:51:01 server141 severity:normal:LPTRAP_60_:Wa PROBLEM Warning 0 No -
2015-10-08 07:51:01 server141 severity:Warning:LPTRAP_60_:Wa PROBLEM Warning 0 No -

for the regular item it's OK:

Time Host Description Status Severity Duration Ack Actions
2015-10-08 07:51:06 server141 LPTRAP_60_test warning OK Warning 5m 32s No -
2015-10-08 07:51:06 server141 LPTRAP_60_test warning PROBLEM Warning 0 No -
2015-10-08 07:51:06 server141 LPTRAP_60_test warning OK Warning 0 No -
2015-10-08 07:51:06 server141 LPTRAP_60_test warning PROBLEM Warning 0 No -

Thanks

Comment by Oleksii Zagorskyi [ 2015 Oct 08 ]

OK, please confirm for the LLD case:
Does the item have only 4 LLD triggers based on it, as we see in the screenshot (and in a comment as text)?

Comment by Natalia Kagan [ 2015 Oct 09 ]

Correct, only for LLD.
Yes, 4 LLD triggers.

Thanks

Comment by Oleksii Zagorskyi [ 2015 Oct 09 ]

When you tested with the regular item/trigger - did you also create 4 triggers for the regular item?

OK, one more request for a test:
Could you try to reproduce it on a newly created host, but with an LLD-created item and 4 triggers?
Why - the structures and relations of LLD entities in the Zabbix database are quite complex. I would like to know whether you can reproduce it with fresh, newly created LLD entities.

Comment by Natalia Kagan [ 2015 Oct 12 ]

I created a new template with an LLD rule (4 triggers) and a regular item (4 triggers), assigned it to a new host (reporting to the same proxy as the first host), and now I am not able to reproduce the issue (it happened only once).
What does that mean?
I have 5000 hosts assigned to the template and I need the history, so I can't reassign them.

Thanks

Comment by Oleksii Zagorskyi [ 2015 Oct 12 ]

Looking at your previous attachments, I believe that Zabbix server really did something wrong. And I don't like that.

Older Zabbix server versions had some issues with LLD object creation (like ZBX-5781, ZBX-5218). So, theoretically, your failing LLD triggers may be related to those issues, if, for example, they were created by older server versions.

The bad thing is that this issue is only randomly reproducible, so most likely it is not related to possibly wrong data in the database, but who knows...
It looks more like a performance-related issue; I mean, your 6242 NVPS is not so small and may have some effect on the DB syncers etc.

All said - IMO.

I'd be happy to reproduce it, but I'm afraid I'd spend a lot of time without managing to.
That's why I'm asking you for more tests, in the hope of finding some exact pattern in your Zabbix installation.

Comment by Oleksii Zagorskyi [ 2015 Oct 12 ]

If you can reproduce it at least once (even if only randomly) for newly created items/triggers - that is also a good result for us, and it forces us to consider the issue seriously.
Please confirm for newly created items/triggers.

Comment by Natalia Kagan [ 2015 Oct 12 ]

OK, I have bad/good news.

I did the tests again with the attached template (Template app trap test.xml), which includes an LLD rule (4 triggers) and a regular item (4 triggers),

and the results are (see the attached file test_results.jpg):

LLD: out of 211 tests, the trigger was wrong 3 times
regular item: out of 211 tests, the trigger was also wrong 3 times (not at the same time as the LLD)

In order to use my template:

for the LLD item, run:

/usr/bin/zabbix_sender -c /etc/zabbix/zabbix_agentd.conf -k external.lptraps.discovery.test -o "{\"data\":[{\"{#TRAPKEY}\":\"LPTRAP_60_\"}]}"

then run :

/usr/bin/zabbix_sender -c /etc/zabbix/zabbix_agentd.conf -k 'key[LPTRAP_60_1]' -o severity:Warning;
/usr/bin/zabbix_sender -c /etc/zabbix/zabbix_agentd.conf -k 'key[LPTRAP_60_1]' -o severity:normal;
/usr/bin/zabbix_sender -c /etc/zabbix/zabbix_agentd.conf -k 'key[LPTRAP_60_1]' -o severity:Warning;
/usr/bin/zabbix_sender -c /etc/zabbix/zabbix_agentd.conf -k 'key[LPTRAP_60_1]' -o severity:normal;
/usr/bin/zabbix_sender -c /etc/zabbix/zabbix_agentd.conf -k 'key[LPTRAP_60_1]' -o severity:Warning;
/usr/bin/zabbix_sender -c /etc/zabbix/zabbix_agentd.conf -k 'key[LPTRAP_60_1]' -o severity:normal;
/usr/bin/zabbix_sender -c /etc/zabbix/zabbix_agentd.conf -k 'key[LPTRAP_60_1]' -o severity:Warning;
/usr/bin/zabbix_sender -c /etc/zabbix/zabbix_agentd.conf -k 'key[LPTRAP_60_1]' -o severity:normal

for the regular item:

/usr/bin/zabbix_sender -c /etc/zabbix/zabbix_agentd.conf -k 'key[LPTRAP_60_test]' -o severity:Warning;
/usr/bin/zabbix_sender -c /etc/zabbix/zabbix_agentd.conf -k 'key[LPTRAP_60_test]' -o severity:normal;
/usr/bin/zabbix_sender -c /etc/zabbix/zabbix_agentd.conf -k 'key[LPTRAP_60_test]' -o severity:Warning;
/usr/bin/zabbix_sender -c /etc/zabbix/zabbix_agentd.conf -k 'key[LPTRAP_60_test]' -o severity:normal;
/usr/bin/zabbix_sender -c /etc/zabbix/zabbix_agentd.conf -k 'key[LPTRAP_60_test]' -o severity:Warning;
/usr/bin/zabbix_sender -c /etc/zabbix/zabbix_agentd.conf -k 'key[LPTRAP_60_test]' -o severity:normal;
/usr/bin/zabbix_sender -c /etc/zabbix/zabbix_agentd.conf -k 'key[LPTRAP_60_test]' -o severity:Warning;
/usr/bin/zabbix_sender -c /etc/zabbix/zabbix_agentd.conf -k 'key[LPTRAP_60_test]' -o severity:normal

Let me know if you want me to do more tests

btw, I started to work with Zabbix at v2.0, then 2.2, 2.4, 2.4.4.
The problematic template was created in version 2.2 but was assigned to hosts only in v2.4.4.

Thanks!

Comment by Oleksii Zagorskyi [ 2015 Oct 12 ]

Please try some other tests:
1. leave only one trigger (the one which matches the problem value) for the item and retry;
2. add more triggers, to have, for example, 10 or 20 triggers (similar to the existing 3 which do not match the problem value) for a single item, and retry.

Comment by Natalia Kagan [ 2015 Oct 12 ]

With one trigger, during 200 tests:

LLD didn't fail
regular failed 1 time

It looks like it fails from time to time, regardless of whether it's an LLD or regular item and of how many triggers are defined.

Comment by Natalia Kagan [ 2015 Oct 12 ]

With 10 triggers:

lld failed 9 times
regular item failed 1 time

Comment by Oleg Ivanivskyi [ 2016 Apr 30 ]

You can reproduce this issue by sending several values via Zabbix sender at the same time, using a "{Zabbix server:test.strlen()}>0" trigger with "Multiple PROBLEM events generation" enabled.
Generated events:

Commands executed in one go are:

zabbix_sender -z 127.0.0.1 -s "Zabbix server" -k test -o 'bal bla'
zabbix_sender -z 127.0.0.1 -s "Zabbix server" -k test -o ""
zabbix_sender -z 127.0.0.1 -s "Zabbix server" -k test -o "bal bla"
zabbix_sender -z 127.0.0.1 -s "Zabbix server" -k test -o ""

My environment - Zabbix server 2.4.7, no proxies.

Comment by Oleg Ivanivskyi [ 2016 Apr 30 ]

In the DB:

+----+--------+----------------------+---------+-----------+
| id | itemid | from_unixtime(clock) | value   | ns        |
+----+--------+----------------------+---------+-----------+
|  9 |  23660 | 2016-04-29 21:00:43  | bal bla |  31162306 |
| 10 |  23660 | 2016-04-29 21:00:43  |         |  32888625 |
| 11 |  23660 | 2016-04-29 21:00:43  | bal bla |  34332592 |
| 12 |  23660 | 2016-04-29 21:00:43  |         |  35673574 |
+----+--------+----------------------+---------+-----------+

Comment by Glebs Ivanovskis (Inactive) [ 2016 May 23 ]

This may happen simply because it's the internet, and there is no guarantee that values sent with independent zabbix_sender commands will arrive at, and be processed by, the server in the same order. Things that might help:

  • server log with DebugLevel=4
  • problem/recovery messages with {ITEM.VALUE} and {ITEM.LASTVALUE}
  • captured TCP dump
Comment by Natalia Kagan [ 2016 May 25 ]

Hi,

It is a problem to provide a server log with DebugLevel=4 and a TCP dump, since I have ~6000 NVPS on the Zabbix server.
But as you can see in the previous comment, there is a way to reproduce the issue; maybe you could do it in your environment and see all the relevant information?

Thanks!

Comment by Glebs Ivanovskis (Inactive) [ 2016 May 25 ]

Dear Natalia,

Since there is nothing specific about your installation and the issue can be reproduced, there is no need to increase the debug log level or capture a TCP dump on a 6000 NVPS installation. Thank you for the information you've already provided. My comment was addressed to Oleg, because so far I've failed to reproduce the issue myself (but I was using trunk, with the new history cache implemented in 3.0). I apologise for misleading you.

Comment by Natalia Kagan [ 2016 May 25 ]

Thanks for the update!
I'm happy that nothing needs to be done from my side.

Comment by Glebs Ivanovskis (Inactive) [ 2016 May 26 ]

Natalia,

A few questions for you. As I understand it, this issue is randomly reproducible with quite low probability (0-9 failures out of ~200 tests), and for ordinary, templated and LLD items/triggers alike. This makes me think that trigger evaluation itself must be correct, but that sometimes the pre-last value is used instead of the last one to evaluate the trigger.

  • Is there any difference in failure chances when you send values through proxy or directly to server?
  • What version of sender do you use?
  • Does server run on virtual or physical machine?
  • Is there any chance of clock adjustments (NTP or something)?

I will be very glad to see listings from the database (ids, clocks and ns from history and events) like the one Oleg provided.

I think ZBX-6942 is not related. My initial guess about TCP connections not arriving in order can explain the OK/PROBLEM order swapping in ZBX-6942, but here we have a miscalculated trigger. And the DB listing from Oleg shows that the values arrived (and were written to the DB) in order.

So, minor clock adjustments or discontinuities and/or a race condition between history syncers are my best guesses at the moment.

Comment by Natalia Kagan [ 2016 May 31 ]

Hi,

Sorry for delay.

1. We don't send values to the server directly, only via a proxy. We have a very complex structure: zabbix agent -> proxy VIP (F5) -> proxy server -> zabbix server VIP (F5) -> zabbix server -> zabbix DB VIP (F5) -> zabbix DB server
2. The version is 2.4.4
3. Zabbix server and DB are on physical machines (CentOS 6); proxy and web are virtual on OpenStack, also CentOS 6
4. All servers are synced by NTP

Could you provide me with the query for the DB (MySQL 5.6) for the ids, clocks and ns from history and events?

Oleg was able to reproduce the problem in your environment, so maybe it's not related to my environment and configuration...

Thanks for your help
Natalia

Comment by Glebs Ivanovskis (Inactive) [ 2016 May 31 ]

No worries! Thank you for the information.

Unfortunately, Oleg dropped the environment where he reproduced the issue, and his attempts to recreate it were unsuccessful.

Can I assume that you know problematic event clock and item/host which generated it?

Comment by Glebs Ivanovskis (Inactive) [ 2016 May 31 ]

Just noticed a weird thing: although the trigger value is displayed as PROBLEM, the description of the event reads as if it's OK - am I correct?

Comment by Natalia Kagan [ 2016 May 31 ]

Yes, the description (item.value) is correct, but the status and severity are wrong.
Regarding the query: I know the host and item.
Thanks

Comment by Glebs Ivanovskis (Inactive) [ 2016 Jun 01 ]

Hi!

Sorry for the delay. The query should look like this (you need to choose the appropriate history table and replace HOSTNAME, ITEM_NAME and TIMESTAMP):

select e.*, hi.*
from events as e, history/history_log/history_str/history_text/history_uint as hi, triggers as t, functions as f, items as i, hosts as ho
where ho.name = HOSTNAME and i.name = ITEM_NAME and e.clock = TIMESTAMP
and hi.itemid = i.itemid and e.clock = hi.clock and e.ns = hi.ns
and e.source = 0 and e.object = 0 and e.objectid = t.triggerid
and f.triggerid = t.triggerid and f.itemid = i.itemid;

This will return the events of all the item's triggers generated at the given clock, so you might want to massage the output afterwards.

Looking forward to your reply!

Comment by Glebs Ivanovskis (Inactive) [ 2016 Jun 01 ]

There is a scenario in which it may happen:

  1. value A is sent
  2. value B is sent
  3. value B is processed and assigned a timestamp by trapper
  4. value A is processed and assigned a timestamp by trapper
  5. value A is flushed to history cache
  6. value A gets processed by history syncer, trigger evaluates to 1, PROBLEM event is generated
  7. value B is flushed to history cache
  8. value B gets processed by history syncer, since timestamp(A)>timestamp(B) trigger assumes last value is A and evaluates to 1, PROBLEM event is generated

For this to happen, the order must change twice as A and B progress along the pipeline. This is unlikely, but not impossible. However, it contradicts Oleg's DB contents, where the timestamps are in perfect order. It's a pity we can't double-check this!

With a small addition to process_hist_data(), which adds an extra 2 ms to non-empty values, it is possible to simulate the aforementioned scenario.

		if ('\0' != *tmp)	/* delay non-empty ("problem") values by an extra 2 ms */
		{
			av->ts.ns += 2000000;

			if (1000000000 <= av->ts.ns)	/* carry nanosecond overflow into seconds */
			{
				av->ts.ns -= 1000000000;
				av->ts.sec++;
			}
		}

I send:

$ src/zabbix_sender/zabbix_sender -z localhost -p 10051 -s Dummy -k test.trapper -o test && src/zabbix_sender/zabbix_sender -z localhost -p 10051 -s Dummy -k test.trapper -o ''
info from server: "processed: 1; failed: 0; total: 1; seconds spent: 0.000138"
sent: 1; skipped: 0; total: 1
info from server: "processed: 1; failed: 0; total: 1; seconds spent: 0.000060"
sent: 1; skipped: 0; total: 1

This generates two PROBLEM events:

select e.*, hi.* from events as e, history_text as hi, triggers as t, functions as f, items as i, hosts as ho where ho.name = 'Dummy' and i.name = 'Test' and hi.itemid = i.itemid and e.source = 0 and e.object = 0 and e.objectid = t.triggerid and f.triggerid = t.triggerid and f.itemid = i.itemid and e.clock = hi.clock and e.ns = hi.ns and e.clock = 1464778041;
 eventid | source | object | objectid |   clock    | value | acknowledged |    ns     | id  | itemid |   clock    | value |    ns     
---------+--------+--------+----------+------------+-------+--------------+-----------+-----+--------+------------+-------+-----------
     302 |      0 |      0 |    13609 | 1464778041 |     1 |            0 | 960187415 | 236 |  23770 | 1464778041 | test  | 960187415
     303 |      0 |      0 |    13609 | 1464778041 |     1 |            0 | 959718399 | 237 |  23770 | 1464778041 |       | 959718399
(2 rows)
Comment by Glebs Ivanovskis (Inactive) [ 2016 Jun 06 ]

Dear Natalia,

The contents of your DB will still be much appreciated, as they can prove or disprove my hypothesis. Also, I would like to ask about your NTP settings and how NTP behaves in your setup. Does it adjust the clock-ticking speed or perform jumps (and how big are they)? Does the "hardware clock" of the proxy's virtual environment drift away from the host machine's clock?

Comment by Natalia Kagan [ 2016 Jun 07 ]

Hi,

Regarding NTP, we have the same ntp.conf on all hosts.
Proxy servers are virtual on KVM/OpenStack.
Zabbix server/DB are physical.
Monitored hosts can be physical or virtual (VMware/OpenStack).

%less /etc/ntp.conf | grep -v "^#" | grep -v "^$"
restrict default kod nomodify notrap nopeer noquery
restrict -6 default kod nomodify notrap nopeer noquery
restrict 127.0.0.1 
restrict -6 ::1
fudge   127.127.1.0 stratum 10
driftfile /var/lib/ntp/drift
keys /etc/ntp/keys
server xxx.xxx.xxx.xxx
...

I don't have more information; let me know if you need more details.

Regarding the DB data, I am trying to run the test now and will update.

Thanks
Natalia

Comment by Natalia Kagan [ 2016 Jun 13 ]

Good morning,

I am not able to reproduce the problem with my test template anymore :-)
We didn't upgrade Zabbix and are still on version 2.4.4, so I can't explain this.

I will try to reproduce it later again and will update

Thanks!

Comment by Glebs Ivanovskis (Inactive) [ 2016 Jun 13 ]

Good morning!

I'm glad that you are not experiencing the issue any more. I would recommend not using two sender calls when the order of values is crucial. To be on the safe side, it's better to put the values in a file and call zabbix_sender once.
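To illustrate the single-call approach: a sketch using the host and item key from the earlier comments (names and paths are illustrative, and this only works for single-line values). With -T/--with-timestamps, each input line has the form `<host> <key> <timestamp> <value>`, so the ordering is fixed by the file rather than by network timing:

```shell
# Build an input file with explicit timestamps (illustrative host/key/timestamps)
cat > /tmp/zbx_values.txt <<'EOF'
server141 key[LPTRAP_60_] 1444219501 severity:Warning
server141 key[LPTRAP_60_] 1444219502 severity:normal
EOF

# One sender call, one connection, one batch - values cannot be split across
# parcels or reordered in transit. Commented out here, as it needs a live
# server/proxy reachable via the agent config:
# zabbix_sender -c /etc/zabbix/zabbix_agentd.conf -T -i /tmp/zbx_values.txt
```

Since both values travel in a single "sender data" request, the server processes them together and the trigger sees them in file order.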

My thoughts on how timekeeping and NTP issues may lead to this situation and ideas how we can improve things are registered as ZBXNEXT-3298.

Comment by Natalia Kagan [ 2016 Jun 13 ]

I am not sure that the problem doesn't exist anymore; for real alerts we implemented a workaround: we added a "sleep 5" between zabbix_sender calls. I'm just not able to reproduce it any more.

We have no control over how many values zabbix_sender sends, or when, since the events come from application code (we have a very complex solution that uses LLD trapper items).

Comment by dimir [ 2016 Sep 27 ]

Overview

I was able to reproduce the problem. I used a trapper item with the trigger {Zabbix server:test.strlen()}<3 (with multiple PROBLEM events generation), monitored by a proxy, with all components on the latest 2.2 (2.2.15rc1 r62671).

In my case the problem happened when the values sent ended up in different parcels. In order to guarantee that, I had to apply the following change to the Zabbix proxy:

Index: include/proxy.h
===================================================================
--- include/proxy.h     (revision 62691)
+++ include/proxy.h     (working copy)
@@ -26,7 +26,8 @@
 #define ZBX_PROXYMODE_ACTIVE   0
 #define ZBX_PROXYMODE_PASSIVE  1
 
-#define ZBX_MAX_HRECORDS       1000
+//#define ZBX_MAX_HRECORDS     1000
+#define ZBX_MAX_HRECORDS       1
 
 #define AGENT_VALUE    struct zbx_agent_value_t

This guarantees that the history data parcels sent from the proxy to the server always contain 1 value.

I was feeding PROBLEM/OK value pairs, sent one by one using zabbix_sender, to the proxy:

while true; do for i in a bbb; do bin/zabbix_sender -z 192.168.1.100 -s "Zabbix server" -k test -o $i; done; sleep 3; done
  • PROBLEM value: "a" (less than 3 characters)
  • OK value: "bbb"

The proxy sets value timestamps to the current timestamp. Then, when the server receives the history data, it adjusts every value timestamp by the time difference between the proxy and the server (proxy_timediff). Here's what the server does for each history data parcel:

  1. get the current timestamp
  2. get the timestamp of the parcel
  3. calculate the difference as proxy_timediff
  4. apply proxy_timediff to every value timestamp

The time difference between the proxy and the server (proxy_timediff) is supposed to preserve the order of the history data flow, but it appears that in some cases (network latency) it can, on the contrary, scramble the timestamps of values in separate parcels.

In the following example, 3 pairs of PROBLEM/OK values were sent (that's 6 parcels of "history data", 1 value per parcel). The values of the second pair (the 3rd and 4th parcels) end up on the server with crossed timestamps, resulting in the 3rd value (which is PROBLEM) being processed 2 times and the 4th not being processed at all. This happens because, at the moment the 4th value is processed, the 3rd is still the latest (it has the greater timestamp).
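The crossing can be checked with the numbers from the logs below. Since parcels 3 and 4 share the same seconds value (1474979073), only the nanosecond parts matter; each parcel is shifted by its *own* proxy_timediff, so a later parcel with a smaller timediff can land behind an earlier one:

```shell
# proxy_timediff arithmetic for parcels 3 and 4, in integer nanoseconds
# (values taken from the proxy and server logs below)
awk 'BEGIN {
    p3 = 651108838 + 8793273    # parcel 3 value ns + its proxy_timediff (8.793273 ms)
    p4 = 656584887 + 2290530    # parcel 4 value ns + its proxy_timediff (2.290530 ms)
    printf "parcel 3 adjusted ns: %d\n", p3
    printf "parcel 4 adjusted ns: %d\n", p4
    print (p4 < p3 ? "order inverted" : "order preserved")
}'
```

This prints 659902111 and 658875417 (matching the server log) and reports "order inverted": parcel 4 was sent 5.48 ms after parcel 3, but its timediff was 6.5 ms smaller, which pushes the adjusted timestamp into the past.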

Simple view of the problem

What happens on the server

This is taken from zabbix_server.log

action value sec ns comment
receive parcel 1 'a' 1474979070 631108357 proxy_timediff: 0.008509230
receive parcel 2 'bbb' 1474979070 636563317 proxy_timediff: 0.004952259
insert value 1 'a' 1474979070 639617587  
add event 1 PROBLEM 1474979070 639617587  
receive parcel 3 'a' 1474979073 651108838 proxy_timediff: 0.008793273
receive parcel 4 'bbb' 1474979073 656584887 proxy_timediff: 0.002290530
insert value 2 'bbb' 1474979070 641515576  
add event 2 OK 1474979070 641515576  
insert value 3 'a' 1474979073 659902111  
add event 3 PROBLEM 1474979073 659902111  
insert value 4 'bbb' 1474979073 658875417 after applying proxy_timediff the value timestamp ends up in the past relative to value 3
add event 4 PROBLEM 1474979073 658875417 at the moment of processing value 4, value 3 has the latest timestamp, thus a false PROBLEM event is generated
receive parcel 5 'a' 1474979076 681714069 proxy_timediff: 0.004925885
receive parcel 6 'bbb' 1474979076 686862613 proxy_timediff: 0.002742814
insert value 5 'a' 1474979076 686639954  
add event 5 PROBLEM 1474979076 686639954  
insert value 6 'bbb' 1474979076 689605427  
add event 6 OK 1474979076 689605427  

Detailed view of the problem

History values sent by proxy

zabbix_proxy.log:

-- parcel 1 (PROBLEM value)
 29682:20160927:152434.271 In send_data_to_server() [{
        "request":"history data",
        "host":"Zabbix proxy",
        "data":[
                {
                        "host":"Zabbix server",
                        "key":"test",
                        "clock":1474979070,
                        "ns":631108357,
                        "value":"a"}],
        "clock":1474979074,
        "ns":271853707}]
-- parcel 2 (OK value)
 29682:20160927:152435.287 In send_data_to_server() [{
        "request":"history data",
        "host":"Zabbix proxy",
        "data":[
                {
                        "host":"Zabbix server",
                        "key":"test",
                        "clock":1474979070,
                        "ns":636563317,
                        "value":"bbb"}],
        "clock":1474979075,
        "ns":287670248}]
-- parcel 3 (PROBLEM value)
 29682:20160927:152437.296 In send_data_to_server() [{
        "request":"history data",
        "host":"Zabbix proxy",
        "data":[
                {
                        "host":"Zabbix server",
                        "key":"test",
                        "clock":1474979073,
                        "ns":651108838,
                        "value":"a"}],
        "clock":1474979077,
        "ns":296742783}]
-- parcel 4 (OK value)
 29682:20160927:152438.309 In send_data_to_server() [{
        "request":"history data",
        "host":"Zabbix proxy",
        "data":[
                {
                        "host":"Zabbix server",
                        "key":"test",
                        "clock":1474979073,
                        "ns":656584887,
                        "value":"bbb"}],
        "clock":1474979078,
        "ns":309290537}]
-- parcel 5 (PROBLEM value)
 29682:20160927:152442.319 In send_data_to_server() [{
        "request":"history data",
        "host":"Zabbix proxy",
        "data":[
                {
                        "host":"Zabbix server",
                        "key":"test",
                        "clock":1474979076,
                        "ns":681714069,
                        "value":"a"}],
        "clock":1474979082,
        "ns":319267404}]
-- parcel 6 (OK value)
 29682:20160927:152443.327 In send_data_to_server() [{
        "request":"history data",
        "host":"Zabbix proxy",
        "data":[
                {
                        "host":"Zabbix server",
                        "key":"test",
                        "clock":1474979076,
                        "ns":686862613,
                        "value":"bbb"}],
        "clock":1474979083,
        "ns":327434212}]
--

This is the part of the log where the server was receiving the parcels. While receiving them, the server adjusts the value timestamps according to proxy_timediff. Notice how, after adjustment, the timestamp of the 4th parcel ends up in the past relative to the 3rd.

History values received by server from proxy

zabbix_server.log:

-- parcel 1 (PROBLEM value)
                        "clock":1474979070,
                        "ns":631108357,
                        "value":"a"}],
 29645:20160927:152434.280 DIMIR: proxy_timediff: 0.009 sec
 29645:20160927:152434.280 DIMIR: process_hist_data() after proxy_timediff adjustment [  a] 1474979070.639617587
-- parcel 2 (OK value)
                        "clock":1474979070,
                        "ns":636563317,
                        "value":"bbb"}],
 29644:20160927:152435.292 DIMIR: proxy_timediff: 0.005 sec
 29644:20160927:152435.292 DIMIR: process_hist_data() after proxy_timediff adjustment [bbb] 1474979070.641515576 <-- all good: after proxy_timediff adjustment the second value is still in the future from the first
-- parcel 3 (PROBLEM value)
                        "clock":1474979073,
                        "ns":651108838,
                        "value":"a"}],
 29642:20160927:152437.305 DIMIR: proxy_timediff: 0.009 sec
 29642:20160927:152437.305 DIMIR: process_hist_data() after proxy_timediff adjustment [  a] 1474979073.659902111
-- parcel 4 (OK value, but after server adjusted value timestamp it ends up before 3rd value!)
                        "clock":1474979073,
                        "ns":656584887,
                        "value":"bbb"}],
 29641:20160927:152438.311 DIMIR: proxy_timediff: 0.002 sec                                                       <-- BUG: proxy_timediff of the second parcel is much smaller than the first (0.009), the difference is 0.007 seconds, while the difference of value timestamps is just 0.005!
 29641:20160927:152438.311 DIMIR: process_hist_data() after proxy_timediff adjustment [bbb] 1474979073.658875417  <-- BUG: after proxy_timediff adjustment the second value is in the past from the first!
-- parcel 5 (PROBLEM value)
                        "clock":1474979076,
                        "ns":681714069,
                        "value":"a"}],
 29641:20160927:152442.324 DIMIR: proxy_timediff: 0.005 sec
 29641:20160927:152442.324 DIMIR: process_hist_data() after proxy_timediff adjustment [  a] 1474979076.686639954
-- parcel 6 (OK value)
                        "clock":1474979076,
                        "ns":686862613,
                        "value":"bbb"}],
 29641:20160927:152443.330 DIMIR: proxy_timediff: 0.003 sec
 29641:20160927:152443.330 DIMIR: process_hist_data() after proxy_timediff adjustment [bbb] 1474979076.689605427 <-- all good: after proxy_timediff adjustment the second value is still in the future from the first
--

Proxy history table

mysql> select *,from_unixtime(clock) from proxy_history where clock between 1474979070 and 1474979076 order by clock,ns;
+------+--------+------------+-----------+--------+----------+-------+------------+-----------+-------+----------------------+
| id   | itemid | clock      | timestamp | source | severity | value | logeventid | ns        | state | from_unixtime(clock) |
+------+--------+------------+-----------+--------+----------+-------+------------+-----------+-------+----------------------+
| 1719 |  23661 | 1474979070 |         0 |        |        0 | a     |          0 | 631108357 |     0 | 2016-09-27 15:24:30  |
| 1720 |  23661 | 1474979070 |         0 |        |        0 | bbb   |          0 | 636563317 |     0 | 2016-09-27 15:24:30  |
| 1721 |  23661 | 1474979073 |         0 |        |        0 | a     |          0 | 651108838 |     0 | 2016-09-27 15:24:33  |
| 1722 |  23661 | 1474979073 |         0 |        |        0 | bbb   |          0 | 656584887 |     0 | 2016-09-27 15:24:33  |
| 1723 |  23661 | 1474979076 |         0 |        |        0 | a     |          0 | 681714069 |     0 | 2016-09-27 15:24:36  |
| 1724 |  23661 | 1474979076 |         0 |        |        0 | bbb   |          0 | 686862613 |     0 | 2016-09-27 15:24:36  |
+------+--------+------------+-----------+--------+----------+-------+------------+-----------+-------+----------------------+
6 rows in set (0.00 sec)

Server history table (order of 3rd and 4th values is screwed)

mysql> select * from history_str where clock between 1474979070 and 1474979076 order by clock,ns;
+--------+------------+-------+-----------+
| itemid | clock      | value | ns        |
+--------+------------+-------+-----------+
|  23661 | 1474979070 | a     | 639617587 |
|  23661 | 1474979070 | bbb   | 641515576 |
|  23661 | 1474979073 | bbb   | 658875417 |
|  23661 | 1474979073 | a     | 659902111 |
|  23661 | 1474979076 | a     | 686639954 |
|  23661 | 1474979076 | bbb   | 689605427 |
+--------+------------+-------+-----------+
6 rows in set (0.00 sec)

Event generation

 29653:20160927:152435.560 query [txnlev:1] [insert into events (eventid,source,object,objectid,clock,ns,value) values (38638,0,0,13558,1474979070,639617587,1);
 29652:20160927:152440.548 query [txnlev:1] [insert into events (eventid,source,object,objectid,clock,ns,value) values (38639,0,0,13558,1474979070,641515576,0);
 29652:20160927:152440.553 query [txnlev:1] [insert into events (eventid,source,object,objectid,clock,ns,value) values (38640,0,0,13558,1474979073,659902111,1);
 29652:20160927:152440.556 query [txnlev:1] [insert into events (eventid,source,object,objectid,clock,ns,value) values (38641,0,0,13558,1474979073,658875417,1); <-- BUG
 29652:20160927:152445.558 query [txnlev:1] [insert into events (eventid,source,object,objectid,clock,ns,value) values (38642,0,0,13558,1474979076,686639954,1);
 29652:20160927:152445.565 query [txnlev:1] [insert into events (eventid,source,object,objectid,clock,ns,value) values (38643,0,0,13558,1474979076,689605427,0);

Trigger changes

--
 29653:20160927:152435.560 In process_trigger() triggerid:13558 value:0(0) new_value:1
 29653:20160927:152435.560 query [txnlev:1] [update triggers set lastchange=1474979070,value=1 where triggerid=13558;
--
 29652:20160927:152440.548 In process_trigger() triggerid:13558 value:1(0) new_value:0
 29652:20160927:152440.548 query [txnlev:1] [update triggers set lastchange=1474979070,value=0 where triggerid=13558;
--
 29652:20160927:152440.553 In process_trigger() triggerid:13558 value:0(0) new_value:1
 29652:20160927:152440.553 query [txnlev:1] [update triggers set lastchange=1474979073,value=1 where triggerid=13558;
--
 29652:20160927:152440.555 In process_trigger() triggerid:13558 value:1(0) new_value:1
 29652:20160927:152440.555 query [txnlev:1] [update triggers set lastchange=1474979073 where triggerid=13558;        <-- BUG, value is still 1
--
 29652:20160927:152445.558 In process_trigger() triggerid:13558 value:1(0) new_value:1
 29652:20160927:152445.558 query [txnlev:1] [update triggers set lastchange=1474979076 where triggerid=13558;
--
 29652:20160927:152445.565 In process_trigger() triggerid:13558 value:1(0) new_value:0
 29652:20160927:152445.565 query [txnlev:1] [update triggers set lastchange=1474979076,value=0 where triggerid=13558;
--

Trigger expression evaluation

--
 29653:20160927:152435.559 End of zbx_vc_get_value_range():SUCCEED count:1 cached:1
 29653:20160927:152435.560 End of evaluate_function():SUCCEED value:'1'
 29653:20160927:152435.560 zbx_substitute_functions_results() expression[0]:'{13165}<3' => '1<3'
--
 29652:20160927:152440.547 End of zbx_vc_get_value_range():SUCCEED count:1 cached:1
 29652:20160927:152440.547 End of evaluate_function():SUCCEED value:'3'
 29652:20160927:152440.547 zbx_substitute_functions_results() expression[0]:'{13165}<3' => '3<3'
--
 29652:20160927:152440.553 End of zbx_vc_get_value_range():SUCCEED count:1 cached:1
 29652:20160927:152440.553 End of evaluate_function():SUCCEED value:'1'
 29652:20160927:152440.553 zbx_substitute_functions_results() expression[0]:'{13165}<3' => '1<3'
--
 29652:20160927:152440.555 End of zbx_vc_get_value_range():SUCCEED count:1 cached:1
 29652:20160927:152440.555 End of evaluate_function():SUCCEED value:'1'
 29652:20160927:152440.555 zbx_substitute_functions_results() expression[0]:'{13165}<3' => '1<3'
--
 29652:20160927:152445.558 End of zbx_vc_get_value_range():SUCCEED count:1 cached:1
 29652:20160927:152445.558 End of evaluate_function():SUCCEED value:'1'
 29652:20160927:152445.558 zbx_substitute_functions_results() expression[0]:'{13165}<3' => '1<3'
--
 29652:20160927:152445.565 End of zbx_vc_get_value_range():SUCCEED count:1 cached:1
 29652:20160927:152445.565 End of evaluate_function():SUCCEED value:'3'
 29652:20160927:152445.565 zbx_substitute_functions_results() expression[0]:'{13165}<3' => '3<3'
--

Monitoring -> Events

Comment by dimir [ 2016 Sep 28 ]

The problem happens because of an imprecise proxy/server time difference calculation.

I believe this was partly fixed in ZBX-9640 by calculating proxy_timediff before actually receiving the data packet (less dependency on network latency), not after. Your version 2.4.4 does not contain the fix. You need to upgrade to at least 2.4.8rc1 or 3.0.0rc1.

Of the latest versions, I could reproduce my scenario only on the 2.2 branch.

Comment by Veeresh K [ 2016 Sep 29 ]

I am probably hitting the same issue. I have two trapper items (numeric/unsigned) and I am sending data for them via protobix (a Python lib). The values do get updated in Zabbix, however the triggers don't fire (the triggers check whether the last value is non-zero). This is happening in a LAN environment with an NVPS of about 37. I am on 3.0.4.

However, if I send the same data again, the triggers are generated.

Comment by Glebs Ivanovskis (Inactive) [ 2016 Sep 29 ]

Dear vkhanorkar, if it's a newly created trigger, you are probably not waiting long enough for the server to update its configuration from the database. I don't think your problem is related to this one.

<dimir> Moreover, this issue is already fixed (at least partly) in 3.0.4.

Comment by Veeresh K [ 2016 Sep 29 ]

@Glebs, let me explain further. The first time the item value changes from non-zero to zero or vice versa, the trigger is not updated. But the next time such a value is updated, the trigger fires immediately. Regarding waiting, the trigger was observed for about 3-4 minutes but was not updated. However, once the same value was sent again, the trigger was generated immediately.

Comment by Glebs Ivanovskis (Inactive) [ 2016 Sep 29 ]

Dear vkhanorkar, you didn't get my point. How much time passed between the creation of the trigger and the moment of sending the new value that was supposed to generate an event? If the server wasn't aware of the trigger's existence, that explains why it passed the value through without generating an event.

Comment by Alexander Vladishev [ 2016 Sep 30 ]

Closed as duplicate of ZBX-9640.
