[#ZBX-4732] Events with wrong timestamp during high load on zabbix server -> wrong Availability report

[ZBX-4732] Events with wrong timestamp during high load on zabbix server -> wrong Availability report Created: 2012 Mar 06 Updated: 2017 May 30 Resolved: 2015 Aug 13
Status:	Closed
Project:	ZABBIX BUGS AND ISSUES
Component/s:	Frontend (F), Server (S)
Affects Version/s:	1.8.10
Fix Version/s:	None

Type:	Incident report
Priority:	Critical
Reporter:	Daniel Kontsek
Assignee:	Unassigned
Resolution:	Cannot Reproduce
Votes:
Labels:	nodata, timer
Remaining Estimate:	Not Specified
Time Spent:	Not Specified
Original Estimate:	Not Specified
Environment:	Linux (EL 6), Mysql (5.1)
Attachments:	O_new_trigger_error.jpg, e0.png, e1.png, zabbix_1.png, zabbix_2.png
Issue Links:	Duplicate

Description

Sometimes Zabbix produces false positive alerts (based on the trigger {agent.ping.nodata(180)}=1) during high I/O load on the server, e.g. while running a backup on the filesystem that holds the Zabbix database. We suspect this leads to events being stored in swapped order, probably because of wrong clock values, although the event IDs appear to be stored in the right order (please see the attached pictures). The Availability report then generates graphs showing hosts as mostly down.

Maybe this is somehow related to ZBX-4466.

Comments

Comment by Oleksii Zagorskyi [ 2012 Mar 07 ]

I think I ran into this same case several days ago while experimenting with a trigger expression that used the nodata(30-60) function.

./zabbix_server18 -V
Zabbix Server v1.8.11rc1 (revision 25522) (28 December 2011)
Compilation time: Feb 22 2012 10:42:33

Comment by Daniel Kontsek [ 2012 Mar 07 ]

It's mentioned in the bug report: {agent.ping.nodata(180)}

Comment by Oleksii Zagorskyi [ 2012 Mar 15 ]

A very similar issue is ZBX-4763; maybe the source of the problem is even the same.

<ADDED> Another similar issue is ZBX-6170.

Comment by Oleksii Zagorskyi [ 2012 Apr 14 ]

Heh, I would like to share my opinion.

In the pictures attached by Daniel (the issue reporter) we see several PROBLEM events repeated in a row. It's not clear why that happened; I cannot think of a reason.
Additionally, we don't know the update interval of that item. Maybe it is 180 seconds (the same as in the trigger function)?
But in the events we see that the OK event was generated exactly at the start of a minute. OK events can be generated only by the "db syncer" process, when a value is received and the trigger is in PROBLEM state.
PROBLEM events can be generated only by the "timer" process, when the trigger is in OK or UNKNOWN (because of a server restart) state.

So I suppose the data came from the item exactly at the start of a minute (processed by "db syncer"), and we know that "timer" is executed every 30 seconds, exactly at 00 and 30 seconds.

The events that Daniel draws attention to are less interesting to me than some other events.
For example:
10:16:00 - PROBLEM,
10:16:00 - OK
and
00:04:00 - PROBLEM,
00:04:01 - OK

I can show a clearer case. See the attached "O_new_trigger_error.jpg", where you will find all the details.
We see that the trigger was processed by two processes at almost the same time.
11:02:31 = eventID 10959666 - "timer" process generated PROBLEM event
11:02:30 = eventID 10959668 - "db syncer" process generated OK event

It is not clear here why "db syncer" decided that the trigger was in PROBLEM state. It's possible that "timer" had already changed it to PROBLEM (in some cache or in the table),
and "db syncer" probably changed the state to OK later. :/

So we should prevent such cases somehow.
The Zabbix server version in this last example is 1.8.6.
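
For anyone who wants to check their own event list for the same symptom, here is a minimal Python sketch. It uses only the two events quoted above and looks for consecutive event IDs whose clock goes backwards. The tuple layout and the function name find_clock_inversions are made up for illustration; this is not Zabbix code and it does not query the database.

```python
from datetime import datetime

# The two events quoted above: (eventid, time of day, value).
# The date is not given in the comment, so only the time of day is compared.
events = [
    (10959666, "11:02:31", "PROBLEM"),  # generated by "timer"
    (10959668, "11:02:30", "OK"),       # generated by "db syncer"
]


def find_clock_inversions(events):
    """Return pairs of consecutive events (by eventid) whose clock goes backwards."""
    ordered = sorted(events, key=lambda e: e[0])
    inversions = []
    for prev, curr in zip(ordered, ordered[1:]):
        prev_t = datetime.strptime(prev[1], "%H:%M:%S")
        curr_t = datetime.strptime(curr[1], "%H:%M:%S")
        if curr_t < prev_t:
            inversions.append((prev, curr))
    return inversions


inversions = find_clock_inversions(events)
for prev, curr in inversions:
    print(f"eventid {curr[0]} ({curr[2]}, {curr[1]}) has an earlier clock "
          f"than eventid {prev[0]} ({prev[2]}, {prev[1]})")
```

On the pair above it reports one inversion: the OK event has the higher event ID but the earlier clock.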

Comment by Daniel Kontsek [ 2012 Apr 17 ]

Item key: agent.ping
Item type: Zabbix Agent
Item update interval: 60 s
Trigger:

{agent.ping.nodata(180)}
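
To make the timing concrete, here is a rough Python sketch of how a nodata(180) check evaluated every 30 seconds behaves with a 60 s item when values stop arriving for a while. Everything in it is an assumption for illustration: the strict comparison, the arrival times, and the function names nodata, timer_ticks and simulate. It is not the server's actual logic.

```python
def nodata(last_value_clock, now, period):
    """Return 1 if no value arrived within the last `period` seconds, else 0.

    Assumption: the comparison is strict (>); the real server may differ.
    """
    return 1 if now - last_value_clock > period else 0


def timer_ticks(start, end, step=30):
    """Return evaluation times on every `step`-second boundary in [start, end]."""
    first = start + (-start) % step
    return range(first, end + 1, step)


def simulate(arrivals, end, period=180):
    """Evaluate nodata() at each tick using the latest value seen so far."""
    results = []
    for now in timer_ticks(0, end):
        seen = [a for a in arrivals if a <= now]
        if not seen:
            continue  # nothing received yet, nothing to evaluate
        results.append((now, nodata(seen[-1], now, period)))
    return results


# Hypothetical arrival clocks in seconds: one value every 60 s, then a 240 s gap
# standing in for a period of heavy I/O load.
arrivals = [0, 60, 120, 360, 420]
fired = [now for now, value in simulate(arrivals, 420) if value == 1]
print(fired)
```

With these made-up arrivals the check fires at exactly one tick, shortly before the next value arrives, which is the window where the two processes could touch the same trigger close together.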

Comment by Daniel Kontsek [ 2012 Jun 05 ]

Any news regarding this problem?

Comment by Cristian Mammoli [ 2014 Jun 06 ]

Hi, we are having the exact same problem (see attachments).

Zabbix 2.2.3, DB PostgreSQL 9.2

Comment by Roelof Spijker [ 2014 Sep 26 ]

We are seeing a very similar issue here on 2.2.3 with MySQL. Events are generated in the incorrect order. The real order would be: Up - Down for 1 second - Up. But they are ordered as Up - Up - Down for 1 second. This causes the SLA to record the service as down until the next issue occurs and is resolved. It's fixable by decreasing the clock in the DB for the events and service_alarms, but I'm not sure why it's happening in the first place.
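
A small Python sketch of why the swapped order matters for the report. It assumes that downtime is computed by replaying events in clock order; the event data, the period_end value and the downtime helper are hypothetical and only illustrate the effect, not how Zabbix actually calculates SLA.

```python
# Hypothetical events: (eventid, clock in seconds, value); 1 = PROBLEM, 0 = OK.
# Order of generation (by eventid): Up, Down, Up.
# The last OK event was stored with a clock earlier than the PROBLEM event.
events = [
    (1, 1000, 0),  # OK
    (2, 1061, 1),  # PROBLEM
    (3, 1060, 0),  # OK, generated after the PROBLEM but with an earlier clock
]


def downtime(events, sort_key, period_end):
    """Sum the seconds spent in PROBLEM state when events are replayed in the given order."""
    total = 0
    down_since = None
    for _, clock, value in sorted(events, key=sort_key):
        if value == 1 and down_since is None:
            down_since = clock
        elif value == 0 and down_since is not None:
            total += max(0, clock - down_since)  # clamp negative durations to 0
            down_since = None
    if down_since is not None:
        # Still in PROBLEM state at the end of the reporting period.
        total += period_end - down_since
    return total


period_end = 4600  # hypothetical end of the reporting period
by_clock = downtime(events, lambda e: (e[1], e[0]), period_end)
by_eventid = downtime(events, lambda e: e[0], period_end)
print(by_clock, by_eventid)
```

Replayed by clock, the PROBLEM event comes last, so the host counts as down for the rest of the period. Replayed by event ID, the outage collapses to almost nothing.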

Comment by Oleksii Zagorskyi [ 2015 Aug 13 ]

I feel that this issue is no longer relevant for recent Zabbix versions (2.4+).
So I'm closing it as ... hmmm ... cannot reproduce.

Feel free to ask to reopen it if you think I did the wrong thing.

Just note that ZBX-8556 and ZBX-9432 may look similar to this issue, but they are different.

Generated at Fri Apr 26 17:59:09 EEST 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.
