-
Incident report
-
Resolution: Duplicate
-
Critical
-
None
-
3.2.1
Our server upgraded to 3.2.0 a month ago and a few days ago we got an issue. During the issue we upgraded to 3.2.1 but it did not help.
History of used versions: 1.8 2.2 2.4 3.0 3.2
We don't use any new feature (like events corellations etc).
Nothing has changed in zabbix close to the failure point.
DB server is separate and had not any issue (although had very high CPU user time some time ago).
But zabbix server host VM has migrated to another ESX because of some reason and after that the issue appeared.
History syncers were too busy often and for long period (maybe because of DebugLevel=4 all the time), it could be related.
(after decreased debug to 3, syncers became 5% busy, while with level 4 they were normally 70% busy).
We use GTID-based master-master replication on MySQL. (mysql Ver 15.1 Distrib 10.0.25-MariaDB, for Linux (x86_64) using readline 5.1)
During investigation it gets clear that many triggers are in PROBLEM state (value=1 in "triggers") generate OK events each time when get new value or when being recalculated by timer.
In the same time these triggers DO NOT have any record in "problem" table.
In the same time these triggers have OK/PROBLEM events in "events" table.
The picture looks this way - if you visit Problems page - you see nothing bad, i.e you see nothing, in any view mode.
on dashboard you can see problem state for many triggers in host groups,
on Last 20 issues many triggers being updated constantly and their Age gets small all the time.
on "Triggers top 100" report you can see huge "Number of status changes" for these triggers.
Difference in debug log:
Affected zabbix:
17023:20161108:153349.436 End of DCget_nextid() table:'events' [96:95]
17023:20161108:153349.436 query [txnlev:1] [insert into events (eventid,source,object,objectid,clock,ns,value) values (95,0,0,13569,1478612028,511579021,0);
]
17023:20161108:153349.436 In process_actions() events_num:1
17023:20161108:153349.436 In zbx_dc_get_actions_eval()
17023:20161108:153349.436 End of zbx_dc_get_actions_eval() actions:1
17023:20161108:153349.436 End of process_actions()
17023:20161108:153349.436 In DBupdate_itservices()
17023:20161108:153349.436 End of DBupdate_itservices():SUCCEED
17023:20161108:153349.436 End of process_trigger_events() processed:1
17023:20161108:153349.436 In zbx_save_trigger_changes()
17023:20161108:153349.436 query [txnlev:1] [update triggers set lastchange=1478612028 where triggerid=13569;
]
another, not affected zabbix:
6242:20161108:103022.270 End of DCget_nextid() table:'events' [85:84] 6242:20161108:103022.270 query [txnlev:1] [insert into events (eventid,source,object,objectid,clock,ns,value) values (84,0,0,13569,1478593821,591608632,0); ] 6242:20161108:103022.271 query [txnlev:1] [insert into event_recovery (eventid,r_eventid,correlationid,c_eventid,userid) values (83,84,null,null,null); ] 6242:20161108:103022.271 query [txnlev:1] [update problem set r_eventid=84,r_clock=1478593821,r_ns=591608632,userid=0 where eventid=83; ] 6242:20161108:103022.272 In process_actions() events_num:1 6242:20161108:103022.272 In zbx_dc_get_actions_eval() 6242:20161108:103022.272 End of zbx_dc_get_actions_eval() actions:1 6242:20161108:103022.272 query [txnlev:1] [select actionid,eventid,escalationid from escalations where eventid=83] 6242:20161108:103022.272 query [txnlev:1] [update escalations set r_eventid=84 where escalationid=1; ]
We resolved the issue using this direct SQL:
update triggers set value=0 where value=1 and triggerid not in (select distinct objectid from problem);
Question why this issue appeared is still opened, but I think about 2 points:
1. In any conditions it should not happen that "problem" table is not consistent with triggers and/or events.
2. Zabbix server should be able to "survive" that if it still happens.
This issue is somehow similar to ZBX-11426, but still scenario how it happened is different.
Below is some collected information, just in case:
MariaDB [zabbix]> SELECT *, FROM_UNIXTIME(clock) FROM events WHERE objectid=129403 AND clock>=unix_timestamp('2016-11-08 09:55:00') AND clock<=unix_timestamp('2016-11-08 10:00:00') ORDER BY clock DESC; +-----------+--------+--------+----------+------------+-------+--------------+-----------+----------------------+ | eventid | source | object | objectid | clock | value | acknowledged | ns | FROM_UNIXTIME(clock) | +-----------+--------+--------+----------+------------+-------+--------------+-----------+----------------------+ | 279907078 | 0 | 0 | 129403 | 1478595600 | 0 | 0 | 26280341 | 2016-11-08 10:00:00 | | 279906763 | 0 | 0 | 129403 | 1478595574 | 0 | 0 | 6126234 | 2016-11-08 09:59:34 | | 279906657 | 0 | 0 | 129403 | 1478595570 | 0 | 0 | 977226025 | 2016-11-08 09:59:30 | | 279906092 | 0 | 0 | 129403 | 1478595544 | 0 | 0 | 993842012 | 2016-11-08 09:59:04 | ... | 279901329 | 0 | 0 | 129403 | 1478595334 | 0 | 0 | 65403047 | 2016-11-08 09:55:34 | | 279900981 | 0 | 0 | 129403 | 1478595330 | 0 | 0 | 249855840 | 2016-11-08 09:55:30 | | 279900642 | 0 | 0 | 129403 | 1478595304 | 0 | 0 | 64580586 | 2016-11-08 09:55:04 | | 279900299 | 0 | 0 | 129403 | 1478595300 | 0 | 0 | 438725918 | 2016-11-08 09:55:00 | +-----------+--------+--------+----------+------------+-------+--------------+-----------+----------------------+
MariaDB [zabbix]> select * from triggers where triggerid=129403\G *************************** 1. row *************************** triggerid: 129403 expression: {321462}=0 or {321463}=1 description: ESXi is unreachable url: http://some status: 0 value: 1 priority: 5 lastchange: 1478601964 comments: error: templateid: 129305 type: 0 state: 0 flags: 0 recovery_mode: 0 recovery_expression: correlation_mode: 0 correlation_tag: manual_close: 1 1 row in set (0.00 sec)
MariaDB [zabbix]> SELECT DATE(FROM_UNIXTIME(p.clock)) AS date, HOUR(FROM_UNIXTIME(p.clock)) AS hour, p.source, count(*) FROM problem p JOIN triggers t ON p.objectid = t.triggerid WHERE p.object='0' AND p.source='0' AND p.clock>=unix_timestamp('2016-11-01 00:00:00') AND p.clock<=unix_timestamp('2016-11-08 18:00:00') GROUP BY 1,2,3 ORDER BY p.clock DESC,p.eventid DESC; +------------+------+--------+----------+ | date | hour | source | count(*) | +------------+------+--------+----------+ | 2016-11-08 | 10 | 0 | 160 | | 2016-11-08 | 9 | 0 | 280 | ... | 2016-11-07 | 11 | 0 | 392 | | 2016-11-07 | 10 | 0 | 431 | | 2016-11-07 | 9 | 0 | 104 | | 2016-11-07 | 8 | 0 | 60 | | 2016-11-07 | 7 | 0 | 61 | | 2016-11-07 | 6 | 0 | 61 | | 2016-11-07 | 5 | 0 | 60 | ... | 2016-11-06 | 9 | 0 | 60 | | 2016-11-06 | 8 | 0 | 60 | | 2016-11-06 | 7 | 0 | 61 | | 2016-11-06 | 6 | 0 | 60 | | 2016-11-06 | 5 | 0 | 60 | | 2016-11-06 | 4 | 0 | 46 | | 2016-11-05 | 22 | 0 | 1 | | 2016-11-05 | 14 | 0 | 1 | | 2016-11-05 | 6 | 0 | 1 | | 2016-11-05 | 2 | 0 | 1 | ... | 2016-11-03 | 11 | 0 | 1 | | 2016-11-03 | 9 | 0 | 1 | | 2016-11-02 | 21 | 0 | 2 | | 2016-11-02 | 16 | 0 | 1 | | 2016-11-01 | 9 | 0 | 1 | +------------+------+--------+----------+
MariaDB [zabbix]> SELECT DATE(FROM_UNIXTIME(e.clock)) AS date, HOUR(FROM_UNIXTIME(e.clock)) AS hour, e.source, count(*) FROM events e JOIN triggers t ON e.objectid = t.triggerid WHERE e.object='0' AND e.source='0' AND e.clock>=unix_timestamp('2016-11-01 00:00:00') AND e.clock<=unix_timestamp('2016-11-08 18:00:00') GROUP BY 1,2,3 ORDER BY e.clock DESC,e.eventid DESC; +------------+------+--------+----------+ | date | hour | source | count(*) | +------------+------+--------+----------+ | 2016-11-08 | 10 | 0 | 46782 | | 2016-11-08 | 9 | 0 | 72477 | | 2016-11-08 | 8 | 0 | 72392 | | 2016-11-08 | 7 | 0 | 72258 | ... | 2016-11-04 | 13 | 0 | 53050 | | 2016-11-04 | 12 | 0 | 53467 | | 2016-11-04 | 11 | 0 | 53062 | | 2016-11-04 | 10 | 0 | 38379 | | 2016-11-04 | 9 | 0 | 3502 | | 2016-11-04 | 8 | 0 | 510 | | 2016-11-04 | 7 | 0 | 424 | | 2016-11-04 | 6 | 0 | 325 | ... | 2016-11-03 | 19 | 0 | 447 | | 2016-11-03 | 18 | 0 | 515 | | 2016-11-03 | 17 | 0 | 531 | +------------+------+--------+----------+
MariaDB [zabbix]> select from_unixtime(min(clock)) from problem; +---------------------------+ | from_unixtime(min(clock)) | +---------------------------+ | 2015-11-09 13:47:09 | +---------------------------+
MariaDB [zabbix]> select from_unixtime(min(clock)) from events where objectid=129403; +---------------------------+ | from_unixtime(min(clock)) | +---------------------------+ | 2016-02-17 15:19:04 | +---------------------------+
MariaDB [zabbix]> select count(*) from problem where objectid=129403; +----------+ | count(*) | +----------+ | 0 | +----------+
MariaDB [zabbix]> select count(*) from triggers where value=1 and triggerid not in (select distinct objectid from events); +----------+ | count(*) | +----------+ | 573 | +----------+
MariaDB [zabbix]> select count(*) from triggers where value=1 and triggerid not in (select distinct objectid from events where source=0 and value=1); +----------+ | count(*) | +----------+ | 577 | +----------+
MariaDB [zabbix]> update triggers set value=0 where value=1 and triggerid not in (select distinct objectid from problem); Query OK, 887 rows affected (1 min 36.75 sec) Rows matched: 887 Changed: 887 Warnings: 0
MariaDB [zabbix]> select description, from_unixtime(lastchange) from triggers where value=1 and triggerid not in (select distinct objectid from events); Empty set (2.70 sec)
- duplicates
-
ZBX-11454 Force OK events to update trigger value even if there are no open problems
- Closed