-
Incident report
-
Resolution: Fixed
-
Major
-
1.8.3
A trigger starts an action that has escalations configured. When the trigger returns to normal, the action is not aborted and continues with its escalations.
The item behind the trigger is OK, but the action still treats the trigger as PROBLEM and keeps escalating. Some recovery information is simply lost.
The issue is easy to reproduce when zabbix_server has a lot of work to do, for example after restarting all our machines or restarting the zabbix_server process.
After a zabbix_server restart, all agent.ping triggers (agent.ping.nodata(120)}=1) fire. That is expected, since no data has arrived from the agents for a long time.
After some time, data arrives from the agents and the server should switch all triggers to OK status. This does not happen, and some actions keep escalating all the time. Only truncating the 'escalations' table helps in this situation and breaks the faulty escalations.
Example from our infrastructure:
1. Stop the zabbix_server process for a couple of minutes, then start it again.
2. All agent.ping triggers fire (including the one for our machine called v144).
3. Most triggers change from PROBLEM to OK.
4. The trigger for v144 (triggerid 80442) also changes from PROBLEM to OK, but its rows are still in the escalations table:
mysql> select * from escalations where triggerid = '80442';
+--------------+----------+-----------+---------+-----------+------------+----------+--------+
| escalationid | actionid | triggerid | eventid | r_eventid | nextcheck  | esc_step | status |
+--------------+----------+-----------+---------+-----------+------------+----------+--------+
|         2883 |       10 |     80442 | 7134802 |         0 | 1285248956 |        5 |      0 |
|         2884 |       23 |     80442 | 7134802 |         0 | 1285248129 |       12 |      0 |
+--------------+----------+-----------+---------+-----------+------------+----------+--------+
Monitoring->Events also still shows it in PROBLEM status.
5. Truncating the escalations table (delete from escalations) helps and breaks the stuck escalations.
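The check in steps 4 and 5 can be sketched as a query that lists only the escalations whose trigger is already OK, so the whole table does not have to be truncated. This is a rough illustration, not a verified fix: it runs against an in-memory SQLite copy with a trimmed-down, hypothetical schema, it assumes that triggers.value = 0 means OK, and it assumes that r_eventid = 0 means no recovery event was recorded (as the output above suggests). Trigger 80443 and escalation 2885 are made-up rows added for contrast.

```python
import sqlite3

# Hypothetical, trimmed-down copies of the two tables; the real schema has more columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE triggers (triggerid INTEGER PRIMARY KEY, value INTEGER);
CREATE TABLE escalations (
    escalationid INTEGER PRIMARY KEY, actionid INTEGER, triggerid INTEGER,
    eventid INTEGER, r_eventid INTEGER, nextcheck INTEGER,
    esc_step INTEGER, status INTEGER);
INSERT INTO triggers VALUES (80442, 0);  -- v144 agent.ping trigger, assumed 0 = OK
INSERT INTO triggers VALUES (80443, 1);  -- made-up trigger still in PROBLEM
INSERT INTO escalations VALUES (2883, 10, 80442, 7134802, 0, 1285248956, 5, 0);
INSERT INTO escalations VALUES (2884, 23, 80442, 7134802, 0, 1285248129, 12, 0);
INSERT INTO escalations VALUES (2885, 10, 80443, 7134900, 0, 1285248200, 1, 0);
""")

# Escalations that are still open (no recovery event) although the trigger is OK.
STUCK_SQL = """
SELECT e.escalationid, e.actionid, e.triggerid, e.esc_step
FROM escalations e
JOIN triggers t ON t.triggerid = e.triggerid
WHERE t.value = 0 AND e.r_eventid = 0
ORDER BY e.escalationid
"""

# Narrower alternative to truncating the table: remove only the stuck rows.
DELETE_SQL = """
DELETE FROM escalations
WHERE r_eventid = 0
  AND triggerid IN (SELECT triggerid FROM triggers WHERE value = 0)
"""

def find_stuck(conn):
    """Return (escalationid, actionid, triggerid, esc_step) for stuck rows."""
    return conn.execute(STUCK_SQL).fetchall()

def delete_stuck(conn):
    """Delete the stuck rows and return how many were removed."""
    return conn.execute(DELETE_SQL).rowcount

for row in find_stuck(conn):
    print(row)
```

The two SQL statements should carry over to MySQL largely unchanged if those column assumptions hold. Running the SELECT first shows what the DELETE would remove, and legitimate escalations for triggers that are still in PROBLEM stay untouched.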
Our system characteristics:
Number of hosts (monitored/not monitored/templates) 995 903 / 34 / 58
Number of items (monitored/disabled/not supported) 64046 56209 / 7806 / 31
Number of triggers (enabled/disabled)[problem/unknown/ok] 24550 22855 / 1695 [29 / 13 / 22813]
Required server performance, new values per second 503.01 -
In other words, some recovery information is lost during heavy server activity.
Feel free to contact me. I can provide more detailed information if needed.