Type: Incident report
Resolution: Fixed
Priority: Major
Fix Version/s: 2.0.0
Affects Version/s: 1.8.3
Component/s: Server (S)
Labels:
- escalations

A trigger starts an action that has escalations configured. When the trigger returns to normal, the action is not aborted and continues with its escalations.
The item behind the trigger is OK, but the action still treats it as PROBLEM and keeps escalating. Some recovery information is simply lost.

The issue is easy to reproduce when zabbix_server has a lot of work to do, for example after restarting all our machines or restarting the zabbix_server process.

After a zabbix_server restart, all agent.ping triggers (agent.ping.nodata(120)}=1) fire. That is expected, because no data has come from the agents for a long time.
After some time, data from the agents arrives and the server should switch all triggers to OK status. That does not happen reliably, and some actions keep escalating indefinitely. Only truncating the 'escalations' table helps in this situation and breaks the faulty escalations.

Example from our infrastructure.
1. Stop the zabbix_server process for a couple of minutes, then start it again.
2. All agent.ping triggers fire (including the one for our machine called v144).
3. Most of the triggers change from PROBLEM to OK.
4. The trigger for v144 (triggerid 80442) also changes from PROBLEM to OK, but its rows are still in the escalations table:

mysql> select * from escalations where triggerid = '80442';
+--------------+----------+-----------+---------+-----------+------------+----------+--------+
| escalationid | actionid | triggerid | eventid | r_eventid | nextcheck  | esc_step | status |
+--------------+----------+-----------+---------+-----------+------------+----------+--------+
|         2883 |       10 |     80442 | 7134802 |         0 | 1285248956 |        5 |      0 |
|         2884 |       23 |     80442 | 7134802 |         0 | 1285248129 |       12 |      0 |
+--------------+----------+-----------+---------+-----------+------------+----------+--------+

In Monitoring->Events it is also still shown in PROBLEM status.
5. Truncating the escalations table (delete from escalations) helps and breaks the escalating actions.
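The nextcheck values in the query output for trigger 80442 look like Unix timestamps. Assuming they are seconds since the epoch (an assumption, not something the report states), a minimal sketch for decoding them into UTC might look like this. The helper name `to_utc` is made up for illustration.

```python
from datetime import datetime, timezone

# nextcheck values copied from the two escalation rows in the report,
# keyed by escalationid. Assumed to be Unix timestamps in seconds.
NEXTCHECK = {2883: 1285248956, 2884: 1285248129}


def to_utc(ts: int) -> datetime:
    """Convert a Unix timestamp (seconds) to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


decoded = {
    escalationid: to_utc(ts).strftime("%Y-%m-%d %H:%M:%S")
    for escalationid, ts in NEXTCHECK.items()
}

for escalationid, when in sorted(decoded.items()):
    print(escalationid, when, "UTC")
```

Under that assumption both values fall on 2010-09-23, the day the report was filed, which is consistent with the escalations still being scheduled while the trigger was already OK.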

Our system characteristics:
Number of hosts (monitored/not monitored/templates): 995 (903 / 34 / 58)
Number of items (monitored/disabled/not supported): 64046 (56209 / 7806 / 31)
Number of triggers (enabled/disabled)[problem/unknown/ok]: 24550 (22855 / 1695 [29 / 13 / 22813])
Required server performance, new values per second: 503.01

In other words, some recovery information is lost during heavy server activity.
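Since only some escalations get stuck, the workaround in step 5 (emptying the whole escalations table) also discards escalations that are still legitimate. A narrower cleanup could remove only the rows for the affected trigger. The sketch below illustrates the idea against an in-memory SQLite stand-in, not a real Zabbix database: the table has only the columns shown in the report, the third row is hypothetical, and the function name `delete_escalations_for_trigger` is made up for this example.

```python
import sqlite3

# In-memory stand-in for the escalations table. Only the columns visible in
# the report's query output are modelled; the real schema may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE escalations ("
    "escalationid INTEGER PRIMARY KEY, actionid INTEGER, triggerid INTEGER, "
    "eventid INTEGER, r_eventid INTEGER, nextcheck INTEGER, "
    "esc_step INTEGER, status INTEGER)"
)
rows = [
    # The two stuck rows from the report (triggerid 80442).
    (2883, 10, 80442, 7134802, 0, 1285248956, 5, 0),
    (2884, 23, 80442, 7134802, 0, 1285248129, 12, 0),
    # Hypothetical unrelated row, present only to show that it survives.
    (9001, 10, 12345, 7000000, 0, 1285249000, 1, 0),
]
conn.executemany("INSERT INTO escalations VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)
conn.commit()


def delete_escalations_for_trigger(db: sqlite3.Connection, triggerid: int) -> int:
    """Delete the escalation rows of one trigger and return how many were removed."""
    cur = db.execute("DELETE FROM escalations WHERE triggerid = ?", (triggerid,))
    db.commit()
    return cur.rowcount


removed = delete_escalations_for_trigger(conn, 80442)
remaining = [
    r[0]
    for r in conn.execute("SELECT escalationid FROM escalations ORDER BY escalationid")
]
print("removed:", removed, "remaining escalationids:", remaining)
```

Whether deleting rows this way is safe on a running server is not covered by the report. The sketch only shows the difference in scope between a per-trigger delete and a full truncate.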

Feel free to contact me. I can provide more detailed information if needed.

Attachments:
- action.png (106 kB, 2010 Sep 23 16:44)
- events.png (35 kB, 2010 Sep 23 16:44)
- trigger_status.png (32 kB, 2010 Sep 23 16:44)

Issue Links:
duplicates
- ZBX-4020 DB deadlock when using multiple discoverers (Closed)
- ZBX-990 Actions: Faulty Escalation / Delayed Recovery (Closed)

Assignee: Unassigned
Reporter: Robert Jerzak
Votes: 2
Watchers: 2
Created: 2010 Sep 23 16:44
Updated: 2017 May 30 18:24
Resolved: 2012 Jan 03 23:13
