ZABBIX BUGS AND ISSUES / ZBX-3033

Faulty Escalation/Recovery events


    • Type: Incident report
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: 2.0.0
    • Affects Version/s: 1.8.3
    • Component/s: Server (S)

      A trigger starts an action that has escalations configured. When the trigger goes back to normal, the action is not aborted and continues with its escalations.
      The item behind the trigger is OK, but the action still treats it as PROBLEM and keeps escalating. Some recovery information is simply lost.

      The issue is easy to reproduce when zabbix_server has a lot of work to do, for example after restarting all our machines or restarting the zabbix_server process.

      After a zabbix_server restart, all agent.ping triggers ({agent.ping.nodata(120)}=1) fire. That is expected, since no data has come from the agents for a long time.
      After some time, data from the agents arrives and the server should switch all triggers to OK status. This does not happen as expected, and some actions keep escalating indefinitely. Only truncating the 'escalations' table helps in this situation and stops the faulty escalations.
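The trigger logic described above can be sketched in a few lines. This is only an illustration, assuming that nodata(120) evaluates to 1 when no value has been received during the last 120 seconds and to 0 otherwise. The function names, the boundary handling and the example timestamp are hypothetical and are not taken from the Zabbix source.

```python
def nodata(last_value_clock, now, sec):
    """Return 1 if no value arrived within the last `sec` seconds, else 0.

    Simplified model of the nodata() trigger function (an assumption,
    not the server's actual implementation).
    """
    if last_value_clock is None:
        return 1
    return 1 if now - last_value_clock > sec else 0


def trigger_status(last_value_clock, now, sec=120):
    # Expression from the report: nodata(120) = 1 means PROBLEM.
    return "PROBLEM" if nodata(last_value_clock, now, sec) == 1 else "OK"


now = 1285248000  # arbitrary example timestamp, not from the report
print(trigger_status(now - 600, now))  # last value is 10 minutes old
print(trigger_status(now - 30, now))   # agent data has arrived again
```

Under this model the trigger itself recovers as soon as fresh data arrives, which matches the report: the trigger returns to OK, and only the escalation stays behind.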

      Example from our infrastructure:
      1. stop the zabbix_server process for a couple of minutes, then start it again
      2. all agent.ping triggers fire (including the one for our machine called v144)
      3. most of the triggers change from PROBLEM to OK
      4. the trigger for v144 (triggerid 80442) also changes from PROBLEM to OK, but its rows are still present in the escalations table:

      mysql> select * from escalations where triggerid = '80442';
      +--------------+----------+-----------+---------+-----------+------------+----------+--------+
      | escalationid | actionid | triggerid | eventid | r_eventid | nextcheck  | esc_step | status |
      +--------------+----------+-----------+---------+-----------+------------+----------+--------+
      |         2883 |       10 |     80442 | 7134802 |         0 | 1285248956 |        5 |      0 |
      |         2884 |       23 |     80442 | 7134802 |         0 | 1285248129 |       12 |      0 |
      +--------------+----------+-----------+---------+-----------+------------+----------+--------+

      In Monitoring->Events the event is also still shown in PROBLEM status.
      5. truncating the escalations table (delete from escalations) helps and stops the faulty escalations.
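A less drastic diagnostic than truncating the table would be to list only the escalations that look stuck. The sketch below rebuilds the two rows from the report in an in-memory SQLite database and selects escalations whose trigger is already OK but which have no recovery event. The escalations columns are the ones shown above. The triggers table with a value column (0 = OK, 1 = PROBLEM), the reading of r_eventid = 0 as "no recovery event recorded", and the helper name find_stuck are assumptions made for illustration. The query has not been run against a real Zabbix database.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE triggers (triggerid INTEGER PRIMARY KEY, value INTEGER);
    CREATE TABLE escalations (
        escalationid INTEGER PRIMARY KEY, actionid INTEGER, triggerid INTEGER,
        eventid INTEGER, r_eventid INTEGER, nextcheck INTEGER,
        esc_step INTEGER, status INTEGER);
""")
# Trigger 80442 is back to OK (assumed encoding: 0 = OK, 1 = PROBLEM).
conn.execute("INSERT INTO triggers VALUES (80442, 0)")
# The two escalation rows shown in the report.
conn.executemany(
    "INSERT INTO escalations VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [(2883, 10, 80442, 7134802, 0, 1285248956, 5, 0),
     (2884, 23, 80442, 7134802, 0, 1285248129, 12, 0)],
)

# Escalations still open (no recovery event) although the trigger is OK.
STUCK_QUERY = """
    SELECT e.escalationid, e.actionid, e.esc_step, e.nextcheck
    FROM escalations e
    JOIN triggers t ON t.triggerid = e.triggerid
    WHERE t.value = 0 AND e.r_eventid = 0
    ORDER BY e.escalationid
"""


def find_stuck(connection):
    """Return stuck escalations with nextcheck rendered as a UTC timestamp."""
    rows = connection.execute(STUCK_QUERY).fetchall()
    return [
        (esc_id, action_id, step,
         datetime.fromtimestamp(nextcheck, tz=timezone.utc).isoformat())
        for esc_id, action_id, step, nextcheck in rows
    ]


stuck = find_stuck(conn)
for row in stuck:
    print(row)
```

Both rows from the report match this filter. Decoding nextcheck also shows that both escalations were scheduled on 2010-09-23 (UTC), a few minutes apart.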

      Our system characteristics:
      Number of hosts (monitored/not monitored/templates): 995 (903 / 34 / 58)
      Number of items (monitored/disabled/not supported): 64046 (56209 / 7806 / 31)
      Number of triggers (enabled/disabled)[problem/unknown/ok]: 24550 (22855 / 1695) [29 / 13 / 22813]
      Required server performance, new values per second: 503.01
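To put the load figures in perspective, the following back-of-the-envelope calculation uses only the numbers above. It assumes that the "required server performance" value is roughly the sum of 1 / update interval over all monitored items. That interpretation is an assumption here, not something stated in the report.

```python
monitored_items = 56209  # from the report
required_nvps = 503.01   # from the report, new values per second

# Assumption: NVPS is approximately sum(1 / update_interval) over monitored
# items, so items / NVPS is the single interval that would give the same load.
equivalent_interval_s = monitored_items / required_nvps
values_per_day = required_nvps * 86400

print(f"equivalent update interval: {equivalent_interval_s:.1f} s")
print(f"new values per day: {values_per_day:,.0f}")
```

In other words, the server has to process on the order of tens of millions of new values per day, which is the kind of load under which the lost recovery events were observed.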

      In other words, some recovery information is lost while the server is under heavy load.

      Feel free to contact me. I can provide more detailed information if needed.

        1. action.png (106 kB)
        2. events.png (35 kB)
        3. trigger_status.png (32 kB)

            Assignee: Unassigned
            Reporter: Robert Jerzak (rob)
            Votes: 2
            Watchers: 2
