  1. ZABBIX BUGS AND ISSUES
  2. ZBX-8667

Trigger dependencies ignored when a maintenance window expires

Details

    • Type: Incident report
    • Resolution: Duplicate
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: 2.2.5
    • Component/s: Server (S)
    • Labels: None

    Description

      We successfully use trigger dependencies and escalations to prevent notification storms. We monitor several hundred sites; at each site, the ping trigger of every server depends on the ping trigger of that site's router through a standard trigger dependency.

      We ping all devices every five minutes and use a two-step escalation on our actions with a step duration of 5 minutes 30 seconds (330 seconds), which allows enough time for trigger dependencies to be resolved before the action fires.

      This all works beautifully EXCEPT when both the dependent device and the device it depends on are still in a PROBLEM state when a maintenance period covering both devices expires. At that point, actions are created for both devices and the trigger dependency is seemingly ignored.
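
      As a rough illustration of the behaviour we expect, the dependency rule can be modelled as below. This is a simplified sketch, not Zabbix's actual implementation: the `Trigger` class and `should_notify` function are hypothetical names, and only direct dependencies are considered.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Trigger:
    """Hypothetical stand-in for a trigger; not a Zabbix data structure."""
    name: str
    in_problem: bool = False
    depends_on: List["Trigger"] = field(default_factory=list)


def should_notify(trigger: Trigger) -> bool:
    """Return True if we would expect a PROBLEM notification for this trigger."""
    if not trigger.in_problem:
        return False
    # Expected rule: stay silent while any trigger we depend on is in PROBLEM.
    return not any(parent.in_problem for parent in trigger.depends_on)


# Both devices are down; ServerA's trigger depends on RouterA's trigger.
router = Trigger("RouterA: Ping test failed", in_problem=True)
server = Trigger("ServerA: Ping test failed", in_problem=True, depends_on=[router])

print(should_notify(router))  # the router should raise a notification
print(should_notify(server))  # the server should be suppressed
```

      Under this model only RouterA should generate a notification, whether or not a maintenance window has just ended.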

      This behaviour has been consistent across a large number of incidents since we implemented this setup a few months ago, despite our best efforts and experimentation.

      Please allow me to detail the most recent occurrence:

      • RouterA and ServerA are pinged every five minutes
      • ServerA's ping trigger depends on RouterA's ping trigger
      • All ping items and triggers are applied via a common template
      • Trigger expression: `{Template ICMP Echo:…}=100` (the item key and function were lost when the issue text was rendered)

      • Both devices are covered by the same recurring maintenance window (with data collection) which starts at 5pm daily and expires at 7am the next day.
      • One action is defined to generate an incident notification with the following conditions:
          • Maintenance status is not maintenance
          • Trigger value = PROBLEM
          • Trigger = Template ICMP Echo: Ping test failed
      • The Action operation step duration is 330 seconds
      • The notification is generated as step 2 of the action.
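
      The timing this configuration is meant to produce can be sketched as follows. The sketch assumes that step 1 starts when the event is created and that each later step starts one step duration after the previous one; `step_start` is a hypothetical helper, and the real escalator may add some processing delay on top.

```python
from datetime import datetime, timedelta

# Step duration configured on the action operation.
STEP_DURATION = timedelta(seconds=330)


def step_start(event_time: datetime, step: int,
               step_duration: timedelta = STEP_DURATION) -> datetime:
    """Earliest start time of an escalation step (steps are numbered from 1)."""
    if step < 1:
        raise ValueError("escalation steps are numbered from 1")
    return event_time + (step - 1) * step_duration


# Example: an event created at 07:00:00 reaches step 2 no earlier than 07:05:30.
event_time = datetime(2014, 8, 27, 7, 0, 0)
print(step_start(event_time, 2))
```

      Since the devices are pinged every five minutes, a 330-second delay before step 2 leaves room for one more poll of the router before the notification goes out.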

      The most recent sequence of events is as follows:

      • Day 1
          • 26/08/14 17:00 Both devices go into maintenance mode; both are in the OK state
          • 26/08/14 18:54 ServerA has first 100% packet loss
          • 26/08/14 18:55 RouterA has first 100% packet loss
          • 26/08/14 19:04 ServerA trigger correctly switches to PROBLEM state
          • 26/08/14 19:05 RouterA trigger correctly switches to PROBLEM state
      • Day 2
          • 27/08/14 07:00 Both devices come out of maintenance mode; both have been in the PROBLEM state all night
          • 27/08/14 07:00 Events are created for ServerA and RouterA
          • 27/08/14 07:07:18 Step 2 notification is sent for ServerA (despite the trigger dependency on RouterA)
          • 27/08/14 07:07:24 Step 2 notification is sent for RouterA (which we expect)
          • 27/08/14 07:39 ServerA recovers (no packet loss)
          • 27/08/14 07:40 RouterA recovers (no packet loss)
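
      For reference, the notification delays implied by the Day 2 timestamps can be worked out with a short script. The event creation time is only recorded to the minute, so the `:00` seconds value used here is an assumption.

```python
from datetime import datetime

FMT = "%d/%m/%y %H:%M:%S"

# Event creation is logged as 07:00; the seconds are assumed to be :00.
event_created = datetime.strptime("27/08/14 07:00:00", FMT)

# Step 2 notification times taken from the sequence of events.
notifications = {
    "ServerA": datetime.strptime("27/08/14 07:07:18", FMT),
    "RouterA": datetime.strptime("27/08/14 07:07:24", FMT),
}

# Seconds between event creation and each step 2 notification.
delays = {
    host: int((sent - event_created).total_seconds())
    for host, sent in notifications.items()
}

# Seconds between the ServerA and RouterA notifications.
gap_seconds = int(
    (notifications["RouterA"] - notifications["ServerA"]).total_seconds()
)

print(delays)
print(gap_seconds)
```

      Both notifications went out well after the 330-second step duration had elapsed, and ServerA's was sent a few seconds before RouterA's.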

      Attachments

        1. Action Conditions.png (104 kB)
        2. Action Operation Steps.png (111 kB)
        3. Maintenance Window.png (91 kB)
        4. RouterA Event.png (199 kB)
        5. RouterA Packet Loss.png (135 kB)
        6. ServerA Event.png (221 kB)
        7. ServerA Packet Loss.png (134 kB)
        8. Trigger Dependency.png (119 kB)

            People

              Assignee: Unassigned
              Reporter: Ryan Armstrong (ryan.armstrong)