- Incident report
- Resolution: Duplicate
- Major
- None
- 2.2.5
- None
We are successfully using trigger dependencies and escalations to prevent notification storms. We monitor several hundred sites, and at each site the ping trigger of every server depends on the ping trigger of that site's router through a standard trigger dependency.
We ping all devices every five minutes and use a two-step escalation on our actions with a delay of 5 minutes 30 seconds, which allows enough time for trigger dependencies to be resolved before the action fires.
This all works beautifully EXCEPT when both the dependent and the antecedent device are still in a problem state when a maintenance period (covering both devices) expires. When maintenance expires, actions are created for both devices and the trigger dependency is seemingly ignored.
This behaviour has been consistent across a large number of incidents since we implemented this setup a few months ago, despite our best efforts and experimentation.
Please allow me to detail the most recent occurrence:
- RouterA and ServerA are pinged every five minutes
- ServerA's ping trigger depends on RouterA's ping trigger
- All ping items and triggers are applied via a common template
- Trigger expression: `{Template ICMP Echo}=100`
- Both devices are covered by the same recurring maintenance window (with data collection) which starts at 5pm daily and expires at 7am the next day.
- One action is defined to generate an incident notification with the following conditions:
  - Maintenance status is not maintenance
  - Trigger value = PROBLEM
  - Trigger = Template ICMP Echo: Ping test failed
- The Action operation step duration is 330 seconds
- The notification is generated as step 2 of the action.
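To make the timing above concrete, here is a minimal Python sketch of when step 2 can start at the earliest. It assumes every escalation step uses the same 330 second duration and that step 1 begins when the event is created; the names `step_start`, `POLL_INTERVAL` and `STEP_DURATION` are illustrative only and are not part of Zabbix.

```python
from datetime import datetime, timedelta

# Values taken from the setup described above.
POLL_INTERVAL = timedelta(minutes=5)     # ping interval for all devices
STEP_DURATION = timedelta(seconds=330)   # action operation step duration


def step_start(event_time, step, step_duration=STEP_DURATION):
    """Nominal start time of escalation step `step` (1-based).

    Assumption: all steps share one duration and step 1 starts at event time.
    """
    return event_time + (step - 1) * step_duration


# Event created when maintenance expired on Day 2.
event = datetime(2014, 8, 27, 7, 0, 0)
print(step_start(event, 2))           # 2014-08-27 07:05:30
print(STEP_DURATION > POLL_INTERVAL)  # True
```

Because the step duration is longer than the ping interval, at least one further ping result for the router should arrive before step 2 is due, which is the window we rely on for the dependency to be evaluated.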
The most recent sequence of events is as follows:
- Day 1
  - 26/08/14 17:00 Both devices go into maintenance mode. Both in OK state
  - 26/08/14 18:54 ServerA has first 100% packet loss
  - 26/08/14 18:55 RouterA has first 100% packet loss
  - 26/08/14 19:04 ServerA trigger correctly switches to PROBLEM state
  - 26/08/14 19:05 RouterA trigger correctly switches to PROBLEM state
- Day 2
  - 27/08/14 07:00 Both devices come out of maintenance mode. Both in consistent PROBLEM state all night
  - 27/08/14 07:00 Event is created for ServerA and RouterA
  - 27/08/14 07:07:18 Step 2 notification is sent for ServerA (despite trigger dependency on RouterA)
  - 27/08/14 07:07:24 Step 2 notification is sent for RouterA (which we expect)
  - 27/08/14 07:39 ServerA recovers (no packet loss)
  - 27/08/14 07:40 RouterA recovers (no packet loss)
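The behaviour we expected at 07:00 on Day 2 can be expressed as a small Python model. This is only a sketch of our expectation, not Zabbix's actual implementation: the function `should_notify`, the trigger names, and the assumption that dependencies are followed transitively are all hypothetical.

```python
PROBLEM, OK = "PROBLEM", "OK"


def should_notify(trigger, states, depends_on):
    """Return True if a PROBLEM notification is expected for `trigger`.

    Assumed rule: a trigger in PROBLEM is suppressed when any trigger it
    depends on, directly or transitively, is also in PROBLEM.
    """
    if states[trigger] != PROBLEM:
        return False
    seen = set()
    stack = list(depends_on.get(trigger, ()))
    while stack:
        parent = stack.pop()
        if parent in seen:
            continue  # guard against dependency cycles
        seen.add(parent)
        if states.get(parent) == PROBLEM:
            return False
        stack.extend(depends_on.get(parent, ()))
    return True


# State of both triggers when maintenance expired on Day 2.
states = {"RouterA ping": PROBLEM, "ServerA ping": PROBLEM}
depends_on = {"ServerA ping": ["RouterA ping"]}

expected = [t for t in states if should_notify(t, states, depends_on)]
print(expected)  # ['RouterA ping']
```

Under this model only RouterA should have produced a notification. What we observed instead was a notification for both ServerA and RouterA.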
- duplicates ZBX-4344 dependent event stuck in escalations (Closed)