Type: Incident report
Resolution: Duplicate
Priority: Major
Fix Version/s: None
Affects Version/s: 2.2.5
Component/s: Server (S)
Labels: None

We are successfully using trigger dependencies and escalations to prevent notification storms. We monitor several hundred sites; at each site, the ping trigger status of every server depends on the ping trigger status of that site's router via a standard trigger dependency.

We ping all devices every five minutes and use a two-step escalation on our actions with a step duration of 5 minutes 30 seconds (330 seconds), which allows enough time for trigger dependencies to be resolved before the action fires.
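The timing reasoning above can be sketched as a small calculation. This is only an illustration, not Zabbix code: the helper `step_start` is hypothetical, the event time is a made-up example, and it assumes that step 1 runs at event time and each later step starts one step duration after the previous one.

```python
from datetime import datetime, timedelta

POLL_INTERVAL = timedelta(minutes=5)     # ping interval from the report
STEP_DURATION = timedelta(seconds=330)   # action step duration from the report


def step_start(event_time, step, step_duration=STEP_DURATION):
    """Earliest time escalation step `step` (1-based) can run.

    Assumption: step 1 runs at event time, step N runs (N - 1) durations later.
    """
    return event_time + (step - 1) * step_duration


# Hypothetical event time, used only to show the arithmetic.
event_time = datetime(2014, 1, 1, 12, 0, 0)
notify_at = step_start(event_time, 2)

# Slack left after one full polling cycle for the parent trigger to change state.
margin = STEP_DURATION - POLL_INTERVAL

print(notify_at.time(), margin.total_seconds())
```

With these values the step 2 notification cannot go out before one full polling cycle has passed, leaving 30 seconds of slack for the router's trigger to be evaluated.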

This all works beautifully EXCEPT when both the dependent and the antecedent device are still in a problem state at the moment a maintenance period (covering both devices) expires. When maintenance expires, actions are created for both devices and the trigger dependency is seemingly ignored.

This behaviour has been consistent across a large number of incidents since we implemented this setup a few months ago, despite our best efforts and experimentation.

Please allow me to detail the most recent occurrence:

RouterA and ServerA are pinged every five minutes

ServerA's ping trigger depends on RouterA's ping trigger

All ping items and triggers are applied via a common template

Trigger expression: `{Template ICMP Echo:…}=100`

Both devices are covered by the same recurring maintenance window (with data collection) which starts at 5pm daily and expires at 7am the next day.

One action is defined to generate an incident notification with the following conditions:

Maintenance status is not maintenance

Trigger value = PROBLEM

Trigger = Template ICMP Echo: Ping test failed

The Action operation step duration is 330 seconds

The notification is generated as step 2 of the action.
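The three action conditions listed above can be read as a single predicate over an event. The sketch below is an assumption-laden illustration, not how Zabbix evaluates actions: the `Event` class, its field names, and `action_matches` are hypothetical.

```python
from dataclasses import dataclass

TRIGGER_NAME = "Template ICMP Echo: Ping test failed"


@dataclass
class Event:
    """Hypothetical, simplified view of an event seen by the action."""
    trigger_name: str
    trigger_value: str      # "PROBLEM" or "OK"
    in_maintenance: bool


def action_matches(event):
    """True when all three conditions from the report hold for this event."""
    return (
        not event.in_maintenance
        and event.trigger_value == "PROBLEM"
        and event.trigger_name == TRIGGER_NAME
    )


during_window = Event(TRIGGER_NAME, "PROBLEM", in_maintenance=True)
after_window = Event(TRIGGER_NAME, "PROBLEM", in_maintenance=False)
print(action_matches(during_window), action_matches(after_window))
```

Note that none of these conditions refers to the trigger dependency, so an event for the dependent trigger that exists once maintenance ends satisfies them just as the router's event does.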

The most recent sequence of events is as follows:

Day 1

26/08/14 17:00 Both devices go into maintenance mode. Both in OK state

26/08/14 18:54 ServerA has first 100% packet loss

26/08/14 18:55 RouterA has first 100% packet loss

26/08/14 19:04 ServerA trigger correctly switches to PROBLEM state

26/08/14 19:05 RouterA trigger correctly switches to PROBLEM state

Day 2

27/08/14 07:00 Both devices come out of maintenance mode. Both have been in PROBLEM state continuously all night

27/08/14 07:00 Events are created for both ServerA and RouterA

27/08/14 07:07:18 Step 2 notification is sent for ServerA (despite trigger dependency on RouterA)

27/08/14 07:07:24 Step 2 notification is sent for RouterA (which we expect)

27/08/14 07:39 ServerA recovers (no packet loss)

27/08/14 07:40 RouterA recovers (no packet loss)
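The behaviour we expected at step 2 in the sequence above can be sketched as follows. This models the reporter's expectation only, not Zabbix internals; `should_notify`, the trigger labels, and the dictionaries are hypothetical names chosen for the example.

```python
PROBLEM, OK = "PROBLEM", "OK"


def should_notify(trigger, states, depends_on):
    """Expected rule: notify only if the trigger is in PROBLEM state and
    none of the triggers it depends on is in PROBLEM state."""
    if states[trigger] != PROBLEM:
        return False
    return all(states[parent] != PROBLEM for parent in depends_on.get(trigger, ()))


# Trigger states when the step 2 notifications were sent on 27/08/14.
states = {"RouterA ping": PROBLEM, "ServerA ping": PROBLEM}
depends_on = {"ServerA ping": ["RouterA ping"]}

expected = {name: should_notify(name, states, depends_on) for name in states}
print(expected)  # {'RouterA ping': True, 'ServerA ping': False}
```

Under this rule only RouterA should have produced a notification; in the observed sequence both RouterA and ServerA did.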

Attachments:
Action Conditions.png (104 kB, 2014 Aug 27 06:35)
Action Operation Steps.png (111 kB, 2014 Aug 27 06:35)
Maintenance Window.png (91 kB, 2014 Aug 27 06:35)
RouterA Event.png (199 kB, 2014 Aug 27 06:35)
RouterA Packet Loss.png (135 kB, 2014 Aug 27 06:35)
ServerA Event.png (221 kB, 2014 Aug 27 06:35)
ServerA Packet Loss.png (134 kB, 2014 Aug 27 06:35)
Trigger Dependency.png (119 kB, 2014 Aug 27 06:35)

Duplicates: ZBX-4344 dependent event stuck in escalations (Closed)

Assignee: Unassigned
Reporter: Ryan Armstrong
Votes: 0
Watchers: 1
Created: 2014 Aug 27 06:10
Updated: 2017 May 30 18:11
Resolved: 2014 Sep 24 08:32