[ZBX-4344] dependent event stuck in escalations Created: 2011 Nov 10  Updated: 2017 May 30  Resolved: 2015 Jul 29

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Server (S)
Affects Version/s: 1.8.8
Fix Version/s: 2.5.0

Type: Incident report
Priority: Major
Reporter: Robert Jerzak
Assignee: Unassigned
Resolution: Fixed
Votes: 27
Labels: dependencies, escalations, triggerdependencies
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by ZBX-10323 Notification still sent for Trigger a... Closed
is duplicated by ZBX-4835 Actions and triggers dependency problem Closed
is duplicated by ZBX-8667 Trigger dependencies ignored when a m... Closed
is duplicated by ZBX-5796 Escalation emails are not behaving as... Closed

 Description   

We have 2 items:

  • "server unreachable (icmp)" (item: icmpping)
  • "server unreachable (agent)" (item: agent.ping), which depends on "server unreachable (icmp)"

The host went down. The trigger "server unreachable (icmp)" fired as expected. The administrator acknowledged (ACK) this trigger after receiving an SMS about it.
Unfortunately, the trigger "server unreachable (agent)" entered escalation and got stuck there. Zabbix started sending SMSes to the administrator for the "server unreachable (agent)" trigger. There is no way to ACK this trigger: the Zabbix GUI shows only "server unreachable (icmp)", which is already acknowledged.

There was no "Deadlock found" message in the Zabbix log related to this issue.

It is not easy to replicate this issue. It happens from time to time.

Specification: http://www.zabbix.org/wiki/Docs/specs/ZBX-4344



 Comments   
Comment by richlv [ 2011 Nov 10 ]

are there trigger dependencies set ? can you see this trigger in monitoring -> triggers ? what if you set trigger status in filter to "any" ?

edit : duh, right, there are dependencies involved...

Comment by richlv [ 2011 Nov 11 ]

ok, so if i read this right, it might "work as expected"

in general, one should configure triggers so the more important ones (or the ones other triggers depend on) fire first - so in this case, making agent.ping less sensitive than icmpping. that would ensure agent.ping firing after icmpping if the whole system goes down.

although there probably should be some safeguard or mechanism to deal with dependent triggers having runaway escalations that can not be easily stopped

Comment by Robert Jerzak [ 2011 Nov 11 ]

Sorry for my English

Given the two items above, when I ACK "server unreachable (icmp)", should it mute all dependent triggers (including the dependent "server unreachable (agent)")?

In the GUI it works OK. In escalations something seems to be wrong.

Comment by Igor Danoshaites (Inactive) [ 2011 Nov 15 ]

>>Having this two above items, when I ACK "server unreachable (icmp)" should it mute all depended triggers (including dependent "server unreachable (agent)")?

Yes, if you have set the trigger dependencies correctly, this should mute ALL dependent triggers.

You can also check the content of your "escalations" table: select * from escalations;

Comment by Martin Brassard [ 2012 Mar 13 ]

We have some dependencies where this situation occurs: the escalation continues while the "depend on" trigger is in a problem state.

For example, we have a warning severity trigger if the hard drive space is at 80% and a high severity trigger at 90%. The warning trigger is dependent on the high severity one since it is obvious that both will be up if the latter is true.

Using this configuration, we ran into the situation reported: since the 80% trigger arrived before the 90% trigger, it fires first and starts sending notifications. After some time, the 90% trigger fires off and the 80% is no longer visible in the trigger list, as expected. Yet, we continued receiving the warning severity notifications.

Acknowledging the high severity problem did stop the escalation for the warning severity, but it would be good if the escalation looked at the dependent trigger before sending notification.

Since the situation is pretty close, I posted here instead of raising a new ticket for enhancement. Please inform if I should.

Comment by Ghozlane TOUMI [ 2012 Mar 28 ]

In fact, it seems that the escalation system doesn't account for trigger dependencies at each "operation". It only takes the trigger state at the moment the trigger fires.

example: A depends on B, and A fires before B

In the GUI, you'll see trigger A firing, and when B's item is checked, the trigger dependency works correctly and only B will appear.
In the escalation system, A fires => the escalation is launched for A and runs all its operations, even after B fires...

The escalation system should check trigger dependencies *before each operation*
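
A minimal sketch of what "check dependencies before each operation" could look like, written as a standalone C model rather than actual Zabbix code. All names here (`trigger_t`, `dependency_in_problem`, `should_send_step`) are hypothetical, and the model assumes each trigger depends on at most one other trigger and that there are no circular dependencies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified trigger model (not the Zabbix data structures). */
typedef struct trigger {
	bool in_problem;                  /* true while the trigger is in PROBLEM state */
	const struct trigger *depends_on; /* trigger this one depends on, or NULL */
} trigger_t;

/* Walk up the dependency chain; true if any trigger above is in PROBLEM state. */
static bool dependency_in_problem(const trigger_t *t)
{
	assert(t != NULL);
	for (const trigger_t *d = t->depends_on; d != NULL; d = d->depends_on)
		if (d->in_problem)
			return true;
	return false;
}

/* Evaluated right before every escalation step, not only when the trigger fires. */
static bool should_send_step(const trigger_t *t)
{
	assert(t != NULL);
	return t->in_problem && !dependency_in_problem(t);
}
```

With A depending on B, `should_send_step(&a)` is true while only A is in a problem state and becomes false as soon as B fires, which is the behavior requested above.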

Comment by richlv [ 2012 Apr 10 ]

might be related to ZBX-4835

Comment by Lucian Atody [ 2012 Apr 23 ]

Any news about this, please?

Comment by Corey Shaw [ 2012 May 19 ]

I have this exact same problem in 2.0.0rc4. Trigger A depends on Trigger B. If Trigger A goes off first it starts the action escalation process (my action sends an alert 5 minutes after the trigger went off). If Trigger B goes off, Trigger A disappears from the list of problems in the dashboard (like it should), but the action continues to send alerts for Trigger A (even though Trigger B trumps Trigger A).

Comment by Corey Shaw [ 2012 Jun 22 ]

Has there been any movement on this issue at all? This is killing us right now. It's preventing us from switching to Zabbix 100% from Nagios. I'm running on the stable 2.0.0 and am still experiencing this problem.

Comment by richlv [ 2012 Jun 22 ]

can anybody provide exact steps to reproduce what the problem is supposed to be ?
something starting from a clean state zabbix (virtual appliance vm will do), including creating a new host etc

Comment by Corey Shaw [ 2012 Jun 24 ]

Here's the basic steps:

1. Have a host to monitor that listens on port 80 and runs httpd (apache).
2. Create that host in Zabbix. For the purpose of this example, just make sure it has a zabbix agent interface.
3. Create an item on that host that monitors port 80 (net.tcp.service.perf[http,,80]). To increase the likelihood of seeing the issue presented in this ticket, set the interval for this item to something like 5 seconds.
4. Create another item on that host that checks the number of httpd processes (proc.num[httpd]). Set the interval for this item to 60 seconds.
5. Verify that data for both items is being received correctly.
6. Create a trigger on the host that goes off for proc.num[httpd].last(0)=0
7. Create a trigger on the host that goes off for net.tcp.service.perf[http,,80].last(0)=0. Make this trigger dependent on #6.
8. Create an action that uses escalations. For the purpose of this test, create an action that will take effect for all triggers and that has a repeated action every 2 minutes.
9. Verify that the action works as intended.
10. Now shut off httpd on the host that is being monitored.
11. After up to 5 seconds, notice that initially Zabbix detects port 80 as being down and that trigger goes off (as expected). The escalation for the action also begins.
12. After up to 60 seconds, notice that Zabbix detects httpd as not running on the host and sets off the trigger for that. Also notice that the trigger for port 80 disappears from the list on the dashboard (as it is supposed to). The escalation for the action begins for this trigger.
13. Notice after a few minutes that the escalation for #11 continues to go off even though the dependent trigger has gone off. Also, there is nowhere to acknowledge the problem (unless I just don't know about one). As a result, the escalation for #11 goes off until #12 is fixed.

What I expect to have happen is that the escalations in #11 would stop happening since the dependent trigger went off.

Comment by Eugene Istomin [ 2012 Jun 28 ]

We have the same problem.
Trigger dependencies work correctly in the web interface, but notifications are sent despite the trigger dependencies.

Corey Shaw's steps to reproduce seem to follow the same logic we have in our system.

Comment by Martin Brassard [ 2012 Aug 01 ]

I am not sure if this occurs on a fresh install, but we tried disabling the "depend on" trigger in the hope that we would then see the dependent trigger. This didn't work: the dependent trigger was still not visible, and the "depend on" trigger was removed from the list as well. We also encountered this behavior by chance on a live machine, and it caused us some headaches. We used a template and had one machine where the "depend on" service was deliberately not activated. When we disabled the item key that monitored that service, the trigger stopped showing in the trigger list, as intended. Yet we were surprised when the dependent service failed and we couldn't see it either: it took us a while to remember the disabled "depend on" item key.

Comment by Oleksii Zagorskyi [ 2012 Nov 21 ]

See also ZBX-5796: it's very similar, and I posted an idea there.

Comment by Alexander [ 2014 Jan 22 ]

Cross-posting (which I don't like, but that bug, ZBX-5796, gets much less attention).

I would support John here - to me it looks like a feature not working, i.e. a bug, rather than a feature request.

There's a promise that dependent triggers are suppressed if the dependency fires, and that makes a lot of sense for those "bunch of machines behind a router" or "bunch of services on a host" cases. But it definitely means the email (or all?) actions should be suppressed too, because it's just common sense - there's no value in suppressing the trigger in the frontend view if you still get 100 emails/messages.

Comment by Adam Bond [ 2014 Feb 13 ]

Same issue for me as the others have posted. My specific instance is for monitoring external WAN IP and Internal VPN IP addresses. VPN IP address is dependent on WAN IP being up. When they both go down, I can acknowledge the WAN trigger, but not the VPN trigger so then escalations will fire off on VPN trigger even though it is not necessary.

Comment by Dave Allaby [ 2014 Mar 21 ]

I am experiencing the same type of issue. In some instances I have Zabbix->Router->Router->Router->Server, and when I lose the first connection I am notified about everything below it, although in the web interface I only see the root cause. This is bad, since I am forwarding the notifications to staff who don't even have access to the web interface, so they are very confused.
Also, I am on version 2.0.9.
It has been suggested that I adjust timings, but that requires a very clean setup and complicates things, as I would have to adjust host timings that were set by a nice clean template. To me this adds a lot of complexity to a problem that should be addressed as a bug, since the dependencies work properly in the web interface.

Comment by Anton Samets [ 2014 Mar 23 ]

We also have this issue.
A trigger is alarming and its escalation is running; then the trigger it depends on fires. In the GUI all looks good, but the escalations keep going.

Comment by Alexander T [ 2014 Apr 08 ]

I support a call for fixing this nagging behavior inconsistency causing false positive emails.

Comment by richlv [ 2014 Jun 02 ]

comment from heaje on irc :

I'd say that the escalation for the "child" trigger should be canceled

Comment by Corey Shaw [ 2014 Jun 06 ]

Perhaps with the child trigger escalation being cancelled, the child trigger should also be forced to the OK state. I say this because it is entirely possible that the parent trigger could clear, but there could still be an issue with the child.

With the way things currently work, cancelling the escalation on the child trigger and leaving the trigger in the PROBLEM state could result in this scenario:
1. Let's say the child trigger becomes a PROBLEM and escalations start.
2. The parent trigger goes to the PROBLEM state later and the escalations for the child are cancelled.
3. Later the parent trigger goes to the OK state.
4. The child trigger still detects that there is an issue and stays in the PROBLEM state.
5. Since the child was already in the PROBLEM state, actions will not be re-evaluated for the child. As a result, no escalations will occur for the child issue.

As mentioned earlier, I propose that child triggers be forced to the OK state (or maybe even a new kind of state) when their parent is in the PROBLEM state.
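
The five-step scenario above can be replayed with a small C model. This is only a sketch under stated assumptions: the names (`state_t`, `set_child`, `set_parent`, `replay_scenario`) are hypothetical, and the model assumes that an escalation starts only on an OK to PROBLEM transition of the child and that a parent entering PROBLEM cancels it outright.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one parent/child trigger pair. */
typedef struct {
	bool child_problem;
	bool parent_problem;
	bool escalation_active; /* is the child's escalation still running? */
} state_t;

/* New child value arrives. Escalation starts only on an OK -> PROBLEM change. */
static void set_child(state_t *s, bool problem)
{
	assert(s != NULL);
	if (problem && !s->child_problem && !s->parent_problem)
		s->escalation_active = true;
	if (!problem)
		s->escalation_active = false;
	s->child_problem = problem;
}

/* "Cancel" policy: the parent entering PROBLEM cancels the child's escalation. */
static void set_parent(state_t *s, bool problem)
{
	assert(s != NULL);
	s->parent_problem = problem;
	if (problem)
		s->escalation_active = false;
}

/* Replays steps 1 to 4 of the scenario and returns the final state (step 5). */
static state_t replay_scenario(void)
{
	state_t s = { false, false, false };
	set_child(&s, true);   /* 1. child becomes PROBLEM, escalation starts */
	set_parent(&s, true);  /* 2. parent becomes PROBLEM, escalation cancelled */
	set_parent(&s, false); /* 3. parent returns to OK */
	set_child(&s, true);   /* 4. child is still PROBLEM: no state change */
	return s;
}
```

In this model the final state has the child in PROBLEM with no active escalation, which is the gap that forcing the child to OK (or to a new state) is meant to close.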

Comment by Ryan Armstrong [ 2014 Sep 02 ]

From ZBX-8667:
Perhaps I'm looking in the wrong place, but in the `check_escalation` function in `src/zabbix_server/escalator/escalator.c:check_escalation` (v2.2.5) I can see the server checks that the host, item and trigger are still valid but I don't see any reference to dependencies?

Comment by Aleksandrs Saveljevs [ 2014 Sep 16 ]

Issue ZBX-6174 is related.

Comment by Ryan Armstrong [ 2014 Sep 24 ]

I've tested this now in a virgin Zabbix 2.2.6 and 2.4 environment and the escalated actions still fire even if antecedent triggers are in a problem state. This does not function as documented in https://www.zabbix.com/wiki/doku.php?id=howto/config/alerts/delaying_notifications

For testing I created three hosts, 'server', 'vhost' and 'router'. All three have a single ping check and a single trigger if packet loss occurs.
I created three virtual IPs on the Zabbix server, one for each pretend server.
I created a single action to dump the alert message to a log file, as escalation step 2, after 2 minutes.

In initial testing everything worked perfectly because all three ping items fired at the same time, and I assume Zabbix processed the results and resolved dependencies as the results came in as a batch from 'fping'.

Our live environment currently pings 2100 servers and the pings do not happen in a single batch, but are instead staggered over the check interval. We have no control over the order of the staggered checks.

To replicate our live environment I changed the test server ping intervals to Server: 10s, VHost: 30s, Router: 60s.

Now if I drop all three interfaces together (like a site outage) the following happens (in chronological order) in the Zabbix interface:

  • In 'Events' an event is created for the Server
  • In 'Triggers' the Server ping trigger appears
  • A few seconds later an event is created for the VHost. The event for the Server remains in a Problem state (bad?)
  • In 'Triggers' the Server trigger disappears and is replaced by the VHost trigger (correct)
  • A few seconds later an event is created for the Router. The previous two events remain in a Problem state (bad?)
  • In 'Triggers' the VHost trigger is replaced by the Router trigger (nice!)
  • After the escalation period for Step 1 expires on the Server trigger, the Step 2 action is executed despite the Router and VHost being in a problem state (bad!)
  • The Step 2 action executes for the VHost (bad)...
  • The step 2 action executes for the Router (good)
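
As a rough illustration of the timeline above, the following C sketch models what would happen if the dependency chain were re-checked at the moment Step 2 is due. It is not Zabbix code: `host_t`, `ancestor_down` and `step2_notifies` are hypothetical names, and the detection times simply reuse the 10s/30s/60s check intervals as worst-case values, with Server depending on VHost and VHost depending on Router.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: one ping trigger per host, one parent per host. */
typedef struct host {
	const char *name;
	int detect_at;             /* seconds after the outage when the trigger fires */
	const struct host *parent; /* host this one depends on, or NULL */
} host_t;

/* True if the trigger of any host up the chain has already fired at time `now`. */
static bool ancestor_down(const host_t *h, int now)
{
	assert(h != NULL);
	for (const host_t *p = h->parent; p != NULL; p = p->parent)
		if (p->detect_at <= now)
			return true;
	return false;
}

/* Step 2 is due `delay` seconds after the host's own trigger fired. */
static bool step2_notifies(const host_t *h, int delay)
{
	assert(h != NULL);
	return !ancestor_down(h, h->detect_at + delay);
}
```

With a 120-second Step 2 delay, this check would suppress the Server and VHost notifications and let only the Router notification through.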

Ideally I feel the dependencies feature should work without the need for escalations but I'd be so happy if we could resolve this issue and get it working!

I know Nagios will re-execute all antecedent triggers (ignoring the schedule) before placing a dependent trigger into a problem state. This can cause check storms if there is a web of dependencies, but could this be a worthwhile direction? Or even simply forcing a dependent trigger to wait for the next scheduled parent check before entering a problem state? Or maybe an extra check of the parent trigger states when an escalation is about to fire?

Comment by Ryan Armstrong [ 2014 Oct 10 ]

The following patch works around this issue nicely for us (v2.2.6 - 7, doesn't work in 2.4.0):

diff -Naur zabbix-2.2.6/src/zabbix_server/escalator/escalator.c zabbix-patch/src/zabbix_server/escalator/escalator.c
--- zabbix-2.2.6/src/zabbix_server/escalator/escalator.c	2014-08-27 13:07:23.000000000 +0000
+++ zabbix-patch/src/zabbix_server/escalator/escalator.c	2014-10-09 02:46:38.644911081 +0000
@@ -1295,6 +1295,15 @@
 	}
 	DBfree_result(result);
 
+	if(EVENT_OBJECT_TRIGGER == object && FAIL == DCconfig_check_trigger_dependencies(escalation->triggerid)) {
+	
+	/* dependencies in problem state? */
+	
+		*error = zbx_dsprintf(*error, "An antecedent trigger of trigger " ZBX_FS_UI64 " is in a problem state\n%s", escalation->triggerid, NULL == *error ? "" : *error);
+		if(NULL != action)
+			action->actionid = 0;
+	}
+
 	zabbix_log(LOG_LEVEL_DEBUG, "End of %s() error:'%s'", __function_name, ZBX_NULL2STR(*error));
 }

It simply adds a dependency check prior to the execution of an escalation step. If an antecedent trigger is in a problem state, the escalation is cancelled and logged as a warning in the server log.

There are issues with this patch:

  • The dependency triggers/items/hosts are not checked to see if they have been disabled or deleted
  • If a dependency trigger exits a problem state, but the dependent is still in a problem state, the escalation may have already been cancelled and won't proceed.

Comment by callistix [ 2014 Nov 17 ]

I can confirm that this workaround doesn't work: https://www.zabbix.com/wiki/doku.php?id=howto/config/alerts/delaying_notifications
Are there any plans for a fix in version >2.4.2 ?
Currently trigger dependencies seem pretty much broken.

Comment by Maris Danne [ 2014 Nov 26 ]

We have the same problem; it affects versions 2.2.4, 2.2.7 and 2.4.2.
If the child host gets the problem value first (before the parent), then dependencies won't work even if there is an escalation, so trigger dependencies are useless for now.

Fix versions???

Comment by Petr Petrovic [ 2015 Feb 01 ]

Same problem here (Zabbix 2.4.3). Without working dependencies, Zabbix is useless in our situation, where we need to monitor (and get info about problems on) approx. 1000 dependent devices.

Is there any plan to fix this issue?

Zabbix has a fundamental bug that has been around for more than 3 years, and it seems that nobody cares.

Comment by richlv [ 2015 Apr 27 ]

ZBX-9523 is related

Comment by Felix Meier [ 2015 May 22 ]

We have the same problem.

Triggers that have been deactivated by a dependency (for example the IMAP/POP/HTTP/SMTP triggers depend on the ICMP trigger of a host and the ICMP trigger is up) will start an action anyway, and the full escalation of the action will be executed. This only happens if the IMAP/POP3/SMTP/etc. triggers are up before the ICMP trigger is activated.

I think I can reproduce it, if needed.

Comment by dimir [ 2015 Jun 16 ]

We are planning to fix this issue in trunk according to this specification (http://www.zabbix.org/wiki/Docs/specs/ZBX-4344). Please review it and make sure to leave a comment here if you have any questions, ideas or disagreements on anything.

Comment by Felix Meier [ 2015 Jun 16 ]

Sounds good. This is all we need

Comment by Corey Shaw [ 2015 Jun 16 ]

One aspect of the specification for this fix concerns me: the part that states " If such a trigger is found the original trigger must be enforced to value OK. If its previous value was PROBLEM a recovery event must be generated. " I fear this could end up causing confusion in the long run.

Sending a recovery for a trigger that depends on another would make it seem as if the trigger did in fact recover (because a recovery was sent). In this case, the trigger has NOT recovered. We just wanted to pause its escalations because they aren't important until the parent trigger is fixed. A recovery event is simply a lie.

If possible, perhaps it would be better to "pause" the escalations for triggers that depend on a parent trigger in a PROBLEM state. When the parent trigger goes back to OK, the escalations for the child trigger can pick up right where they left off because it is entirely possible that the child still has issues after the parent has been resolved.

After reading my previous comments on this ticket, I realize that last year I suggested doing exactly what the specification describes. I believe the issue I just raised means that forcing an OK status is not the best solution.
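
To make the "pause" idea concrete, here is a minimal C sketch of an escalation that holds its position while the parent is in a PROBLEM state and resumes from the same step afterwards. It is a model only, not the implemented fix: `escalation_t` and `escalation_tick` are hypothetical names, and the model assumes one call per escalation period.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical escalation state for a single child trigger. */
typedef struct {
	int next_step; /* next escalation step to run, 1-based */
	int sent;      /* notifications sent so far */
} escalation_t;

/*
 * Called once per escalation period. Returns true if a notification is sent.
 * While the parent is in PROBLEM the step counter is left untouched, so the
 * escalation picks up where it left off once the parent is back to OK.
 */
static bool escalation_tick(escalation_t *e, bool child_problem, bool parent_problem)
{
	assert(e != NULL);
	if (!child_problem)
	{
		e->next_step = 1; /* child recovered: escalation is over */
		return false;
	}
	if (parent_problem)
		return false; /* paused: nothing sent, step not advanced */
	e->next_step++;
	e->sent++;
	return true;
}
```

Unlike cancelling the escalation or forcing the child to OK, this keeps the child in PROBLEM and sends no recovery message, so nothing misleading reaches the recipients.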

Comment by dimir [ 2015 Jun 17 ]

Yeah, we considered "freezing" escalations at first but this solution looked too complicated, besides we thought generating an OK event would be more clear and "visual" to a user. But now as we discussed it again we see the "freezing" solution might be better. Please consider the updated doc.

Comment by dimir [ 2015 Jun 19 ]

Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBX-4344

Comment by Andris Zeila [ 2015 Jul 03 ]

Successfully tested

Comment by dimir [ 2015 Jul 27 ]

Fixed in pre-2.5.0 r54555.

Comment by Kay Schroeder [ 2016 Jan 29 ]

Will this issue be included in 3.0? Have not found it in the issue list of 3.0 Beta 1 or 2.

Comment by richlv [ 2016 Jan 30 ]

it was added during the alpha cycle - see the full changelog from svn checkout or at https://www.zabbix.org/websvn/wsvn/zabbix.com/trunk/ChangeLog
