[ZBX-4344] dependent event stuck in escalations Created: 2011 Nov 10 Updated: 2017 May 30 Resolved: 2015 Jul 29 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Server (S) |
Affects Version/s: | 1.8.8 |
Fix Version/s: | 2.5.0 |
Type: | Incident report | Priority: | Major |
Reporter: | Robert Jerzak | Assignee: | Unassigned |
Resolution: | Fixed | Votes: | 27 |
Labels: | dependencies, escalations, triggerdependencies | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified |
Issue Links: |
|
Description |
We have 2 items:
The host went down. The trigger "server unreachable (icmp)" lit up as it is supposed to. The administrator ACKed this trigger after receiving an SMS about it. There was no "Deadlock found" in the Zabbix log related to this issue. The issue is not easy to replicate; it happens from time to time. Specification: http://www.zabbix.org/wiki/Docs/specs/ZBX-4344 |
Comments |
Comment by richlv [ 2011 Nov 10 ] |
are there trigger dependencies set ? can you see this trigger in monitoring -> triggers ? what if you set trigger status in filter to "any" ? edit : duh, right, there are dependencies involved... |
Comment by richlv [ 2011 Nov 11 ] |
ok, so if i read this right, it might "work as expected" in general, one should configure triggers so the more important ones (or the ones other triggers depend on) fire first - so in this case, making agent.ping less sensitive than icmpping. that would ensure agent.ping firing after icmpping if the whole system goes down. although there probably should be some safeguard or mechanism to deal with dependent triggers having runaway escalations that can not be easily stopped |
Comment by Robert Jerzak [ 2011 Nov 11 ] |
Sorry for my English. Given the two items above, when I ACK "server unreachable (icmp)", should it mute all dependent triggers (including the dependent "server unreachable (agent)")? In the GUI it works OK. In escalations something seems to be wrong. |
Comment by Igor Danoshaites (Inactive) [ 2011 Nov 15 ] |
>>Having this two above items, when I ACK "server unreachable (icmp)" should it mute all depended triggers (including dependent "server unreachable (agent)")? Yes, if you have set trigger dependencies correctly, then this should mute ALL depended triggers. You can also check the content of your "escalations" table: select * from escalations; |
Comment by Martin Brassard [ 2012 Mar 13 ] |
We have some dependencies where this situation occurs: escalation continues when the "depend on" trigger is in problem. For example, we have a warning severity trigger if the hard drive space is at 80% and a high severity trigger at 90%. The warning trigger is dependent on the high severity one, since it is obvious that both will be up if the latter is true. Using this configuration, we ran into the situation reported: since the 80% threshold is reached before the 90% threshold, the warning trigger fires first and starts sending notifications. After some time, the 90% trigger fires and the 80% trigger is no longer visible in the trigger list, as expected. Yet we continued receiving the warning severity notifications. Acknowledging the high severity problem did stop the escalation for the warning severity, but it would be good if the escalation looked at the dependent trigger before sending a notification. Since the situation is pretty close, I posted here instead of raising a new ticket for enhancement. Please let me know if I should. |
Comment by Ghozlane TOUMI [ 2012 Mar 28 ] |
In fact, it seems that the escalation system doesn't account for trigger dependencies at each "operation"; it only takes the trigger state at the moment it fires. Example: A depends on B, and A fires before B. In the GUI, you'll see trigger A firing, and when B's item is checked, the trigger dependency works correctly and only B appears. The escalation system should check trigger dependencies *before each operation*. |
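A minimal sketch of the behaviour proposed in the comment above, written as a toy model and not as Zabbix code. It re-evaluates the dependencies of a trigger before every escalation operation instead of only when the trigger fires. All names here (`Trigger`, `is_suppressed`, `run_escalation`, `before_step`) are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Trigger:
    name: str
    in_problem: bool = False
    # Triggers this trigger depends on (its "parents").
    depends_on: List["Trigger"] = field(default_factory=list)


def is_suppressed(trigger: Trigger) -> bool:
    """True if any trigger this one depends on, directly or transitively, is in PROBLEM."""
    return any(p.in_problem or is_suppressed(p) for p in trigger.depends_on)


def run_escalation(trigger: Trigger, steps: List[str],
                   before_step: Optional[Callable[[int], None]] = None) -> List[str]:
    """Walk the escalation steps, checking dependencies before each operation."""
    sent = []
    for index, step in enumerate(steps):
        if before_step is not None:
            before_step(index)  # lets the caller change trigger states between steps
        if is_suppressed(trigger):
            continue  # a parent is in PROBLEM: skip this operation
        sent.append(step)
    return sent


# A depends on B, and A fires before B.
b = Trigger("B")
a = Trigger("A", in_problem=True, depends_on=[b])


def b_fires_before_second_step(index: int) -> None:
    if index == 1:
        b.in_problem = True


print(run_escalation(a, ["step 1", "step 2", "step 3"], b_fires_before_second_step))
```

In this model only the first operation is sent for A; once B goes into PROBLEM, the remaining operations for A are skipped, which matches what the GUI already shows.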
Comment by richlv [ 2012 Apr 10 ] |
might be related to |
Comment by Lucian Atody [ 2012 Apr 23 ] |
Any news about this, please? |
Comment by Corey Shaw [ 2012 May 19 ] |
I have this exact same problem in 2.0.0rc4. Trigger A depends on Trigger B. If Trigger A goes off first it starts the action escalation process (my action sends an alert 5 minutes after the trigger went off). If Trigger B goes off, Trigger A disappears from the list of problems in the dashboard (like it should), but the action continues to send alerts for Trigger A (even though Trigger B trumps Trigger A). |
Comment by Corey Shaw [ 2012 Jun 22 ] |
Has there been any movement on this issue at all? This is killing us right now. It's preventing us from switching to Zabbix 100% from Nagios. I'm running on the stable 2.0.0 and am still experiencing this problem. |
Comment by richlv [ 2012 Jun 22 ] |
can anybody provide exact steps to reproduce what the problem is supposed to be ? |
Comment by Corey Shaw [ 2012 Jun 24 ] |
Here's the basic steps: 1. Have a host to monitor that listens on port 80 and runs httpd (apache). What I expect to have happen is that the escalations in #11 would stop happening since the dependent trigger went off. |
Comment by Eugene Istomin [ 2012 Jun 28 ] |
We have the same problem. It seems that Corey Shaw's steps to reproduce follow the same logic we have in our system. |
Comment by Martin Brassard [ 2012 Aug 01 ] |
I am not sure if this occurs on a fresh install, but we tried to disable the "depend on" trigger in the hope that we would see the dependent trigger. This didn't work: the dependent trigger was still not visible, and the "depend on" trigger was removed as well. We also encountered this behavior by chance on a live machine and it caused us some headaches. We used a template and had one machine on which the "depend on" service was deliberately not activated. When we disabled the item key that monitored that service, the trigger stopped showing in the trigger list, as intended. Yet we were surprised when the dependent service failed and we couldn't see it either: it took us time to remember the disabled "depend on" item key. |
Comment by Oleksii Zagorskyi [ 2012 Nov 21 ] |
See also |
Comment by Alexander [ 2014 Jan 22 ] |
Cross-posting (which I don't like, but that bug, A would support John here - to me it looks rather like a feature not working, i.e. a bug, rather than a feature request. There's a promise that dependent triggers are suppressed if the dependency fires, and that makes a lot of sense for those "bunch of machines behind a router" or "bunch of services on a host" cases. But it definitely means the email (or all?) actions should be suppressed too, because it's just common sense - there's no value in a suppressed trigger notification view in the frontend if you get 100 emails/messages. |
Comment by Adam Bond [ 2014 Feb 13 ] |
Same issue for me as the others have posted. My specific instance is for monitoring external WAN IP and Internal VPN IP addresses. VPN IP address is dependent on WAN IP being up. When they both go down, I can acknowledge the WAN trigger, but not the VPN trigger so then escalations will fire off on VPN trigger even though it is not necessary. |
Comment by Dave Allaby [ 2014 Mar 21 ] |
I am experiencing the same type of issue. I have Zabbix->Router->Router->Router->Server in some instances, and when I lose the first connection I am notified about everything below it, although in the web interface I only see the root cause. This is bad, since I am forwarding the notifications to staff who don't even have access to the web interface, so they are very confused. |
Comment by Anton Samets [ 2014 Mar 23 ] |
We also have this issue. |
Comment by Alexander T [ 2014 Apr 08 ] |
I support a call for fixing this nagging behavior inconsistency causing false positive emails. |
Comment by richlv [ 2014 Jun 02 ] |
comment from heaje on irc :
|
Comment by Corey Shaw [ 2014 Jun 06 ] |
Perhaps with the child trigger escalation being cancelled, the child trigger should also be forced to the OK state. I say this because it is entirely possible that the parent trigger could clear, but there could still be an issue with the child. With the way things currently work, cancelling the escalation on the child trigger and leaving the trigger in the PROBLEM state could result in this scenario: As mentioned earlier, I propose that child triggers be forced to the OK state (or maybe even a new kind of state) when their parent is in the PROBLEM state. |
Comment by Ryan Armstrong [ 2014 Sep 02 ] |
From |
Comment by Aleksandrs Saveljevs [ 2014 Sep 16 ] |
Issue |
Comment by Ryan Armstrong [ 2014 Sep 24 ] |
I've tested this now in a virgin Zabbix 2.2.6 and 2.4 environment and the escalated actions still fire even if antecedent triggers are in a problem state. This does not function as documented in https://www.zabbix.com/wiki/doku.php?id=howto/config/alerts/delaying_notifications For testing I created three hosts, 'server', 'vhost' and 'router'. All three have a single ping check and a single trigger if packet loss occurs. In initial testing everything worked perfectly because all three ping items fired at the same time, and I assume Zabbix processed the results and resolved dependencies as the results came in as a batch from 'fping'. Our live environment currently pings 2100 servers and the pings do not happen in a single batch, but are instead staggered over the check interval. We have no control over the order of the staggered checks. To replicate our live environment I changed the test server ping intervals to Server: 10s, VHost: 30s, Router: 60s. Now if I drop all three interfaces together (like a site outage) the following happens (in chronological order) in the Zabbix interface:
Ideally, I feel the dependencies feature should work without the need for escalations, but I'd be so happy if we could resolve this issue and get it working! I know Nagios will re-execute all antecedent triggers (ignoring the schedule) before placing a dependent trigger into a problem state. This can cause check storms if there is a web of dependencies, but could this be a worthwhile direction? Or even simply forcing a dependent trigger to wait for the next scheduled parent check before entering a problem state? Or maybe an extra check of the parent trigger states when an escalation is about to fire? |
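A rough sketch of why staggered check intervals let a dependent trigger fire first, and of the "wait for the next scheduled parent check" idea from the comment above. It assumes that each host is checked at fixed multiples of its interval starting at t=0, that all three hosts fail at the same moment, and that 'server' depends on 'router'; none of these assumptions come from the Zabbix scheduler, and all function names are hypothetical.

```python
import math


def first_check_at_or_after(outage_time, interval, offset=0):
    """Return the time of the first scheduled check at or after the outage.

    Checks are assumed to run at offset, offset + interval, offset + 2 * interval, ...
    """
    if outage_time <= offset:
        return offset
    periods = math.ceil((outage_time - offset) / interval)
    return offset + periods * interval


def detection_times(outage_time, intervals):
    """Map each host to the time its first failing check would run."""
    return {host: first_check_at_or_after(outage_time, interval)
            for host, interval in intervals.items()}


def held_problem_time(own_detection, parent_detection):
    """Hypothetical rule: a dependent trigger waits for the parent's next check."""
    return max(own_detection, parent_detection)


# Intervals from the test setup described above; the outage time is made up.
intervals = {"server": 10, "vhost": 30, "router": 60}
times = detection_times(5, intervals)
for host, t in sorted(times.items(), key=lambda item: item[1]):
    print(f"{host}: first failed check at t={t}s")

# With the hold rule, 'server' would not enter PROBLEM before 'router' is re-checked.
print("server held until t =", held_problem_time(times["server"], times["router"]))
```

Under these assumptions the dependent host is detected as down well before the host it depends on, so any escalation that only looks at dependencies when the trigger fires starts before the parent trigger can suppress it.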
Comment by Ryan Armstrong [ 2014 Oct 10 ] |
The following patch works around this issue nicely for us (v2.2.6 - 7, doesn't work in 2.4.0):

diff -Naur zabbix-2.2.6/src/zabbix_server/escalator/escalator.c zabbix-patch/src/zabbix_server/escalator/escalator.c
--- zabbix-2.2.6/src/zabbix_server/escalator/escalator.c 2014-08-27 13:07:23.000000000 +0000
+++ zabbix-patch/src/zabbix_server/escalator/escalator.c 2014-10-09 02:46:38.644911081 +0000
@@ -1295,6 +1295,15 @@
  }
  DBfree_result(result);

+ if(EVENT_OBJECT_TRIGGER == object && FAIL == DCconfig_check_trigger_dependencies(escalation->triggerid)) {
+
+ /* dependencies in problem state? */
+
+ *error = zbx_dsprintf(*error, "An antecedent trigger of trigger %u is in a problem state\n%s", escalation->triggerid, error);
+ if(NULL != action)
+ action->actionid = 0;
+ }
+
  zabbix_log(LOG_LEVEL_DEBUG, "End of %s() error:'%s'", __function_name, ZBX_NULL2STR(*error));
  }

It simply adds a dependency check prior to the execution of an escalation step. If an antecedent trigger is in a problem state, the escalation is cancelled and logged as a warning in the server log. There are issues with this patch:
|
Comment by callistix [ 2014 Nov 17 ] |
I can confirm that this workaround doesn't work: https://www.zabbix.com/wiki/doku.php?id=howto/config/alerts/delaying_notifications |
Comment by Maris Danne [ 2014 Nov 26 ] |
We have the same problem. Affected versions: 2.2.4, 2.2.7 and 2.4.2. Fix versions??? |
Comment by Petr Petrovic [ 2015 Feb 01 ] |
Same problem here (Zabbix 2.4.3). Without working dependencies Zabbix is useless in our situation, where we need to monitor (and get info about problems on) approx. 1000 dependent devices. Is there any plan to fix this issue? Zabbix has a fundamental bug that has been around for more than 3 years and it seems that nobody cares. |
Comment by richlv [ 2015 Apr 27 ] |
|
Comment by Felix Meier [ 2015 May 22 ] |
We have the same problem. Triggers that have been deactivated by a dependency (for example the IMAP/POP/HTTP/SMTP triggers depend on the ICMP trigger of a host and the ICMP trigger is up) will start an action anyway, and the full escalation of the action will be executed. This only happens if the IMAP/POP3/SMTP/etc. triggers are up before the ICMP trigger is activated. I think I can reproduce it, if needed. |
Comment by dimir [ 2015 Jun 16 ] |
We are planning to fix this issue in trunk according to this specification. Please review it and make sure to leave a comment here if you have any questions, ideas or disagreements on anything. |
Comment by Felix Meier [ 2015 Jun 16 ] |
Sounds good. This is all we need |
Comment by Corey Shaw [ 2015 Jun 16 ] |
One aspect of the specification for this fix concerns me. The part that states " If such a trigger is found the original trigger must be enforced to value OK. If its previous value was PROBLEM a recovery event must be generated. " I fear could end up causing confusion in the long run. Sending a recovery for a trigger that depends on another would make it seem as if the trigger did in fact recover (because a recovery was sent). In this case, the trigger has NOT recovered. We just wanted to pause its escalations because they aren't important until the parent trigger is fixed. A recovery event is simply a lie. If possible, perhaps it would be better to "pause" the escalations for triggers that depend on a parent trigger in a PROBLEM state. When the parent trigger goes back to OK, the escalations for the child trigger can pick up right where they left off because it is entirely possible that the child still has issues after the parent has been resolved. After reading my previous comments on this ticket, I realize that last year I mentioned perhaps doing exactly what the specification mentions. This issue I just brought up I believe prevents forcing an OK status from being the best solution. |
Comment by dimir [ 2015 Jun 17 ] |
Yeah, we considered "freezing" escalations at first but this solution looked too complicated, besides we thought generating an OK event would be more clear and "visual" to a user. But now as we discussed it again we see the "freezing" solution might be better. Please consider the updated doc. |
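To make the difference between the two options discussed above concrete, here is a small toy model of the "freezing" approach: while the parent trigger is in PROBLEM the escalation sends nothing but keeps its position, and it picks up at the same step once the parent recovers. This is only an illustration of the idea, not how the fix is implemented in Zabbix; the `Escalation` class and its `tick` method are hypothetical.

```python
class Escalation:
    """Toy escalation that can be paused without losing its current step."""

    def __init__(self, steps):
        self.steps = list(steps)
        self.next_step = 0  # index of the next operation to send
        self.sent = []

    def tick(self, parent_in_problem):
        """Process one escalator cycle and return the step sent, or None."""
        if self.next_step >= len(self.steps):
            return None  # escalation already finished
        if parent_in_problem:
            return None  # frozen: next_step is kept, no recovery event is generated
        step = self.steps[self.next_step]
        self.next_step += 1
        self.sent.append(step)
        return step


escalation = Escalation(["notify admin", "notify manager", "notify director"])
escalation.tick(parent_in_problem=False)  # first operation goes out
escalation.tick(parent_in_problem=True)   # parent in PROBLEM: frozen
escalation.tick(parent_in_problem=True)   # still frozen
escalation.tick(parent_in_problem=False)  # parent recovered: resumes at step 2
print(escalation.sent)
```

Unlike forcing the child trigger to OK, nothing in this model tells the recipient that the child problem has recovered; the notifications simply stop while the parent problem is open.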
Comment by dimir [ 2015 Jun 19 ] |
Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBX-4344 |
Comment by Andris Zeila [ 2015 Jul 03 ] |
Successfully tested |
Comment by dimir [ 2015 Jul 27 ] |
Fixed in pre-2.5.0 r54555. |
Comment by Kay Schroeder [ 2016 Jan 29 ] |
Will this issue be included in 3.0? Have not found it in the issue list of 3.0 Beta 1 or 2. |
Comment by richlv [ 2016 Jan 30 ] |
it was added during the alpha cycle - see the full changelog from svn checkout or at https://www.zabbix.org/websvn/wsvn/zabbix.com/trunk/ChangeLog |