[ZBX-8175] Escalations become "stuck" and keep going after triggers recover (rarely, at random) Created: 2014 May 06 Updated: 2019 Dec 10 |
|
Status: | Open |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Server (S) |
Affects Version/s: | 2.2.2 |
Fix Version/s: | None |
Type: | Incident report | Priority: | Trivial |
Reporter: | Corey Shaw | Assignee: | Unassigned |
Resolution: | Unresolved | Votes: | 7 |
Labels: | escalations |
Remaining Estimate: | Not Specified |
Time Spent: | Not Specified |
Original Estimate: | Not Specified |
Issue Links: |
|
Description |
Occasionally an escalation will appear to get "stuck" in an active state even after the trigger associated with it has long since recovered. Most commonly, this results in continual emails/alerts from Zabbix even though there is no longer a problem. The only way that I've found to resolve it is to delete the row in the escalations table for that escalation. Another user reported to volter in #zabbix that they fixed it by disabling and then re-enabling the associated trigger. I have not seen this issue myself in quite a while, but I have seen it come up in #zabbix on freenode several times. |
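For anyone hitting this, a diagnostic query can show which escalations refer to a trigger that has already returned to OK, before deleting anything. This is only a sketch against the Zabbix 2.2 schema (the `escalations` and `triggers` tables and their columns may differ in your version), not an official procedure; back up first, and preferably stop the server while editing the table:

```sql
-- List escalations whose associated trigger is already in OK state (value = 0).
-- Column names are per the Zabbix 2.2 schema; verify against your installation.
SELECT e.escalationid, e.actionid, e.triggerid, e.esc_step, e.status
FROM escalations e
JOIN triggers t ON t.triggerid = e.triggerid
WHERE t.value = 0;

-- The workaround from the description: remove a stuck row by its id.
-- <stuck_id> is a placeholder for an escalationid found by the query above.
DELETE FROM escalations WHERE escalationid = <stuck_id>;
```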
Comments |
Comment by Aleksandrs Saveljevs [ 2014 May 06 ] |
One related problem with escalations was fixed earlier. Apparently, it does not work perfectly yet. Could you please provide more information about the problem? For instance, how much time passes between a PROBLEM event and an OK event? Is the system heavily loaded? How often does the problem occur, and would it be possible to run with DebugLevel=4 for a while? |
Comment by Anton Samets [ 2014 May 06 ] |
maybe |
Comment by Volker Fröhlich [ 2014 May 06 ] |
I can't see how it could be connected. |
Comment by Corey Shaw [ 2014 May 06 ] |
Unfortunately (fortunately?), I haven't seen the issue myself in quite a while, so I don't have any decent data to give; however, it appears to keep happening to other people. As people in #zabbix bring up the issue, I'll direct them to this ticket to provide the requested information. |
Comment by Elvar [ 2014 May 12 ] |
I have had this happen twice in the last two weeks. I'll try to get more information next time, but here is what I have. I started receiving alerts for the following trigger, "icmpping[,1].nodata(3m)}=1", which pings about 10 hosts on a remote network via a Zabbix proxy. Despite the condition clearing up rather quickly, I continued receiving alerts every hour as if I had not acknowledged the alert. When I initially went to acknowledge, I saw that the problem had already cleared, so there was nothing to acknowledge. In both instances in the last two weeks, the hosts were behind proxies. The load on both the main Zabbix server and the proxy was very low. Again, I'll try to get more information next time. I'm running Zabbix version 2.2.2 on both the main server and the proxy involved. The way I stopped the alerts from coming through was by disabling the template with the trigger in question, which resulted in the following: NOTE: Escalation cancelled: trigger 'No icmp data received in the last 3 min' disabled. |
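The quoted expression appears to have lost its opening brace and host reference in transit. In Zabbix 2.2 trigger syntax, a complete trigger of this shape would read roughly as follows; the host name here is a made-up placeholder, not the reporter's actual host:

```
{proxied-host:icmpping[,1].nodata(3m)}=1
```

That is: fire (=1) when no value has arrived for the icmpping item in the last 3 minutes.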
Comment by Elvar [ 2014 May 12 ] |
The other interesting thing is that I do not see a single problem event at the time the nodata alert came in. About 2.5 hours before this started, the site where the main Zabbix server sits was offline for about an hour. |
Comment by Fernando Schmitt [ 2014 Nov 06 ] |
Still a problem in 2.4.1. Also, disabling and enabling the affected triggers doesn't work. |
Comment by Ronald Schaten [ 2014 Dec 16 ] |
I see the problem in 2.4.2; the only remedy seems to be "delete from escalations" in the database. We upgraded from 2.0.4 to 2.4.2 two weeks ago. The problem didn't exist in 2.0.4. Since the upgrade it has happened five or six times, and it always happens with the same trigger on the same host. So this problem affects only one of our >500 hosts (almost all of them sitting behind proxies). There are three other hosts configured similarly to the affected one (same OS, same agent, behind the same proxy), but they are not affected. |
Comment by Ronald Schaten [ 2014 Dec 16 ] |
Another detail that might be helpful: the systems in question (all four of them) are rebooted daily, within a maintenance period. The faulty messages start at the end of the maintenance period. |
Comment by Aleksandrs Saveljevs [ 2014 Dec 16 ] |
If proxies are involved, then the related issue fixed in 2.4.3 may apply. |
Comment by Ronald Schaten [ 2014 Dec 16 ] |
Thanks for the comment. Yes, that looks like our problem. You suggest an update to 2.4.3 would fix this? |
Comment by Aleksandrs Saveljevs [ 2014 Dec 16 ] |
Yes. Note that only the server upgrade is necessary - no need to upgrade proxies. Please let us know your results after the upgrade. |
Comment by Ronald Schaten [ 2014 Dec 16 ] |
Thanks again. I'll try, and I'll report. But that will take a few days... |
Comment by Ronald Schaten [ 2015 Jan 29 ] |
Looks as if the update actually solved the problem. While it happened three times within two weeks of running 2.4.2, it didn't happen even once in four weeks of running 2.4.3. To me, it looks as if the problem is fixed. |
Comment by jaseywang [ 2015 Mar 06 ] |
Quite easy to reproduce: |
Comment by richlv [ 2015 Mar 06 ] |
If your escalations are processing recovery messages, maybe you are missing the "trigger value = PROBLEM" action condition? |
Comment by jaseywang [ 2015 Mar 07 ] |
@richlv, I have set that condition. |
Comment by richlv [ 2015 Jul 24 ] |
jaseywang, but in that case there should be no recovery message at all. In your case, I'd still suspect an action misconfiguration. |
Comment by richlv [ 2015 Jul 24 ] |
|
Comment by Haralds Jakovels [ 2015 Sep 11 ] |
I can confirm that the problem is still present in 2.4.6. Since there were 46 triggers, with 46 e-mails sent every 10 minutes, going through the list and manually disabling/enabling each one would not have been convenient or fast, so I wiped them out of the db table "escalations": mysql -pcensored -e "use zabbix; delete from escalations where esc_step=26;select row_count();" as described in https://www.zabbix.com/forum/showthread.php?t=38521 This happened for the first time after a cluster service relocation, so it is quite rare; relocation had been done 20-30 times before. |
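A slightly more cautious variant of the cleanup above is to count the matching rows first and run the delete inside a transaction, so an unexpected match count can be rolled back. This is only a sketch for MySQL with InnoDB tables; the esc_step=26 filter is taken from the comment above and will differ per installation:

```sql
START TRANSACTION;

-- Inspect how many rows would be removed before committing to it.
SELECT COUNT(*) FROM escalations WHERE esc_step = 26;

DELETE FROM escalations WHERE esc_step = 26;
SELECT ROW_COUNT();  -- number of rows actually deleted

-- COMMIT if the counts look right, otherwise ROLLBACK.
COMMIT;
```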
Comment by richlv [ 2016 Apr 15 ] |
ZBX-7200 talks about sending out a bit too many messages when an action is disabled, but that does not seem to be related to the problem described here.