[ZBX-8175] Escalations become "stuck" and keep going after triggers recover (rarely, at random) Created: 2014 May 06 Updated: 2019 Dec 10 |
|
Status: | Open |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Server (S) |
Affects Version/s: | 2.2.2 |
Fix Version/s: | None |
Type: | Incident report | Priority: | Trivial |
Reporter: | Corey Shaw | Assignee: | Unassigned |
Resolution: | Unresolved | Votes: | 7 |
Labels: | escalations |
Remaining Estimate: | Not Specified |
Time Spent: | Not Specified |
Original Estimate: | Not Specified |
Issue Links: |
|
Description |
Occasionally an escalation will appear to get "stuck" in an active state even after the trigger associated with it has long since recovered. Most commonly, this results in continual emails/alerts from Zabbix even though there is no longer a problem. The only way that I've found to resolve it is to delete the row in the escalations table for that escalation. Another user reported to volter in #zabbix that they fixed it by disabling and then re-enabling the associated trigger. I have not seen this issue myself in quite a while, but I have seen it come up in #zabbix on freenode several times. |
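For anyone hitting this, a diagnostic query can show which escalations refer to a trigger that has already returned to OK, before deleting anything. This is only a sketch against the Zabbix 2.2 schema (the `escalations` and `triggers` tables and their columns may differ in your version), not an official procedure; back up first, and preferably stop the server while editing the table:

```sql
-- List escalations whose associated trigger is already in OK state (value = 0).
-- Column names are per the Zabbix 2.2 schema; verify against your installation.
SELECT e.escalationid, e.actionid, e.triggerid, e.esc_step, e.status
FROM escalations e
JOIN triggers t ON t.triggerid = e.triggerid
WHERE t.value = 0;

-- The workaround from the description: remove a stuck row by its id.
-- <stuck_id> is a placeholder for an escalationid found by the query above.
DELETE FROM escalations WHERE escalationid = <stuck_id>;
```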
Comments |
Comment by Aleksandrs Saveljevs [ 2014 May 06 ] |
One related problem with escalations was fixed earlier. Apparently, it does not work perfectly yet. Could you please provide more information about the problem? For instance, how much time passes between a PROBLEM event and an OK event? Is the system heavily loaded? How often does the problem occur, and would it be possible to run with DebugLevel=4 for a while? |
Comment by Anton Samets [ 2014 May 06 ] |
maybe |
Comment by Volker Fröhlich [ 2014 May 06 ] |
I can't see how it could be connected. |
Comment by Corey Shaw [ 2014 May 06 ] |
Unfortunately (fortunately?), I haven't seen the issue myself in quite a while, so I don't have any decent data to give; however, it appears to keep happening to other people. As people in #zabbix bring up the issue, I'll direct them to this ticket to provide the requested information. |
Comment by Elvar [ 2014 May 12 ] |
I have had this happen twice in the last two weeks. I'll try to get more information next time, but here is what I have. I started receiving alerts for the following trigger, "icmpping[,1].nodata(3m)}=1", which pings about 10 hosts on a remote network via a Zabbix proxy. Despite the condition clearing up rather quickly, I continued receiving alerts every hour as if I had not acknowledged the alert. When I initially went to acknowledge, I saw that the problem had already cleared, so there was nothing to acknowledge. In both instances in the last two weeks, the hosts were behind proxies. The load on both the main Zabbix server and the proxy was very low. Again, I'll try to get more information next time. I'm running Zabbix version 2.2.2 on both the main server and the proxy involved. The way I stopped the alerts from coming through was by disabling the template with the trigger in question, which resulted in the following: NOTE: Escalation cancelled: trigger 'No icmp data received in the last 3 min' disabled. |
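The quoted expression appears to have lost its opening brace and host reference in transit. In Zabbix 2.2 trigger syntax, a complete trigger of this shape would read roughly as follows; the host name here is a made-up placeholder, not the reporter's actual host:

```
{proxied-host:icmpping[,1].nodata(3m)}=1
```

That is: fire (=1) when no value has arrived for the icmpping item in the last 3 minutes.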
Comment by Elvar [ 2014 May 12 ] |
The other interesting thing is that I do not see a single problem event at the time the nodata alert came in. About 2.5 hours before this started, the site where the main Zabbix server sits was offline for about an hour. |
Comment by Fernando Schmitt [ 2014 Nov 06 ] |
Still a problem in 2.4.1. Also, disabling and enabling the affected triggers doesn't work. |
Comment by Ronald Schaten [ 2014 Dec 16 ] |
I see the problem in 2.4.2; the only remedy seems to be "delete from escalations" in the database. We upgraded from 2.0.4 to 2.4.2 two weeks ago. The problem didn't exist in 2.0.4. Since the upgrade it has happened five or six times, and it always happens with the same trigger on the same host. So this problem affects only one of our >500 hosts (almost all of them sitting behind proxies). There are three other hosts configured similarly to the affected one (same OS, same agent, behind the same proxy), but they are not affected. |
Comment by Ronald Schaten [ 2014 Dec 16 ] |
Another detail that might be helpful: the systems in question (all four of them) are rebooted daily, within a maintenance period. The faulty messages start at the end of the maintenance period. |
Comment by Aleksandrs Saveljevs [ 2014 Dec 16 ] |
If proxies are involved, then the related issue fixed in 2.4.3 may apply. |
Comment by Ronald Schaten [ 2014 Dec 16 ] |
Thanks for the comment. Yes, that looks like our problem. You suggest an update to 2.4.3 would fix this? |
Comment by Aleksandrs Saveljevs [ 2014 Dec 16 ] |
Yes. Note that only the server upgrade is necessary - no need to upgrade proxies. Please let us know your results after the upgrade. |
Comment by Ronald Schaten [ 2014 Dec 16 ] |
Thanks again. I'll try, and I'll report. But that will take a few days... |
Comment by Ronald Schaten [ 2015 Jan 29 ] |
Looks as if the update actually solved the problem. While it happened three times within two weeks of running 2.4.2, it didn't happen even once in four weeks of running 2.4.3. To me, it looks as if the problem is fixed. |
Comment by jaseywang [ 2015 Mar 06 ] |
Quite easy to reproduce: |
Comment by richlv [ 2015 Mar 06 ] |
If your escalations are processing recovery messages, maybe you are missing the "trigger value = PROBLEM" action condition? |
Comment by jaseywang [ 2015 Mar 07 ] |
@richlv, I have set that condition. |
Comment by richlv [ 2015 Jul 24 ] |
jaseywang, but in that case there should be no recovery message at all. In your case, I'd still suspect an action misconfiguration. |
Comment by richlv [ 2015 Jul 24 ] |
|
Comment by Haralds Jakovels [ 2015 Sep 11 ] |
I can confirm that the problem is still present in 2.4.6. Since there were 46 triggers, with 46 e-mails sent every 10 minutes, going through the list and manually disabling/enabling each one would not have been convenient or fast, so I wiped them out of the db table "escalations": mysql -pcensored -e "use zabbix; delete from escalations where esc_step=26;select row_count();" as described in https://www.zabbix.com/forum/showthread.php?t=38521 This happened for the first time after a cluster service relocation, so it is quite rare; relocation had been done 20-30 times before. |
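A slightly more cautious variant of the cleanup above is to count the matching rows first and run the delete inside a transaction, so an unexpected match count can be rolled back. This is only a sketch for MySQL with InnoDB tables; the esc_step=26 filter is taken from the comment above and will differ per installation:

```sql
START TRANSACTION;

-- Inspect how many rows would be removed before committing to it.
SELECT COUNT(*) FROM escalations WHERE esc_step = 26;

DELETE FROM escalations WHERE esc_step = 26;
SELECT ROW_COUNT();  -- number of rows actually deleted

-- COMMIT if the counts look right, otherwise ROLLBACK.
COMMIT;
```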
Comment by richlv [ 2016 Apr 15 ] |
ZBX-7200 talks about sending out a bit too many messages when an action is disabled, but that does not seem to be related to the problem described here.