[ZBX-7484] Escalations still escalate even when trigger is OK Created: 2013 Dec 04 Updated: 2017 May 30 Resolved: 2014 Apr 08 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Server (S) |
Affects Version/s: | 2.0.10rc1, 2.2.1rc1 |
Fix Version/s: | 2.0.11rc1, 2.2.2rc1, 2.3.0 |
Type: | Incident report | Priority: | Major |
Reporter: | Alexey Pustovalov | Assignee: | Unassigned |
Resolution: | Fixed | Votes: | 1 |
Labels: | escalations |
Remaining Estimate: | Not Specified |
Time Spent: | Not Specified |
Original Estimate: | Not Specified |
Attachments: | performance.png |
Issue Links: |
|
Description |
It happens only on large installations, and only with triggers that have either several items in the expression or one item combined with a time-related function. |
Comments |
Comment by Aleksandrs Saveljevs [ 2013 Dec 09 ] |
The problem seems to happen on very busy systems where one trigger is processed simultaneously by two processes (e.g., two history syncer processes, or one history syncer and one timer). Namely, consider function process_actions() in src/zabbix_server/actions.c. Suppose one process, call it A, discovers that a trigger is in a PROBLEM state and starts an escalation. It then goes on processing other events. At the same time, process B processes another value used in that trigger and discovers that the trigger became OK after that PROBLEM. It sets the trigger to OK, inserts the event, and does a select from the database to see whether there are any started escalations (actions.c:1264). However, since A is busy and has not committed its transaction yet, B does not see the escalation started by A, and does not stop it. So, in the end, the trigger is correctly in the OK state and events are correctly generated, but the escalation is started and never stopped. |
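The interleaving described in this comment can be replayed with a toy model. The sketch below is Python, not Zabbix code: the Db and Tx classes, the row fields, and the "recovery" status are hypothetical stand-ins. The only assumption is that rows inserted by one transaction stay invisible to other transactions until commit.

```python
class Db:
    """Toy store: rows become visible to other transactions only after commit."""

    def __init__(self):
        self.committed = []  # rows that every transaction can see

    def begin(self):
        return Tx(self)


class Tx:
    def __init__(self, db):
        self.db = db
        self.pending = []  # uncommitted inserts, private to this transaction

    def insert(self, row):
        self.pending.append(row)

    def select_escalations(self, triggerid):
        # A transaction sees committed rows plus its own pending rows.
        visible = self.db.committed + self.pending
        return [row for row in visible if row["triggerid"] == triggerid]

    def commit(self):
        self.db.committed.extend(self.pending)
        self.pending = []


def simulate_race():
    db = Db()
    trigger = {"triggerid": 1, "value": "OK"}

    # Process A: trigger goes to PROBLEM, escalation is started, no commit yet.
    a = db.begin()
    trigger["value"] = "PROBLEM"
    a.insert({"triggerid": 1, "status": "active"})

    # Process B: trigger goes back to OK, B looks for escalations to stop.
    b = db.begin()
    trigger["value"] = "OK"
    for row in b.select_escalations(1):  # empty: A has not committed
        row["status"] = "recovery"
    b.commit()

    # Process A finally commits; nobody is left to stop the escalation.
    a.commit()
    return trigger["value"], [row["status"] for row in db.committed]
```

Calling simulate_race() leaves the trigger in the OK state while the escalation row stays active, which is the symptom reported in this issue.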
Comment by Aleksandrs Saveljevs [ 2013 Dec 17 ] |
Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBX-7484 . The main idea behind the fix is that timers and history syncers now lock the triggers they are processing, and this locking happens in the configuration cache. This has the effect that two history syncers cannot simultaneously process two items that are attached to the same trigger. Another effect is that timers now skip triggers that are already being processed by history syncers. The idea is that history syncers will evaluate time functions, too, so there is no point in separate processing by timers. Another change worth mentioning is that timer processes now take triggers to process in batches, instead of all triggers at once. I have tested the changes on 2000 hosts linked to "Template OS Linux", augmented with a bunch of time-based triggers. The graph (see attachment performance.png) shows the effect the changes had on performance.
Either there is a bug, or the changes have been positive. |
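The locking idea from this comment can be sketched as follows. This is a Python illustration under stated assumptions, not the actual fix: the TriggerLocks class and its method names are hypothetical, and a plain in-process set guarded by a mutex stands in for the configuration cache. As a simplification, an item that shares a trigger with an earlier item in the same call is also deferred.

```python
import threading


class TriggerLocks:
    """Hypothetical stand-in for trigger locking in a shared cache."""

    def __init__(self, item_triggers):
        self._mutex = threading.Lock()
        self._locked = set()                 # trigger IDs currently being processed
        self._item_triggers = item_triggers  # itemid -> list of trigger IDs

    def lock_by_itemids(self, itemids):
        """History syncer side: return the items whose triggers were all free."""
        taken = []
        with self._mutex:
            for itemid in itemids:
                triggerids = self._item_triggers.get(itemid, [])
                if any(t in self._locked for t in triggerids):
                    continue  # another process owns a trigger; leave the item for later
                self._locked.update(triggerids)
                taken.append(itemid)
        return taken

    def lock_timer_batch(self, triggerids, batch_size):
        """Timer side: lock up to batch_size free triggers, skipping locked ones."""
        batch = []
        with self._mutex:
            for triggerid in triggerids:
                if len(batch) == batch_size:
                    break
                if triggerid not in self._locked:
                    self._locked.add(triggerid)
                    batch.append(triggerid)
        return batch

    def unlock(self, triggerids):
        with self._mutex:
            self._locked.difference_update(triggerids)


locks = TriggerLocks({10: [1], 11: [1], 12: [2]})
first = locks.lock_by_itemids([10])            # syncer A takes item 10 (trigger 1)
second = locks.lock_by_itemids([11, 12])       # syncer B gets only item 12
timer = locks.lock_timer_batch([1, 2, 3], 2)   # timer skips triggers 1 and 2
```

In this sketch item 11 is deferred because it shares trigger 1 with item 10, and the timer only picks up trigger 3, which no history syncer holds.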
Comment by richlv [ 2013 Dec 17 ] |
(1) that looks really great - let's include the performance improvement in the appropriate whatsnew
asaveljevs Documented that at https://www.zabbix.com/documentation/2.0/manual/introduction/whatsnew2011?&#daemon_improvements . RESOLVED.
<richlv> as discussed on irc, also added a minor clarification about processes that process triggers -> CLOSED
asaveljevs Documented that at https://www.zabbix.com/documentation/2.2/manual/introduction/whatsnew222?&#daemon_improvements , too.
<richlv> thanks, CLOSED
zalex_ua I don't like how we described the changes.
<richlv> hmm, i don't really like the introduction of "complex triggers", we would have to define that. also, is that correct ? i suppose timer could collide with hsyncer even on triggers that do not reference multiple items. as for only one process overall processing triggers, i wouldn't read it as that - see the "at a time" part, does that make the sentence clear ?
zalex_ua ok, I agree not to use the "complex" term. It was not so important.
<richlv> 1) sounds reasonable, changed;
zalex_ua 1) thanks, closed.
<richlv> 2) during server shutdown, main process clears out history/trend caches, and hopefully does so by fully processing triggers, IT services etc
zalex_ua hmm, didn't know that, thanks. |
Comment by Marc [ 2013 Dec 20 ] |
will it find its way into 2.0 too? |
Comment by Aleksandrs Saveljevs [ 2013 Dec 20 ] |
The fix is for 2.0, yes. |
Comment by Andris Zeila [ 2013 Dec 27 ] |
(2) The src/libs/zbxdbcache/dbconfig.c:DCconfig_lock_triggers_by_itemids() function:
asaveljevs Comments:
asaveljevs Documented the first issue in 41179. RESOLVED.
asaveljevs Andris explained that we actually use the indices[] array to remove elements from cache->itemids, and itemids[] is indeed not needed. So I am going to remove the can_take[] array. REOPENED.
asaveljevs RESOLVED in r41180.
wiper CLOSED |
Comment by Andris Zeila [ 2014 Jan 02 ] |
Successfully tested |
Comment by Aleksandrs Saveljevs [ 2014 Jan 02 ] |
There were a lot of conflicts when merging the changes from 2.0 to 2.2, so I created a branch for 2.2 and started porting the changes one by one. In the process I noticed a bug in the 2.0 version, so I recreated a branch for 2.0 to fix it. |
Comment by Aleksandrs Saveljevs [ 2014 Jan 02 ] |
The fix is available in development branch svn://svn.zabbix.com/branches/dev/ZBX-7484-20 (note the "-20" at the end). Please take a look.
wiper Looks good, please review small changes in r41248.
asaveljevs Almost good, except you did not bring back the "const" qualifier for DCconfig_unlock_triggers(). Please see r41250 and also r41251.
wiper ... and r41252
asaveljevs Thank you! CLOSED. |
Comment by Aleksandrs Saveljevs [ 2014 Jan 03 ] |
The changes merged from 2.0 into 2.2, with conflicts resolved, are available in development branch svn://svn.zabbix.com/branches/dev/ZBX-7484 .
wiper looks good, CLOSED |
Comment by Aleksandrs Saveljevs [ 2014 Jan 03 ] |
Fixed in pre-2.0.11 r41255, pre-2.2.2 r41268, and pre-2.3.0 (trunk) r41269. |
Comment by richlv [ 2014 Apr 08 ] |
subissue (1) has not been closed yet, reopening |
Comment by Oleksii Zagorskyi [ 2014 Apr 08 ] |
(1) is closed -> issue closed again. |
Comment by MATSUDA Daiki [ 2015 Nov 13 ] |
This fix causes the problem |