[ZBX-7484] Escalations still escalate even when trigger is OK Created: 2013 Dec 04  Updated: 2017 May 30  Resolved: 2014 Apr 08

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Server (S)
Affects Version/s: 2.0.10rc1, 2.2.1rc1
Fix Version/s: 2.0.11rc1, 2.2.2rc1, 2.3.0

Type: Incident report Priority: Major
Reporter: Alexey Pustovalov Assignee: Unassigned
Resolution: Fixed Votes: 1
Labels: escalations
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File performance.png    
Issue Links:
Duplicate
duplicates ZBX-7159 Duplicate events generated by change(... Closed

 Description   

It happens only on big installation and can be only with triggers with a few items in expression or one item (and time related function).



 Comments   
Comment by Aleksandrs Saveljevs [ 2013 Dec 09 ]

The problem seems to happen on very busy systems where one trigger is processed simultaneously by two processes (e.g., two history syncer processes or one history syncer and one timer).

Namely, consider function process_actions() in src/zabbix_server/actions.c. Suppose one process, call it A, discovered that a trigger is in a PROBLEM state and starts an escalation. It then goes on processing other events. At the same time, process B, processes another value used in that trigger and discovered that the trigger became OK after that PROBLEM. It sets the trigger to OK, inserts the event, and does select from the database to see whether there are any started escalations (actions.c:1264). However, since A is busy and did not commit its transaction yet, B does not see the escalation started by A, and does not stop the escalation.

So, in the end, we have that the trigger is correctly in OK state, events are correctly generated, escalation is started, but is not stopped.

Comment by Aleksandrs Saveljevs [ 2013 Dec 17 ]

Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBX-7484 .

The main idea behind the fix is that timers and history syncers now lock triggers they are processing and such locking happens in configuration cache.

This has the effect that two history syncers cannot simultaneously process two items that are attached to the same trigger. Another effect is that now timers will skip processing of triggers that are already being processed by history syncers. The idea is that history syncers will evaluate time functions, too, so there is no point in separate processing by timers.

Another change worth mentioning is that now timer processes will take triggers to process in batches, instead of all triggers at once.

I have tested the changes on 2000 hosts linked to "Template OS Linux", augmented with a bunch of time-based triggers. The following graph shows the effect the changes had on performance:

Either there is a bug, or changes have been positive.

Comment by richlv [ 2013 Dec 17 ]

(1) that looks really great - let's include performance improvement in appropriate whatsnew

asaveljevs Documented that at https://www.zabbix.com/documentation/2.0/manual/introduction/whatsnew2011?&#daemon_improvements . RESOLVED.

<richlv> as discussed on irc, also added minor clarification about processes that process triggers -> CLOSED

asaveljevs Documented that at https://www.zabbix.com/documentation/2.2/manual/introduction/whatsnew222?&#daemon_improvements , too.

<richlv> thanks, CLOSED

zalex_ua I don't like how we described the changes.
Before I looked to this issue I understood changes (described in doc) incorrectly.
If we read only doc then it looks that only one syncer or timer at all will process triggers, other ones will do other tasks.
What If we change it to this:
"A complex trigger can now only be processed by one process (history syncer or timer) at a time, ..."
?
REOPENED

<richlv> hmm, i don't really like introduction of "complex triggers", we would have to define that. also, is that correct ? i suppose timer could collide with hsyncer even on triggers that do not reference multiple items. as for only one process overall processing triggers, i wouldn't read it as that - see the "at a time" part, does that make the sentence clear ?
regarding "complex" trigger part, we'd need a comment from asaveljevs on that

zalex_ua ok, I agree to not use "complex" term. It was not so important.
My main goals:
1) to not use plural form because it looks like "for all rigger in zabbix" only one zabbix process. Use singular form - "A trigger ... "
2) don't use some unknown "main ... process" term, because we would have to define that . My form "one process (history syncer or timer) at a time" is simple and clear, IMO.

<richlv> 1) sounds reasonable, changed;
2) we don't have to define it any more than poller, timer or history syncer
"server #0 started [main process]"

zalex_ua 1) thanks, closed.
2) hmm, what is "main" process in documentation? Does "server #0 started [main process]" able to process triggers ?

<richlv> 2) during server shutdown, main process clears out history/trend caches, and hopefully does so by fully processing triggers, it services etc

zalex_ua hhm, didn't know that, thanks.
well, CLOSED.

Comment by Marc [ 2013 Dec 20 ]

will it find its way into 2.0 too?

Comment by Aleksandrs Saveljevs [ 2013 Dec 20 ]

The fix is for 2.0, yes.

Comment by Andris Zeila [ 2013 Dec 27 ]

(2) The src/libs/zbxdbcache/dbconfig.c:DCconfig_lock_triggers_by_itemids() function:

  • This function expects the can_take array to be initialized with 1. It would be better if the array would be initialized inside it or at least document it. One solution might be to drop the can_take parameter and set the corresponding itemids array element to zero.
  • goto next can be replaced with continue like in the first check if item or its triggers exists
  • if two items used in one trigger are passed in itemids, then the trigger will be locked for the first item and the second item will be marked as 'can't take'

asaveljevs Comments:

  • Regarding can_take initialization, I thought it would be better to initialize it outside the function, because DCconfig_lock_triggers_by_itemids() is only called in server, but not in proxy. Regarding setting itemids array elements to zero, we cannot do that in the current scheme, because we have to know which item IDs we cannot take and have to remove them from cache->itemids. I believe this approach of adding them to cache->itemids, then calling DCconfig_lock_triggers_by_itemids(), then removing if there is a conflict is a good one, because removing is an exceptional situation and should only happen very rarely.
  • goto next cannot be replaced with continue, because continue will affect the trigger loop ("for (j = 0; NULL != (dc_trigger = dc_item->triggers[j]); j++)"), not the item loop ("for (i = 0; i < itemids_num; i++)").
  • This is intended and helps to process trigger events in the correct order. Suppose our history cache has three records for two items: A, A, B. In the implemented scheme, we will process the first A, then the second A, then B. If we allow to take B if A is taken, then we will process the first A, then B, then the second A, resulting in the events out of order.

asaveljevs Documented the first issue in 41179. RESOLVED.

asaveljevs Andris explained that we actually use indices[] array to remove elements from cache->itemids and itemids[] is not needed indeed. So going to remove can_take[] array. REOPENED.

asaveljevs RESOLVED in r41180.

wiper CLOSED

Comment by Andris Zeila [ 2014 Jan 02 ]

Successfully tested

Comment by Aleksandrs Saveljevs [ 2014 Jan 02 ]

There were a lot of conflicts when merging the changes from 2.0 to 2.2, so I created a branch for 2.2 and started porting changes one by one.

In the process I noticed a bug in 2.0 version, so I recreated a branch for 2.0 to fix it.

Comment by Aleksandrs Saveljevs [ 2014 Jan 02 ]

The fix is available in development branch svn://svn.zabbix.com/branches/dev/ZBX-7484-20 (note the "-20" at the end). Please take a look.

wiper Looks good, please review small changes in r41248.

asaveljevs Almost good, except you did not bring back the "const" qualifier for DCconfig_unlock_triggers(). Please see r41250 and also r41251.

wiper ... and r41252

asaveljevs Thank you! CLOSED.

Comment by Aleksandrs Saveljevs [ 2014 Jan 03 ]

Resolved conflicts merging from 2.0 into 2.2 are available in development branch svn://svn.zabbix.com/branches/dev/ZBX-7484 .

wiper looks good, CLOSED

Comment by Aleksandrs Saveljevs [ 2014 Jan 03 ]

Fixed in pre-2.0.11 r41255, pre-2.2.2 r41268, and pre-2.3.0 (trunk) r41269.

Comment by richlv [ 2014 Apr 08 ]

subissue (1) has not been closed yet, reopening

Comment by Oleksii Zagorskyi [ 2014 Apr 08 ]

(1) is closed -> issue closed again.

Comment by MATSUDA Daiki [ 2015 Nov 13 ]

This fix occures the problem ZBX-10064 on shutdown.

Generated at Wed Apr 24 12:09:55 EEST 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.