ZABBIX BUGS AND ISSUES
  1. ZABBIX BUGS AND ISSUES
  2. ZBX-7484

Escalations still escalate even when trigger is OK

    Details

      Description

      It happens only on big installation and can be only with triggers with a few items in expression or one item (and time related function).

        Issue Links

          Activity

          Hide
          Aleksandrs Saveljevs added a comment -

          The problem seems to happen on very busy systems where one trigger is processed simultaneously by two processes (e.g., two history syncer processes or one history syncer and one timer).

          Namely, consider function process_actions() in src/zabbix_server/actions.c. Suppose one process, call it A, discovered that a trigger is in a PROBLEM state and starts an escalation. It then goes on processing other events. At the same time, process B, processes another value used in that trigger and discovered that the trigger became OK after that PROBLEM. It sets the trigger to OK, inserts the event, and does select from the database to see whether there are any started escalations (actions.c:1264). However, since A is busy and did not commit its transaction yet, B does not see the escalation started by A, and does not stop the escalation.

          So, in the end, we have that the trigger is correctly in OK state, events are correctly generated, escalation is started, but is not stopped.

          Show
          Aleksandrs Saveljevs added a comment - The problem seems to happen on very busy systems where one trigger is processed simultaneously by two processes (e.g., two history syncer processes or one history syncer and one timer). Namely, consider function process_actions() in src/zabbix_server/actions.c. Suppose one process, call it A, discovered that a trigger is in a PROBLEM state and starts an escalation. It then goes on processing other events. At the same time, process B, processes another value used in that trigger and discovered that the trigger became OK after that PROBLEM. It sets the trigger to OK, inserts the event, and does select from the database to see whether there are any started escalations (actions.c:1264). However, since A is busy and did not commit its transaction yet, B does not see the escalation started by A, and does not stop the escalation. So, in the end, we have that the trigger is correctly in OK state, events are correctly generated, escalation is started, but is not stopped.
          Hide
          Aleksandrs Saveljevs added a comment -

          Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBX-7484 .

          The main idea behind the fix is that timers and history syncers now lock triggers they are processing and such locking happens in configuration cache.

          This has the effect that two history syncers cannot simultaneously process two items that are attached to the same trigger. Another effect is that now timers will skip processing of triggers that are already being processed by history syncers. The idea is that history syncers will evaluate time functions, too, so there is no point in separate processing by timers.

          Another change worth mentioning is that now timer processes will take triggers to process in batches, instead of all triggers at once.

          I have tested the changes on 2000 hosts linked to "Template OS Linux", augmented with a bunch of time-based triggers. The following graph shows the effect the changes had on performance:

          Either there is a bug, or changes have been positive.

          Show
          Aleksandrs Saveljevs added a comment - Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBX-7484 . The main idea behind the fix is that timers and history syncers now lock triggers they are processing and such locking happens in configuration cache. This has the effect that two history syncers cannot simultaneously process two items that are attached to the same trigger. Another effect is that now timers will skip processing of triggers that are already being processed by history syncers. The idea is that history syncers will evaluate time functions, too, so there is no point in separate processing by timers. Another change worth mentioning is that now timer processes will take triggers to process in batches, instead of all triggers at once. I have tested the changes on 2000 hosts linked to "Template OS Linux", augmented with a bunch of time-based triggers. The following graph shows the effect the changes had on performance: Either there is a bug, or changes have been positive.
          Hide
          richlv added a comment - - edited

          (1) that looks really great - let's include performance improvement in appropriate whatsnew

          Aleksandrs Saveljevs Documented that at https://www.zabbix.com/documentation/2.0/manual/introduction/whatsnew2011?&#daemon_improvements . RESOLVED.

          <richlv> as discussed on irc, also added minor clarification about processes that process triggers -> CLOSED

          Aleksandrs Saveljevs Documented that at https://www.zabbix.com/documentation/2.2/manual/introduction/whatsnew222?&#daemon_improvements , too.

          <richlv> thanks, CLOSED

          Oleksiy Zagorskyi I don't like how we described the changes.
          Before I looked to this issue I understood changes (described in doc) incorrectly.
          If we read only doc then it looks that only one syncer or timer at all will process triggers, other ones will do other tasks.
          What If we change it to this:
          "A complex trigger can now only be processed by one process (history syncer or timer) at a time, ..."
          ?
          REOPENED

          <richlv> hmm, i don't really like introduction of "complex triggers", we would have to define that. also, is that correct ? i suppose timer could collide with hsyncer even on triggers that do not reference multiple items. as for only one process overall processing triggers, i wouldn't read it as that - see the "at a time" part, does that make the sentence clear ?
          regarding "complex" trigger part, we'd need a comment from asaveljevs on that

          Oleksiy Zagorskyi ok, I agree to not use "complex" term. It was not so important.
          My main goals:
          1) to not use plural form because it looks like "for all rigger in zabbix" only one zabbix process. Use singular form - "A trigger ... "
          2) don't use some unknown "main ... process" term, because we would have to define that . My form "one process (history syncer or timer) at a time" is simple and clear, IMO.

          <richlv> 1) sounds reasonable, changed;
          2) we don't have to define it any more than poller, timer or history syncer
          "server #0 started [main process]"

          Oleksiy Zagorskyi 1) thanks, closed.
          2) hmm, what is "main" process in documentation? Does "server #0 started [main process]" able to process triggers ?

          <richlv> 2) during server shutdown, main process clears out history/trend caches, and hopefully does so by fully processing triggers, it services etc

          Oleksiy Zagorskyi hhm, didn't know that, thanks.
          well, CLOSED.

          Show
          richlv added a comment - - edited (1) that looks really great - let's include performance improvement in appropriate whatsnew Aleksandrs Saveljevs Documented that at https://www.zabbix.com/documentation/2.0/manual/introduction/whatsnew2011?&#daemon_improvements . RESOLVED. <richlv> as discussed on irc, also added minor clarification about processes that process triggers -> CLOSED Aleksandrs Saveljevs Documented that at https://www.zabbix.com/documentation/2.2/manual/introduction/whatsnew222?&#daemon_improvements , too. <richlv> thanks, CLOSED Oleksiy Zagorskyi I don't like how we described the changes. Before I looked to this issue I understood changes (described in doc) incorrectly. If we read only doc then it looks that only one syncer or timer at all will process triggers, other ones will do other tasks. What If we change it to this: "A complex trigger can now only be processed by one process (history syncer or timer) at a time, ..." ? REOPENED <richlv> hmm, i don't really like introduction of "complex triggers", we would have to define that. also, is that correct ? i suppose timer could collide with hsyncer even on triggers that do not reference multiple items. as for only one process overall processing triggers, i wouldn't read it as that - see the "at a time" part, does that make the sentence clear ? regarding "complex" trigger part, we'd need a comment from asaveljevs on that Oleksiy Zagorskyi ok, I agree to not use "complex" term. It was not so important. My main goals: 1) to not use plural form because it looks like "for all rigger in zabbix" only one zabbix process. Use singular form - "A trigger ... " 2) don't use some unknown "main ... process" term, because we would have to define that . My form "one process (history syncer or timer) at a time" is simple and clear, IMO. <richlv> 1) sounds reasonable, changed; 2) we don't have to define it any more than poller, timer or history syncer "server #0 started [main process] " Oleksiy Zagorskyi 1) thanks, closed. 2) hmm, what is "main" process in documentation? Does "server #0 started [main process] " able to process triggers ? <richlv> 2) during server shutdown, main process clears out history/trend caches, and hopefully does so by fully processing triggers, it services etc Oleksiy Zagorskyi hhm, didn't know that, thanks. well, CLOSED.
          Hide
          Marc added a comment -

          will it find its way into 2.0 too?

          Show
          Marc added a comment - will it find its way into 2.0 too?
          Hide
          Aleksandrs Saveljevs added a comment -

          The fix is for 2.0, yes.

          Show
          Aleksandrs Saveljevs added a comment - The fix is for 2.0, yes.
          Hide
          Andris Zeila added a comment - - edited

          (2) The src/libs/zbxdbcache/dbconfig.c:DCconfig_lock_triggers_by_itemids() function:

          • This function expects the can_take array to be initialized with 1. It would be better if the array would be initialized inside it or at least document it. One solution might be to drop the can_take parameter and set the corresponding itemids array element to zero.
          • goto next can be replaced with continue like in the first check if item or its triggers exists
          • if two items used in one trigger are passed in itemids, then the trigger will be locked for the first item and the second item will be marked as 'can't take'

          Aleksandrs Saveljevs Comments:

          • Regarding can_take initialization, I thought it would be better to initialize it outside the function, because DCconfig_lock_triggers_by_itemids() is only called in server, but not in proxy. Regarding setting itemids array elements to zero, we cannot do that in the current scheme, because we have to know which item IDs we cannot take and have to remove them from cache->itemids. I believe this approach of adding them to cache->itemids, then calling DCconfig_lock_triggers_by_itemids(), then removing if there is a conflict is a good one, because removing is an exceptional situation and should only happen very rarely.
          • goto next cannot be replaced with continue, because continue will affect the trigger loop ("for (j = 0; NULL != (dc_trigger = dc_item->triggers[j]); j++)"), not the item loop ("for (i = 0; i < itemids_num; i++)").
          • This is intended and helps to process trigger events in the correct order. Suppose our history cache has three records for two items: A, A, B. In the implemented scheme, we will process the first A, then the second A, then B. If we allow to take B if A is taken, then we will process the first A, then B, then the second A, resulting in the events out of order.

          Aleksandrs Saveljevs Documented the first issue in 41179. RESOLVED.

          Aleksandrs Saveljevs Andris explained that we actually use indices[] array to remove elements from cache->itemids and itemids[] is not needed indeed. So going to remove can_take[] array. REOPENED.

          Aleksandrs Saveljevs RESOLVED in r41180.

          Andris Zeila CLOSED

          Show
          Andris Zeila added a comment - - edited (2) The src/libs/zbxdbcache/dbconfig.c:DCconfig_lock_triggers_by_itemids() function: This function expects the can_take array to be initialized with 1. It would be better if the array would be initialized inside it or at least document it. One solution might be to drop the can_take parameter and set the corresponding itemids array element to zero. goto next can be replaced with continue like in the first check if item or its triggers exists if two items used in one trigger are passed in itemids, then the trigger will be locked for the first item and the second item will be marked as 'can't take' Aleksandrs Saveljevs Comments: Regarding can_take initialization, I thought it would be better to initialize it outside the function, because DCconfig_lock_triggers_by_itemids() is only called in server, but not in proxy. Regarding setting itemids array elements to zero, we cannot do that in the current scheme, because we have to know which item IDs we cannot take and have to remove them from cache->itemids. I believe this approach of adding them to cache->itemids, then calling DCconfig_lock_triggers_by_itemids(), then removing if there is a conflict is a good one, because removing is an exceptional situation and should only happen very rarely. goto next cannot be replaced with continue , because continue will affect the trigger loop ("for (j = 0; NULL != (dc_trigger = dc_item->triggers [j] ); j++)"), not the item loop ("for (i = 0; i < itemids_num; i++)"). This is intended and helps to process trigger events in the correct order. Suppose our history cache has three records for two items: A, A, B. In the implemented scheme, we will process the first A, then the second A, then B. If we allow to take B if A is taken, then we will process the first A, then B, then the second A, resulting in the events out of order. Aleksandrs Saveljevs Documented the first issue in 41179. RESOLVED. Aleksandrs Saveljevs Andris explained that we actually use indices[] array to remove elements from cache->itemids and itemids[] is not needed indeed. So going to remove can_take[] array. REOPENED. Aleksandrs Saveljevs RESOLVED in r41180. Andris Zeila CLOSED
          Hide
          Andris Zeila added a comment -

          Successfully tested

          Show
          Andris Zeila added a comment - Successfully tested
          Hide
          Aleksandrs Saveljevs added a comment -

          There were a lot of conflicts when merging the changes from 2.0 to 2.2, so I created a branch for 2.2 and started porting changes one by one.

          In the process I noticed a bug in 2.0 version, so I recreated a branch for 2.0 to fix it.

          Show
          Aleksandrs Saveljevs added a comment - There were a lot of conflicts when merging the changes from 2.0 to 2.2, so I created a branch for 2.2 and started porting changes one by one. In the process I noticed a bug in 2.0 version, so I recreated a branch for 2.0 to fix it.
          Hide
          Aleksandrs Saveljevs added a comment - - edited

          The fix is available in development branch svn://svn.zabbix.com/branches/dev/ZBX-7484-20 (note the "-20" at the end). Please take a look.

          Andris Zeila Looks good, please review small changes in r41248.

          Aleksandrs Saveljevs Almost good, except you did not bring back the "const" qualifier for DCconfig_unlock_triggers(). Please see r41250 and also r41251.

          Andris Zeila ... and r41252

          Aleksandrs Saveljevs Thank you! CLOSED.

          Show
          Aleksandrs Saveljevs added a comment - - edited The fix is available in development branch svn://svn.zabbix.com/branches/dev/ZBX-7484-20 (note the "-20" at the end). Please take a look. Andris Zeila Looks good, please review small changes in r41248. Aleksandrs Saveljevs Almost good, except you did not bring back the "const" qualifier for DCconfig_unlock_triggers(). Please see r41250 and also r41251. Andris Zeila ... and r41252 Aleksandrs Saveljevs Thank you! CLOSED.
          Hide
          Aleksandrs Saveljevs added a comment - - edited

          Resolved conflicts merging from 2.0 into 2.2 are available in development branch svn://svn.zabbix.com/branches/dev/ZBX-7484 .

          Andris Zeila looks good, CLOSED

          Show
          Aleksandrs Saveljevs added a comment - - edited Resolved conflicts merging from 2.0 into 2.2 are available in development branch svn://svn.zabbix.com/branches/dev/ZBX-7484 . Andris Zeila looks good, CLOSED
          Hide
          Aleksandrs Saveljevs added a comment -

          Fixed in pre-2.0.11 r41255, pre-2.2.2 r41268, and pre-2.3.0 (trunk) r41269.

          Show
          Aleksandrs Saveljevs added a comment - Fixed in pre-2.0.11 r41255, pre-2.2.2 r41268, and pre-2.3.0 (trunk) r41269.
          Hide
          richlv added a comment -

          subissue (1) has not been closed yet, reopening

          Show
          richlv added a comment - subissue (1) has not been closed yet, reopening
          Hide
          Oleksiy Zagorskyi added a comment -

          (1) is closed -> issue closed again.

          Show
          Oleksiy Zagorskyi added a comment - (1) is closed -> issue closed again.
          Hide
          MATSUDA Daiki added a comment -

          This fix occures the problem ZBX-10064 on shutdown.

          Show
          MATSUDA Daiki added a comment - This fix occures the problem ZBX-10064 on shutdown.

            People

            • Assignee:
              Unassigned
              Reporter:
              Alexey Pustovalov
            • Votes:
              1 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: