ZABBIX BUGS AND ISSUES
ZBX-4732

Events with wrong timestamp during high load on zabbix server -> wrong Availability report

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Cannot Reproduce
    • Affects Version/s: 1.8.10
    • Fix Version/s: None
    • Component/s: Frontend (F), Server (S)
    • Labels:
    • Environment:
      Linux (EL 6), Mysql (5.1)

      Description

      Sometimes Zabbix produces false-positive alerts (based on the trigger {agent.ping.nodata(180)}=1) during high I/O load on the server, e.g. while a backup is running on the filesystem that holds the Zabbix database. We suspect that events end up stored in swapped order, probably because of wrong clock values, even though the event IDs appear to be stored in the right order (please see the attached pictures). The Availability report then generates graphs showing hosts as mostly down.

      Maybe this is somehow related to #ZBX-4466.
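
The symptom described above (event IDs ascending while clock values go backwards) can be checked for mechanically. The following is a minimal sketch, not a Zabbix tool: the `Event` tuple and the sample rows are hypothetical stand-ins for rows exported from the events table, and only the eventid/clock/value fields are assumed.

```python
from typing import List, NamedTuple, Tuple


class Event(NamedTuple):
    eventid: int
    clock: int   # Unix timestamp in seconds
    value: str   # "OK" or "PROBLEM"


def find_clock_inversions(events: List[Event]) -> List[Tuple[Event, Event]]:
    """Return neighbouring pairs (in eventid order) whose clock goes backwards."""
    by_id = sorted(events, key=lambda e: e.eventid)
    return [(a, b) for a, b in zip(by_id, by_id[1:]) if b.clock < a.clock]


# Hypothetical sample rows, not taken from the attached screenshots.
events = [
    Event(101, 1000, "OK"),
    Event(102, 1060, "PROBLEM"),
    Event(103, 1059, "OK"),
    Event(104, 1120, "PROBLEM"),
]

inversions = find_clock_inversions(events)
for a, b in inversions:
    print(f"event {b.eventid} (clock {b.clock}) is stamped earlier than event {a.eventid} (clock {a.clock})")
```

Any pair reported here is a place where a report that sorts by clock would see the events in a different order than they were created.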

      1. e0.png (192 kB)
      2. e1.png (192 kB)
      3. O_new_trigger_error.jpg (250 kB)
      4. zabbix_1.png (14 kB)
      5. zabbix_2.png (5 kB)

        Activity

        Oleksiy Zagorskyi added a comment -

        I think I hit this case several days ago while experimenting with a trigger expression that used the nodata(30-60) function.

        ./zabbix_server18 -V
        Zabbix Server v1.8.11rc1 (revision 25522) (28 December 2011)
        Compilation time: Feb 22 2012 10:42:33

        Daniel Kontsek added a comment -

        It's mentioned in the bug report: {agent.ping.nodata(180)}=1

        Oleksiy Zagorskyi added a comment - edited

        ZBX-4763 is a very similar issue; the source of the problem may even be the same.

        <ADDED> ZBX-6170 is also a similar issue.

        Oleksiy Zagorskyi added a comment -

        Heh, I would like to share my opinion.

        In the pictures attached by Daniel (the issue reporter) we see several PROBLEM events repeated in a row. It's not clear to me why that happened.
        Additionally, we don't know the update interval of that item; maybe it is 180 seconds (the same as in the trigger function)?
        But in the events we see that the OK event was generated exactly at the start of a minute. OK events can be generated only by the "db syncer" process, when a value is received and the trigger is in the PROBLEM state.
        PROBLEM events can be generated only by the "timer" process, when the trigger is in the OK or UNKNOWN (because of a server restart) state.

        So I suppose the data came from the item exactly at the start of the minute (processed by "db syncer"), and we know that "timer" is executed every 30 seconds, exactly at 00 and 30 seconds.

        The events Daniel draws attention to are less interesting to me than some other events.
        For example:
        10:16:00 - PROBLEM,
        10:16:00 - OK
        and
        00:04:00 - PROBLEM,
        00:04:01 - OK

        I can show a clearer case. See the attached "O_new_trigger_error.jpg", where you will find all the details.
        We see that the trigger was processed by two processes at almost the same time:
        11:02:31 = eventID 10959666 - "timer" process generated PROBLEM event
        11:02:30 = eventID 10959668 - "db syncer" process generated OK event

        It is not clear here why "db syncer" decided that the trigger was in the PROBLEM state. It's possible that "timer" had already changed it to PROBLEM (in some cache or in the table),
        and that "db syncer" then changed the state to OK. :/

        So we should prevent such cases somehow.
        The Zabbix server version is 1.8.6 in this last example.
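
To make the suspected interleaving concrete, here is a toy model of the two rules described in this comment. It is only a sketch under assumptions and not Zabbix code: the `Trigger` class and its method names are hypothetical, and it assumes that a "timer" event is stamped with the timer's own time while a "db syncer" event is stamped with the timestamp of the received value.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Trigger:
    state: str = "OK"
    next_eventid: int = 1
    # Each event is (eventid, clock, value).
    events: List[Tuple[int, int, str]] = field(default_factory=list)

    def _emit(self, clock: int, value: str) -> None:
        self.events.append((self.next_eventid, clock, value))
        self.next_eventid += 1
        self.state = value

    def timer_tick(self, now: int, last_value_clock: int, period: int) -> None:
        # "timer": may only generate PROBLEM, stamped with its own time (assumption).
        if self.state != "PROBLEM" and now - last_value_clock > period:
            self._emit(now, "PROBLEM")

    def db_sync(self, value_clock: int) -> None:
        # "db syncer": may only generate OK, stamped with the value's timestamp (assumption).
        if self.state == "PROBLEM":
            self._emit(value_clock, "OK")


trigger = Trigger()
# The last processed value is from t=0; a value stamped t=210 is still waiting to be synced.
trigger.timer_tick(now=211, last_value_clock=0, period=180)  # timer runs first
trigger.db_sync(value_clock=210)                             # the delayed value is processed second

final_by_eventid = sorted(trigger.events, key=lambda e: e[0])[-1][2]
final_by_clock = sorted(trigger.events, key=lambda e: e[1])[-1][2]
print(trigger.events, final_by_eventid, final_by_clock)
```

In this model the PROBLEM event gets the lower event ID but the later clock, which matches the shape of the 11:02:31 / 11:02:30 pair above: read in event ID order the trigger ends up OK, read in clock order it ends up in PROBLEM.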

        Daniel Kontsek added a comment -

        Item key: agent.ping
        Item type: Zabbix Agent
        Item update interval: 60 s
        Trigger: {agent.ping.nodata(180)}=1
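
With a 60 s update interval and nodata(180), several consecutive values have to be missing or delayed before the trigger can fire. The sketch below illustrates that timing under simplifying assumptions: `nodata_fires` is a hypothetical helper, the window boundaries are chosen for illustration and may differ from the server's exact behaviour, and the check is run every 30 seconds as mentioned for the "timer" process.

```python
from typing import List


def nodata_fires(tick: int, value_clocks: List[int], period: int) -> bool:
    """True if no value was processed in the window (tick - period, tick]."""
    return not any(tick - period < c <= tick for c in value_clocks)


# Values normally arrive every 60 s; here the ones after t=120 are held up until t=330
# (hypothetical delay, e.g. caused by I/O load).
value_clocks = [0, 60, 120, 330]

# Evaluate the check every 30 seconds over the first six minutes.
firing_ticks = [t for t in range(0, 361, 30) if nodata_fires(t, value_clocks, 180)]
print(firing_ticks)
```

In this example the check fires only at the single tick that falls between the end of the 180-second window and the arrival of the delayed value, which is the kind of short false positive discussed in this issue.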

        Daniel Kontsek added a comment -

        Any news regarding this problem?

        Cristian Mammoli added a comment - edited

        Hi, we are having the exact same problem (see attachments)

        Zabbix 2.2.3, DB PostgreSQL 9.2

        Roelof Spijker added a comment -

        We are seeing a very similar issue on 2.2.3 with MySQL. Events are generated in the wrong order. The real order would be: Up - Down for 1 second - Up. But they are ordered as Up - Up - Down for 1 second. This causes the SLA to record the service as down until the next issue occurs and is resolved. It can be fixed by decreasing the clock in the DB for the affected rows in events and service_alarms, but I'm not sure why it happens in the first place.
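
The effect on the SLA can be illustrated with a small downtime calculation. This is a sketch with hypothetical timestamps, not the frontend's actual SLA code: `downtime` is an illustrative helper that walks (clock, value) pairs in clock order and sums the time spent DOWN up to the end of the reporting period.

```python
from typing import List, Optional, Tuple


def downtime(events: List[Tuple[int, str]], end: int) -> int:
    """Seconds spent DOWN up to `end`, reading (clock, value) events in clock order."""
    total = 0
    down_since: Optional[int] = None
    for clock, value in sorted(events, key=lambda e: e[0]):
        if value == "DOWN" and down_since is None:
            down_since = clock
        elif value == "UP" and down_since is not None:
            total += clock - down_since
            down_since = None
    if down_since is not None:
        # Still DOWN at the end of the reporting period.
        total += end - down_since
    return total


# Hypothetical clocks as stored: Up - Up - Down.
stored = [(100, "UP"), (200, "UP"), (201, "DOWN")]
# Same events after decreasing the clock of the Down event: Up - Down - Up.
corrected = [(100, "UP"), (199, "DOWN"), (200, "UP")]

print(downtime(stored, end=3600), downtime(corrected, end=3600))
```

With the stored order the service counts as down from the misplaced event until the end of the period, while the corrected order yields the expected one second of downtime.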

        Oleksiy Zagorskyi added a comment -

        I feel that this issue is no longer relevant for recent Zabbix versions (2.4+).
        So I'm closing it as ... hmmm ... cannot reproduce.

        Feel free to ask for it to be reopened if you think I did the wrong thing.

        Just note that ZBX-8556 and ZBX-9432 may look similar to this issue, but they are different.


          People

          • Assignee: Unassigned
          • Reporter: Daniel Kontsek
          • Votes: 1
          • Watchers: 6
