ZABBIX BUGS AND ISSUES / ZBX-23633

Service alarms storing values in the wrong order and affecting the SLA


Details

    • Problem report
    • Resolution: Unresolved
    • Critical
    • 6.4.6
    • Server (S)
    • None
    • Zabbix server and proxies running on containers (podman).
    • Team A
    • S2401-1, S24-W6/7, S24-W8/9
    • 0.5

    Description

      Hi,
      We are dealing with an issue where the SLA calculation is affected by entries that seem to be inserted into the database incorrectly. As you can see in the screenshot below, the service (with ID 4668) changes its value from 5 (Disaster) to -1 (OK), but in the highlighted portion of the screenshot the severity changes from -1 to -1, and also from 5 to 5. This directly affects the SLA calculation for our customers.
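      To make the symptom concrete, here is a minimal Python sketch, assuming a simplified model in which each service alarm row carries only a timestamp and the new severity (-1 = OK, 5 = Disaster). The names `Alarm`, `find_anomalies` and `problem_seconds` are hypothetical, and this is not how Zabbix computes the SLA; it only illustrates why rows stored out of order (showing up as 5 -> 5 and -1 -> -1 transitions) can change the computed problem time.

```python
from typing import NamedTuple

OK = -1  # severity value for the OK state, as used in this report


class Alarm(NamedTuple):
    # Hypothetical record shape: one stored state change of a single service.
    clock: int  # Unix timestamp of the state change
    value: int  # new severity (-1 = OK, 5 = Disaster)


def find_anomalies(alarms):
    """Return (index, reason) for rows that break the expected pattern,
    checked in the order the rows were stored."""
    anomalies = []
    for i in range(1, len(alarms)):
        prev, cur = alarms[i - 1], alarms[i]
        if cur.clock < prev.clock:
            anomalies.append((i, "clock goes backwards"))
        if cur.value == prev.value:
            anomalies.append((i, "value unchanged"))
    return anomalies


def problem_seconds(alarms, start, end):
    """Sum the time spent in a non-OK state within [start, end),
    trusting the stored row order (simplified model, not Zabbix's algorithm)."""
    total = 0
    # Pair each row with the next one; the last row is closed at `end`.
    for cur, nxt in zip(alarms, list(alarms[1:]) + [Alarm(end, OK)]):
        if cur.value != OK:
            lo = max(cur.clock, start)
            hi = min(nxt.clock, end)
            total += max(0, hi - lo)
    return total
```

      In this model, the real sequence Disaster at 100, OK at 160, Disaster at 200, OK at 260 gives 120 seconds of problem time. If the same rows are stored as `[Alarm(100, 5), Alarm(200, 5), Alarm(160, OK), Alarm(260, OK)]`, `find_anomalies` reports the 5 -> 5 and -1 -> -1 pairs plus a backwards clock, and `problem_seconds` returns 100 instead of 120.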

      Currently we are correcting the affected records manually, which is quite troublesome: if we insert a single wrong value, the result can be incorrect.
      As evidence that this affects the SLI in our reports, I performed some API requests:
      Before update / After update

      After some investigation, we found that these cases are associated with "Negative duration" problems. I don't believe this is a coincidence, since it happens numerous times a day.

      As for the negative duration events, we have checked repeatedly, and they do not seem to be caused by unsynchronized time between the server and the proxy, for example. An NTP server keeps time for our whole network, and every time we check, the clocks are in sync.

      The affected items are always ICMP ping items.
      Below is an example of an occurrence:

      Time synchronization between hosts:

      Since the server and proxies run in containers, we also checked whether the time between the containers could be out of sync, and everything is fine.

      This behavior seems to contradict what is stated in the documentation:

      NOTE: Negative problem duration is not affecting SLA calculation or Availability report of a particular trigger in any way; it neither reduces nor expands problem time.

      In our case, however, it is clearly affecting our SLA reports.
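      The documented expectation can be expressed as a small check. This is a minimal Python sketch, assuming each problem is reduced to a pair of timestamps; `ProblemEvent`, `negative_duration_events` and `total_problem_time` are hypothetical names, not Zabbix code. A negative duration is flagged and counted as zero, so it neither shortens nor lengthens the total problem time.

```python
from typing import NamedTuple


class ProblemEvent(NamedTuple):
    # Hypothetical shape: a problem and the event that resolved it.
    problem_clock: int   # Unix time the problem started
    recovery_clock: int  # Unix time of the recovery event


def negative_duration_events(events):
    """Return events whose recovery is timestamped before the problem itself."""
    return [e for e in events if e.recovery_clock < e.problem_clock]


def total_problem_time(events):
    """Sum problem durations, counting a negative duration as zero so it
    neither reduces nor expands the total (as the documentation note describes)."""
    return sum(max(0, e.recovery_clock - e.problem_clock) for e in events)
```

      For `[ProblemEvent(100, 160), ProblemEvent(300, 290), ProblemEvent(400, 430)]`, the second event is flagged and the total is 90 seconds; summing the raw differences would give 80, which is the kind of distortion the note says should not happen.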

      As mentioned above, our current workaround is to apply changes manually to obtain correct SLI and SLA reports.


      Attachments

        1. another-kind-of-item.png (174 kB)
        2. another-kind-of-item-1.png (174 kB)
        3. de-volta-v6-4-11.png (510 kB)
        4. image.png (418 kB)
        5. image-2023-10-31-16-10-26-649.png (261 kB)
        6. image-2023-10-31-16-14-42-261.png (19 kB)
        7. image-2023-10-31-16-14-58-544.png (19 kB)
        8. image-2023-10-31-16-24-33-894.png (418 kB)
        9. image-2023-10-31-16-26-41-047.png (67 kB)
        10. image-2023-10-31-16-34-44-134.png (534 kB)
        11. keep-with-bug.png (262 kB)


          People

            Vladislavs Sokurenko
            Victor Breda Credidio
            Votes: 11
            Watchers: 9
