  1. ZABBIX BUGS AND ISSUES
  2. ZBX-17194

Trigger first reports OK and then recovers to OK, instead of PROBLEM followed by OK


Details

    • Type: Problem report
    • Status: Elaborating
    • Priority: Trivial
    • Resolution: Unresolved
    • Affects Version/s: 4.0.15, 5.0.10
    • Fix Version/s: None
    • Component/s: Server (S)
    • Labels: None
    • Environment: Debian 9 Stretch, 4 vCPU, 8 GB, 240 NVPS, database on a separate VM
    • Team: Team A
    • Sprint: Sprint 77 (Jun 2021), Sprint 78 (Jul 2021), Sprint 79 (Aug 2021), Sprint 80 (Sep 2021), Sprint 81 (Oct 2021), Sprint 82 (Nov 2021), Sprint 83 (Dec 2021)

    Description

      In some cases a trigger does not produce a PROBLEM notification. Instead, the first message reports status OK, and it is followed by a recovery message that also reports OK.

      Configured action operations message:

      Trigger: {EVENT.NAME}
      Trigger status: {TRIGGER.STATUS}
      Trigger severity: {TRIGGER.SEVERITY}
      Host IP: {HOST.IP}
      Hostname: {HOST.HOST}
      Event time: {EVENT.DATE} at {EVENT.TIME}

      Item values:
      1. {ITEM.NAME1} ({HOST.NAME1}:{ITEM.KEY1}): {ITEM.VALUE1}
      2. {ITEM.NAME2} ({HOST.NAME2}:{ITEM.KEY2}): {ITEM.VALUE2}
      3. {ITEM.NAME3} ({HOST.NAME3}:{ITEM.KEY3}): {ITEM.VALUE3}

      Original event ID: {EVENT.ID}

      Configured action recovery operations message:

      Trigger: {EVENT.NAME}
      Trigger status: {TRIGGER.STATUS}
      Trigger severity: {TRIGGER.SEVERITY}
      Host IP: {HOST.IP}
      Hostname: {HOST.HOST}
      Event recovery time: {EVENT.RECOVERY.DATE} at {EVENT.RECOVERY.TIME}
      Original event time: {EVENT.DATE} at {EVENT.TIME}

      Item values:
      1. {ITEM.NAME1} ({HOST.NAME1}:{ITEM.KEY1}): {ITEM.VALUE1}
      2. {ITEM.NAME2} ({HOST.NAME2}:{ITEM.KEY2}): {ITEM.VALUE2}
      3. {ITEM.NAME3} ({HOST.NAME3}:{ITEM.KEY3}): {ITEM.VALUE3}

      Recovery event ID: {EVENT.RECOVERY.ID}
      Original event ID: {EVENT.ID}

       

      Trigger expression for the example below:

      {corertr-1:ciscoBgpPeerState[10.22.44.50].last(0)}=6 and {corertr-1:ciscoVPNv4cbgpPeerPrefixAccepted[10.22.44.50].last(0)}<0.8*{corertr-1:ciscoVPNv4cbgpPeerPrefixAccepted[10.22.44.50].avg(86400)}

      (Meaning: the trigger fires if the BGP peer state is established (6) AND the latest number of accepted prefixes is less than 80% of its average over the last 24 hours, i.e. 86400 seconds.)
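
      The condition above can be sketched in Python to check it against the item values reported in the messages. This is only an illustration, not Zabbix code: the function name and the sample history are hypothetical, and the 24-hour average is assumed to be about 4 prefixes, based on the value seen after recovery.

```python
from statistics import mean

ESTABLISHED = 6  # BGP peer state "established", as shown in the item values


def trigger_fires(peer_state, last_prefixes, day_history):
    """Return True when the trigger expression would evaluate to PROBLEM.

    day_history stands in for the values that avg(86400) would average.
    """
    return peer_state == ESTABLISHED and last_prefixes < 0.8 * mean(day_history)


# Hypothetical history: the peer normally has 4 accepted prefixes.
history = [4, 4, 4, 4]

print(trigger_fires(6, 1, history))  # drop to 1 prefix: 1 < 3.2 holds
print(trigger_fires(6, 4, history))  # back to 4 prefixes: 4 < 3.2 does not hold
```

      With these assumed values, a drop to 1 prefix satisfies the expression and a return to 4 prefixes does not, which is consistent with a PROBLEM event followed by a recovery.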

       

      Actual first message received:

      Trigger: BGP peer 10.22.44.50 has lost more than 20% of prefixes
      Trigger status: OK
      Trigger severity: Average
      Host IP: 10.22.0.1
      Hostname: corertr-1
      Event time: 2020.01.16 at 22:10:39

      Item values:
      1. Operational status for peer 10.22.44.50 (corertr-1:ciscoBgpPeerState[10.22.44.50]): established (6)
      2. Accepted prefixes for peer 10.22.44.50 (corertr-1:ciscoVPNv4cbgpPeerPrefixAccepted[10.22.44.50]): 1 Prefix
      3. Accepted prefixes for peer 10.22.44.50 (corertr-1:ciscoVPNv4cbgpPeerPrefixAccepted[10.22.44.50]): 1 Prefix

      Original event ID: 58648908

      (Comment: This trigger is valid because the prefixes dropped from 4 to 1. But why is the state OK and not PROBLEM?)

      Then the recovery message received:

      Trigger: BGP peer 10.22.44.50 has lost more than 20% of prefixes
      Trigger status: OK
      Trigger severity: Average
      Host IP: 10.22.0.1
      Hostname: corertr-1
      Event recovery time: 2020.01.16 at 22:10:39
      Original event time: 2020.01.16 at 22:10:39

      Item values:
      1. Operational status for peer 10.22.44.50 (corertr-1:ciscoBgpPeerState[10.22.44.50]): established (6)
      2. Accepted prefixes for peer 10.22.44.50 (corertr-1:ciscoVPNv4cbgpPeerPrefixAccepted[10.22.44.50]): 4 Prefix
      3. Accepted prefixes for peer 10.22.44.50 (corertr-1:ciscoVPNv4cbgpPeerPrefixAccepted[10.22.44.50]): 4 Prefix

      Recovery event ID: 58648909
      Original event ID: 58648908

      Observations:

      • The original event timestamp and the recovery event timestamp are the same.
      • There is no PROBLEM status at all: the first message reports OK and the recovery message also reports OK.
      • (I believe these two observations are connected: the first status is OK only if the recovery happens at the same time.)
      • I believe the trigger event itself is valid: according to the item values, the number of prefixes dropped momentarily.
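
      The suspected link between the first two observations can be illustrated with a small Python model. It is a sketch under a stated assumption, not Zabbix's actual implementation: it assumes that {TRIGGER.STATUS} is resolved from the trigger's current state when the message is rendered, not from the event that caused the message. The names Trigger, Event and render_status are hypothetical.

```python
from dataclasses import dataclass

PROBLEM, OK = "PROBLEM", "OK"


@dataclass
class Trigger:
    status: str = OK  # current state, changes as new values arrive


@dataclass
class Event:
    event_id: int
    clock: str
    value: str  # state recorded when the event was created


def render_status(trigger: Trigger) -> str:
    # Assumption: the macro reads the trigger's current state at render time.
    return f"Trigger status: {trigger.status}"


trigger = Trigger()

# Both state changes are processed within the same second.
trigger.status = PROBLEM
problem_event = Event(58648908, "22:10:39", trigger.status)
trigger.status = OK
recovery_event = Event(58648909, "22:10:39", trigger.status)

# Messages are rendered only after both events exist.
first_message = render_status(trigger)
recovery_message = render_status(trigger)

print(problem_event.value)  # the stored event is still a PROBLEM event
print(first_message)
print(recovery_message)
```

      Under this assumption both messages would show OK even though the first event was recorded as a PROBLEM event, which would match the two messages received. Whether the server actually behaves this way is exactly what this report asks.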

      Questions:

      • How is it possible that the event is triggered and resolved at the same time?
      • Why is {TRIGGER.STATUS} OK in the first message, and not PROBLEM as usual?

      I have made similar observations with other, non-BGP metrics as well. This happens every now and then, but I do not have other examples at hand.

          People

            dgoloscapov Dmitrijs Goloscapovs
            markkul Markku Leiniö