  ZABBIX BUGS AND ISSUES / ZBX-17194

Trigger first reports OK and then recovers to OK, instead of reporting PROBLEM and then OK

    Details

    • Type: Problem report
    • Status: Open
    • Priority: Trivial
    • Resolution: Unresolved
    • Affects Version/s: 4.0.15
    • Fix Version/s: None
    • Component/s: Server (S)
    • Labels:
      None
    • Environment:
      Debian 9 Stretch, 4 vCPU, 8 GB, 240 NVPS, database on a separate VM

      Description

      In some cases a trigger produces no PROBLEM event; instead, an OK event is reported, followed by another OK event.

      Configured action operations message:

      Trigger: {EVENT.NAME}
      Trigger status: {TRIGGER.STATUS}
      Trigger severity: {TRIGGER.SEVERITY}
      Host IP: {HOST.IP}
      Hostname: {HOST.HOST}
      Event time: {EVENT.DATE} at {EVENT.TIME}

      Item values:
      1. {ITEM.NAME1} ({HOST.NAME1}:{ITEM.KEY1}): {ITEM.VALUE1}
      2. {ITEM.NAME2} ({HOST.NAME2}:{ITEM.KEY2}): {ITEM.VALUE2}
      3. {ITEM.NAME3} ({HOST.NAME3}:{ITEM.KEY3}): {ITEM.VALUE3}

      Original event ID: {EVENT.ID}

      Configured action recovery operations message:

      Trigger: {EVENT.NAME}
      Trigger status: {TRIGGER.STATUS}
      Trigger severity: {TRIGGER.SEVERITY}
      Host IP: {HOST.IP}
      Hostname: {HOST.HOST}
      Event recovery time: {EVENT.RECOVERY.DATE} at {EVENT.RECOVERY.TIME}
      Original event time: {EVENT.DATE} at {EVENT.TIME}

      Item values:
      1. {ITEM.NAME1} ({HOST.NAME1}:{ITEM.KEY1}): {ITEM.VALUE1}
      2. {ITEM.NAME2} ({HOST.NAME2}:{ITEM.KEY2}): {ITEM.VALUE2}
      3. {ITEM.NAME3} ({HOST.NAME3}:{ITEM.KEY3}): {ITEM.VALUE3}

      Recovery event ID: {EVENT.RECOVERY.ID}
      Original event ID: {EVENT.ID}

       

      Trigger expression for the example below:

      {corertr-1:ciscoBgpPeerState[10.22.44.50].last(0)}=6 and {corertr-1:ciscoVPNv4cbgpPeerPrefixAccepted[10.22.44.50].last(0)}<0.8*{corertr-1:ciscoVPNv4cbgpPeerPrefixAccepted[10.22.44.50].avg(86400)}

      (Meaning: the trigger fires if the BGP peer state is established AND the number of accepted prefixes is below 80% of its average over the last day, i.e. 86400 seconds.)
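
      As a rough illustration of this condition, here is a minimal Python sketch. It is not Zabbix code: the function name and the one-day average of 4.0 are assumptions made for the example (the report only states that the prefixes dropped from 4 to 1).

```python
# Value of ciscoBgpPeerState shown as "established (6)" in the messages.
ESTABLISHED = 6


def trigger_fires(peer_state, last_prefixes, avg_prefixes_1d):
    """Mirror the expression: state = 6 and last(0) < 0.8 * avg(86400)."""
    return peer_state == ESTABLISHED and last_prefixes < 0.8 * avg_prefixes_1d


# The one-day average is not given in the report; 4.0 is assumed here.
print(trigger_fires(6, 1, 4.0))  # 1 < 3.2, condition met
print(trigger_fires(6, 4, 4.0))  # 4 < 3.2 is false, condition not met
```

      With these assumed numbers, a momentary drop to 1 prefix satisfies the condition, and a return to 4 prefixes clears it on the next evaluation.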

       

      Actual first message received:

      Trigger: BGP peer 10.83.44.50 has lost more than 20% of prefixes
      Trigger status: OK
      Trigger severity: Average
      Host IP: 10.22.0.1
      Hostname: corertr-1
      Event time: 2020.01.16 at 22:10:39

      Item values:
      1. Operational status for peer 10.22.44.50 (corertr-1:ciscoBgpPeerState[10.22.44.50]): established (6)
      2. Accepted prefixes for peer 10.22.44.50 (corertr-1:ciscoVPNv4cbgpPeerPrefixAccepted[10.22.44.50]): 1 Prefix
      3. Accepted prefixes for peer 10.22.44.50 (corertr-1:ciscoVPNv4cbgpPeerPrefixAccepted[10.22.44.50]): 1 Prefix

      Original event ID: 58648908

      (Comment: The trigger condition was genuinely met, because the prefixes dropped from 4 to 1. But why is the status OK and not PROBLEM?)

      Recovery message received afterwards:

      Trigger: BGP peer 10.83.22.50 has lost more than 20% of prefixes
      Trigger status: OK
      Trigger severity: Average
      Host IP: 10.22.0.1
      Hostname: corertr-1
      Event recovery time: 2020.01.16 at 22:10:39
      Original event time: 2020.01.16 at 22:10:39

      Item values:
      1. Operational status for peer 10.22.44.50 (corertr-1:ciscoBgpPeerState[10.22.44.50]): established (6)
      2. Accepted prefixes for peer 10.22.44.50 (corertr-1:ciscoVPNv4cbgpPeerPrefixAccepted[10.22.44.50]): 4 Prefix
      3. Accepted prefixes for peer 10.22.44.50 (corertr-1:ciscoVPNv4cbgpPeerPrefixAccepted[10.22.44.50]): 4 Prefix

      Recovery event ID: 58648909
      Original event ID: 58648908

      Observations:

      • The original event time and the event recovery time are identical.
      • There is no PROBLEM status at all: the first message says OK, and so does the recovery message.
      • (I believe these two observations are connected: the first status is OK only if the recovery happens at the same time.)
      • I believe the trigger event itself is valid: according to the item values, the number of prefixes dropped momentarily.
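
      The hypothesis in the third observation can be modeled with a toy Python sketch. It assumes that the status placeholder is filled from the trigger's current state at the moment the message is sent, rather than from the state stored with the event. That is an assumption for illustration only, not confirmed Zabbix server behavior, and all names below are hypothetical.

```python
from dataclasses import dataclass

PROBLEM, OK = "PROBLEM", "OK"


@dataclass
class Trigger:
    status: str = OK


def render(template, trigger, event_status):
    # Hypothetical renderer: "trigger_status" reads the live trigger state
    # at send time, "event_status" is the value frozen in the event.
    return template.format(trigger_status=trigger.status,
                           event_status=event_status)


def simulate(recovered_before_send):
    trigger = Trigger()
    trigger.status = PROBLEM      # problem event is created
    if recovered_before_send:
        trigger.status = OK       # recovery arrives before the message is sent
    template = "Trigger status: {trigger_status} (event: {event_status})"
    return render(template, trigger, PROBLEM)


print(simulate(recovered_before_send=False))
print(simulate(recovered_before_send=True))
```

      Under this assumption, the problem message shows OK only when the trigger has already recovered by the time the message is rendered, which would match the identical timestamps above.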

      Questions:

      • How is it possible that the event is triggered and resolved at the same time?
      • Why is {TRIGGER.STATUS} OK in the first message, and not PROBLEM as usual?

      I have seen the same behavior with other, non-BGP metrics as well. It happens every now and then, but I don't have other examples at hand right now.

            People

            Assignee:
            dmitrijs.lamberts Dmitrijs Lamberts
            Reporter:
            markkul Markku Leiniö