-
Problem report
-
Resolution: Won't fix
-
Trivial
-
None
-
4.0.15, 5.0.10
-
None
-
Debian 9 Stretch, 4 vCPU, 8 GB, 240 NVPS, database on a separate VM
-
Sprint 77 (Jun 2021), Sprint 78 (Jul 2021), Sprint 79 (Aug 2021), Sprint 80 (Sep 2021), Sprint 81 (Oct 2021), Sprint 82 (Nov 2021), Sprint 83 (Dec 2021), Sprint 84 (Jan 2022)
-
0.1
In some cases a trigger produces no PROBLEM notification at all: the first message reports trigger status OK, and it is followed by a recovery message that also reports OK.
Configured action operations message:
Trigger: {EVENT.NAME}
Trigger status: {TRIGGER.STATUS}
Trigger severity: {TRIGGER.SEVERITY}
Host IP: {HOST.IP}
Hostname: {HOST.HOST}
Event time: {EVENT.DATE} at {EVENT.TIME}
Item values:
1. {ITEM.NAME1} ({HOST.NAME1}:{ITEM.KEY1}): {ITEM.VALUE1}
2. {ITEM.NAME2} ({HOST.NAME2}:{ITEM.KEY2}): {ITEM.VALUE2}
3. {ITEM.NAME3} ({HOST.NAME3}:{ITEM.KEY3}): {ITEM.VALUE3}
Original event ID: {EVENT.ID}
Configured action recovery operations message:
Trigger: {EVENT.NAME}
Trigger status: {TRIGGER.STATUS}
Trigger severity: {TRIGGER.SEVERITY}
Host IP: {HOST.IP}
Hostname: {HOST.HOST}
Event recovery time: {EVENT.RECOVERY.DATE} at {EVENT.RECOVERY.TIME}
Original event time: {EVENT.DATE} at {EVENT.TIME}
Item values:
1. {ITEM.NAME1} ({HOST.NAME1}:{ITEM.KEY1}): {ITEM.VALUE1}
2. {ITEM.NAME2} ({HOST.NAME2}:{ITEM.KEY2}): {ITEM.VALUE2}
3. {ITEM.NAME3} ({HOST.NAME3}:{ITEM.KEY3}): {ITEM.VALUE3}
Recovery event ID: {EVENT.RECOVERY.ID}
Original event ID: {EVENT.ID}
Trigger expression for the example below:
{corertr-1:ciscoBgpPeerState[10.22.44.50].last(0)}=6 and {corertr-1:ciscoVPNv4cbgpPeerPrefixAccepted[10.22.44.50].last(0)}<0.8*{corertr-1:ciscoVPNv4cbgpPeerPrefixAccepted[10.22.44.50].avg(86400)}
(meaning: fire if the BGP peer state is established AND the number of accepted prefixes is below 80% of its average over the last 24 hours)
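The logic of this expression can be sketched in Python as follows. This is only an illustration of the condition, not how Zabbix evaluates it; the function name `prefix_loss_trigger` and the 24-hour average of 4.0 are assumptions, since the actual average is not given in this report.

```python
ESTABLISHED = 6  # ciscoBgpPeerState value reported as "established (6)"


def prefix_loss_trigger(peer_state: int, last_prefixes: float, avg_prefixes_1d: float) -> bool:
    """Return True when the trigger expression would evaluate to PROBLEM."""
    return peer_state == ESTABLISHED and last_prefixes < 0.8 * avg_prefixes_1d


# Prefix counts taken from the messages below (drop from 4 to 1).
# Assumed 24-hour average: 4.0, so the threshold is 0.8 * 4.0 = 3.2.
print(prefix_loss_trigger(6, 1, 4.0))  # 1 < 3.2, expression is true
print(prefix_loss_trigger(6, 4, 4.0))  # 4 < 3.2 is false, expression is false
```

With these assumed numbers the expression is true for 1 prefix and false again for 4 prefixes, which is consistent with a problem event followed by a recovery.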
Actual first message received:
Trigger: BGP peer 10.22.44.50 has lost more than 20% of prefixes
Trigger status: OK
Trigger severity: Average
Host IP: 10.22.0.1
Hostname: corertr-1
Event time: 2020.01.16 at 22:10:39
Item values:
1. Operational status for peer 10.22.44.50 (corertr-1:ciscoBgpPeerState[10.22.44.50]): established (6)
2. Accepted prefixes for peer 10.22.44.50 (corertr-1:ciscoVPNv4cbgpPeerPrefixAccepted[10.22.44.50]): 1 Prefix
3. Accepted prefixes for peer 10.22.44.50 (corertr-1:ciscoVPNv4cbgpPeerPrefixAccepted[10.22.44.50]): 1 Prefix
Original event ID: 58648908
(Comment: The trigger firing is valid, because the accepted prefixes dropped from 4 to 1. But why is the status OK and not PROBLEM?)
Then the recovery message received:
Trigger: BGP peer 10.22.44.50 has lost more than 20% of prefixes
Trigger status: OK
Trigger severity: Average
Host IP: 10.22.0.1
Hostname: corertr-1
Event recovery time: 2020.01.16 at 22:10:39
Original event time: 2020.01.16 at 22:10:39
Item values:
1. Operational status for peer 10.22.44.50 (corertr-1:ciscoBgpPeerState[10.22.44.50]): established (6)
2. Accepted prefixes for peer 10.22.44.50 (corertr-1:ciscoVPNv4cbgpPeerPrefixAccepted[10.22.44.50]): 4 Prefix
3. Accepted prefixes for peer 10.22.44.50 (corertr-1:ciscoVPNv4cbgpPeerPrefixAccepted[10.22.44.50]): 4 Prefix
Recovery event ID: 58648909
Original event ID: 58648908
Observations:
- The original event time and the event recovery time are identical.
- No message reports the PROBLEM status: the first message says OK, and so does the second.
- (I believe these two observations are connected: the first message reports OK only when the recovery happens at the same time as the problem.)
- I believe the trigger firing itself is valid: according to the item values, the number of prefixes dropped momentarily.
Questions:
- How can the event be triggered and resolved at the same time?
- Why is {TRIGGER.STATUS} OK in the first message, and not PROBLEM as usual?
I have seen the same behavior with other, non-BGP metrics as well. It happens every now and then, but I don't have any other examples at hand right now.