[ZBX-17194] Trigger goes first OK and then recovers to OK, instead of PROBLEM and then OK
Created: 2020 Jan 17  Updated: 2024 Apr 10  Resolved: 2022 Jan 25
Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Server (S)
Affects Version/s: 4.0.15, 5.0.10
Fix Version/s: None
Type: Problem report
Priority: Trivial
Reporter: Markku Leiniö
Assignee: Zabbix Development Team
Resolution: Won't fix
Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Debian 9 Stretch, 4 vCPU, 8 GB, 240 NVPS, database on a separate VM
Team:
Sprint: Sprint 77 (Jun 2021), Sprint 78 (Jul 2021), Sprint 79 (Aug 2021), Sprint 80 (Sep 2021), Sprint 81 (Oct 2021), Sprint 82 (Nov 2021), Sprint 83 (Dec 2021), Sprint 84 (Jan 2022)
Story Points: 0.1
Description
In some cases there is no PROBLEM event for a trigger, only an OK event followed by another OK event.

Configured action operations message:

Trigger: {EVENT.NAME}
Item values:
Original event ID: {EVENT.ID}

Configured action recovery operations message:

Trigger: {EVENT.NAME}
Item values:
Recovery event ID: {EVENT.RECOVERY.ID}

Trigger expression for the example below:

{corertr-1:ciscoBgpPeerState[10.22.44.50].last(0)}=6 and {corertr-1:ciscoVPNv4cbgpPeerPrefixAccepted[10.22.44.50].last(0)}<0.8*{corertr-1:ciscoVPNv4cbgpPeerPrefixAccepted[10.22.44.50].avg(86400)}

(meaning: trigger if the BGP peer state is established AND the accepted prefixes are less than 80% of the average prefix count during the last day)

Actual first message received:

Trigger: BGP peer 10.83.44.50 has lost more than 20% of prefixes
Item values:
Original event ID: 58648908

(Comment: This trigger is valid because the prefixes dropped from 4 to 1. But why is the state OK and not PROBLEM?)

Then the recovery message received:

Trigger: BGP peer 10.83.22.50 has lost more than 20% of prefixes
Item values:
Recovery event ID: 58648909

Observations:
Questions:
I have similar observations with other metrics as well (non-BGP related). This happens every now and then, but I don't have any other examples at hand right now.
Comments
Comment by Markku Leiniö [ 2020 Jan 17 ]

I have to give more information about this. These are the item values (5-minute interval):

2020-01-16 22:15:39    1

According to that data, the recovery information above was simply incorrect: at 22:10:39 there was only 1 prefix, even though the recovery event said "4 Prefix".

What I missed is that I also got a second alert at 22:15:39:
Trigger: BGP peer 10.22.44.50 has lost more than 20% of prefixes
Trigger status: PROBLEM
Trigger severity: Average
Host IP: 10.22.0.1
Hostname: corertr-1
Event time: 2020.01.16 at 22:15:39
Item values:
1. Operational status for peer 10.22.44.50 (corertr-1:ciscoBgpPeerState[10.22.44.50]): established (6)
2. Accepted prefixes for peer 10.22.44.50 (corertr-1:ciscoVPNv4cbgpPeerPrefixAccepted[10.22.44.50]): 1 Prefix
3. Accepted prefixes for peer 10.22.44.50 (corertr-1:ciscoVPNv4cbgpPeerPrefixAccepted[10.22.44.50]): 1 Prefix
Original event ID: 58649658
See: the PROBLEM state is there now, with 1 prefix (correct according to the item data). This recovered only at 22:55:39, when the item value got back to 4 (verified from the item values list as well); that part is OK:
Trigger: BGP peer 10.22.44.50 has lost more than 20% of prefixes
Trigger status: OK
Trigger severity: Average
Host IP: 10.22.0.1
Hostname: corertr-1
Event recovery time: 2020.01.16 at 22:55:39
Original event time: 2020.01.16 at 22:15:39
Item values:
1. Operational status for peer 10.22.44.50 (corertr-1:ciscoBgpPeerState[10.22.44.50]): established (6)
2. Accepted prefixes for peer 10.22.44.50 (corertr-1:ciscoVPNv4cbgpPeerPrefixAccepted[10.22.44.50]): 4 Prefix
3. Accepted prefixes for peer 10.22.44.50 (corertr-1:ciscoVPNv4cbgpPeerPrefixAccepted[10.22.44.50]): 4 Prefix
Recovery event ID: 58654409
Original event ID: 58649658
Any ideas what could have caused the first event pair (OK+OK), and why it "recovered" with incorrect item data in the action message at 22:10:39?
Comment by Markku Leiniö [ 2020 Jan 26 ]

Due to the upcoming MariaDB upgrade I decided to move to Zabbix 4.4.4 to prevent the problems described in https://support.zabbix.com/browse/ZBX-16757. I'll let you know if this OK+OK problem continues.
Comment by Markku Leiniö [ 2020 Jan 30 ]

The same "OK+OK" problem also occurs on 4.4.4:

Trigger: No prefixes received for peer 10.22.17.86
Trigger status: OK
Trigger severity: High
Host IP: 10.22.0.2
Hostname: corertr-2
Event time: 2020.01.30 at 06:40:38
Item values:
1. Operational status for peer 10.22.17.86 (corertr-2:ciscoBgpPeerState[10.22.17.86]): established (6)
2. Accepted prefixes for peer 10.22.17.86 (corertr-2:ciscoVPNv4cbgpPeerPrefixAccepted[10.22.17.86]): 0 Prefix
Original event ID: 60901593

Trigger: No prefixes received for peer 10.22.17.86
Trigger status: OK
Trigger severity: High
Host IP: 10.22.0.2
Hostname: corertr-2
Event recovery time: 2020.01.30 at 06:40:38
Original event time: 2020.01.30 at 06:40:38
Item values:
1. Operational status for peer 10.22.17.86 (corertr-2:ciscoBgpPeerState[10.22.17.86]): idle (1)
2. Accepted prefixes for peer 10.22.17.86 (corertr-2:ciscoVPNv4cbgpPeerPrefixAccepted[10.22.17.86]): 0 Prefix
Recovery event ID: 60901597
Original event ID: 60901593

Trigger expression for this was:

{corertr-2:ciscoBgpPeerState[10.22.17.86].last(0)}=6 and {corertr-2:ciscoVPNv4cbgpPeerPrefixAccepted[10.22.17.86].last(0)}=0

The values for the items were:

ciscoBgpPeerState:
2020-01-30 06:45:38  idle (1)
2020-01-30 06:40:38  idle (1)
2020-01-30 06:35:38  established (6)
2020-01-30 06:30:38  established (6)

ciscoVPNv4cbgpPeerPrefixAccepted:
2020-01-30 06:45:39  0
2020-01-30 06:40:38  0
2020-01-30 06:35:38  17
2020-01-30 06:30:38  17

Do note that the peer state value shown in the trigger activation message is incorrect: it shows established (6) even though the item values list shows idle (1) at 06:40:38.
Comment by Markku Leiniö [ 2021 Feb 22 ]

Btw, this is still a problem with 5.0.x.
Comment by Markku Leiniö [ 2021 Apr 06 ]

This is not an SNMP-specific problem; today I got similar OK+OK events from icmpping items, with 5.0.10.
Comment by Dmitry Krupornitsky [ 2021 May 17 ]

Hello, Markku. What is the current state of this issue for you? If it still occurs, something is wrong with the recovery option of the trigger. Let's focus on the icmpping case: could you please provide all the data you have on this issue?
Comment by Markku Leiniö [ 2021 May 17 ]

Hi, thanks for getting back on this. The state is that there is still occasionally a problem event that says OK. We are currently still running the 5.0.10 server.

The ping case on 2021-04-06, the action results:

Problem starts:

Trigger: Host is not pinging
Trigger status: OK
Trigger severity: High
Host IP: 8.8.8.8
Hostname: Google-DNS-from-DUB
Event time: 2021.04.06 at 06:44:47
Item values:
1. Ping (Google-DNS-from-DUB:icmpping[,3,500,68,1000]): Down (0)
2. *UNKNOWN* (*UNKNOWN*:*UNKNOWN*): *UNKNOWN*
3. *UNKNOWN* (*UNKNOWN*:*UNKNOWN*): *UNKNOWN*
Original event ID: 79482048

(Trigger status = {TRIGGER.STATUS})

Recovery:

Trigger: Host is not pinging
Trigger status: OK
Trigger severity: High
Host IP: 8.8.8.8
Hostname: Google-DNS-from-DUB
Event recovery time: 2021.04.06 at 06:44:57
Original event time: 2021.04.06 at 06:44:47
Item values:
1. Ping (Google-DNS-from-DUB:icmpping[,3,500,68,1000]): Up (1)
2. *UNKNOWN* (*UNKNOWN*:*UNKNOWN*): *UNKNOWN*
3. *UNKNOWN* (*UNKNOWN*:*UNKNOWN*): *UNKNOWN*
Recovery event ID: 79482051
Original event ID: 79482048

There was a total of three similar OK+OK ping events at the same time (within three seconds) for three different hosts, all monitored by the same active proxy.

Trigger expression:

{Google-DNS-from-DUB:icmpping[,{$I_PING_PACKETS:PING},{$I_PING_MSECINTERVAL:PING},{$I_PING_SIZE:PING},{$I_PING_TIMEOUT:PING}].max({$T_PING_PING_TIME})}=0

Macros:
{$I_PING_PACKETS} = 3
{$I_PING_MSECINTERVAL} = 500
{$I_PING_SIZE} = 68
{$I_PING_TIMEOUT} = 1000
{$T_PING_PING_TIME} = #2

OK event generation = Expression
PROBLEM event generation mode = Single

Logs from the server (5.0.10):

874:20210406:064351.375 sending configuration data to proxy "Proxy-aws-dub-1" at "10.x.x.x", datalen 11343
848:20210406:064527.580 SNMP agent item "ifOperStatus[tunnel.347]" on host "fw-1" failed: first network error, wait for 15 seconds
863:20210406:064546.008 SNMP agent item "ifOutUcastPkts[tunnel.123]" on host "fw-1" failed: another network error, wait for 15 seconds
863:20210406:064605.024 SNMP agent item "ifInUcastPkts[tunnel.157]" on host "fw-1" failed: another network error, wait for 15 seconds
863:20210406:064624.039 temporarily disabling SNMP agent checks on host "fw-1": host unavailable
863:20210406:064828.178 enabling SNMP agent checks on host "fw-1": host became available
876:20210406:065851.682 sending configuration data to proxy "Proxy-aws-dub-1" at "10.x.x.x", datalen 11343

Logs from the proxy (5.0.10):

149745:20210406:064351.398 received configuration data from server at "172.x.x.x", datalen 11343
149745:20210406:065851.705 received configuration data from server at "172.x.x.x", datalen 11343

What I conclude from the logs (and from the fact that three separate hosts suffered the same OK+OK case within a couple of seconds) is that there was a network/firewall issue between the Zabbix server and the proxy: according to the server logs, the SNMP agent checks (to the firewall that sits between the server and the proxy, and also between the proxy and the pinged hosts) were failing within a minute of the ping problems. Most of the OK+OK cases still come from SNMP monitoring done by the Zabbix server; this ping case was very exceptional.

Anyway: Zabbix detected the problems, but the TRIGGER.STATUS macro said "OK" for some reason.
Comment by Dmitry Krupornitsky [ 2021 May 17 ]

What about the time: is it all the same on all proxies/agents? Is it possible to get the history data for 2021.04.06 at 06:44:40 - 2021.04.06 at 06:48:00 for icmpping?
Comment by Markku Leiniö [ 2021 May 17 ]

Server time is NTP-synced on both the server and the proxy, to the local providers' official time sources, and by manual checks they look identical (= if there is a difference, it is below 0.5 seconds). Unfortunately there is no history data anymore for that event in April (according to the Latest data values list).
Comment by Dmitry Krupornitsky [ 2021 May 17 ]

Could you please update this ticket with actual data as soon as it occurs next time; I will leave the ticket open. We may also need the following:
1) history data before and after the event
2) the trigger expression and screenshots from the server config
3) logs from the server/agent around the event
4) any other info you may find informative
I suspect it is a quick change from the icmp fail state to the OK state, so the trigger comes to an inconsistent state, but if it is a bug, I can't clearly understand how it occurs.
Comment by Markku Leiniö [ 2021 May 19 ]
Many "OK+OK" events this morning, SNMP only. Here is one:

Start time: 2021-05-19 05:07:21
Recovery time: 2021-05-19 05:07:21
Host: site1-corertr-2
Trigger: BGP session with peer site2-corertr-1 (Inet) (10.1.22.18) has been restarted

Trigger expression:

{site1-corertr-2:BgpPeerEstablishedTime.VPN_A[10.1.22.18].change()}<0 and {site1-corertr-2:BgpPeerState.VPN_A[10.1.22.18].last(0)}=6

Items:
BgpPeerEstablishedTime.VPN_A[10.1.22.18] = SNMPv2 1.3.6.1.2.1.15.3.1.16.10.1.22.18, numeric (unsigned), units = uptime, update interval = 5m, no preprocessing
BgpPeerState.VPN_A[10.1.22.18] = SNMPv2 .1.3.6.1.2.1.15.3.1.2.10.1.22.18, numeric (unsigned), update interval = 5m, preprocessing: Discard unchanged with heartbeat = 6h

History for BgpPeerEstablishedTime.VPN_A[10.1.22.18]:
2021-05-19 05:17:20  535
2021-05-19 05:12:19  234
2021-05-19 05:07:21  0
2021-05-19 05:02:20  6643828
2021-05-19 04:57:20  6643528
2021-05-19 04:52:20  6643228

History for BgpPeerState.VPN_A[10.1.22.18]:
2021-05-19 05:27:16  established (6)
2021-05-19 05:22:20  idle (1)
2021-05-19 05:12:19  established (6)
2021-05-19 05:07:21  idle (1)
2021-05-19 01:22:20  established (6)
2021-05-18 19:17:21  established (6)

Zabbix server logs at the time of the event:

875:20210519:045828.590 sending configuration data to proxy "Proxy-aws-dub-1" at "10.3.99.5", datalen 11343
877:20210519:050043.818 sending configuration data to proxy "Proxy-site2-v6proxy-1" at "10.2.99.4", datalen 17043
851:20210519:050432.022 SNMP agent item "net.if.status[ifAdminStatus.Gi0]" on host "site2-inetrtr-1" failed: first network error, wait for 15 seconds
871:20210519:050451.724 SNMP agent item "net.if.status[ifAdminStatus.Gi0]" on host "site2-inetrtr-1" failed: another network error, wait for 15 seconds
878:20210519:050455.478 sending configuration data to proxy "Proxy-aws-sto-1" at "10.4.99.5", datalen 10385
866:20210519:050510.730 SNMP agent item "net.if.subif.status[ifHCOutOctets.Po10.802]" on host "site2-inetrtr-1" failed: another network error, wait for 15 seconds
871:20210519:050514.731 SNMP agent item "net.if.subif.status[ifOperStatus.Po10.802]" on host "site2-inetrtr-1" failed: another network error, wait for 15 seconds
866:20210519:050533.738 temporarily disabling SNMP agent checks on host "site2-inetrtr-1": host unavailable
864:20210519:050856.992 enabling SNMP agent checks on host "site2-inetrtr-1": host became available
855:20210519:050901.951 SNMP agent item "net.if.subif.status[ifOperStatus.Po10.800]" on host "site2-inetrtr-1" failed: first network error, wait for 15 seconds
865:20210519:050919.234 resuming SNMP agent checks on host "site2-inetrtr-1": connection restored
865:20210519:050926.424 SNMP agent item "net.if.status[ifOperStatus.Gi0/0/1]" on host "site2-inetrtr-1" failed: first network error, wait for 15 seconds
856:20210519:050943.641 resuming SNMP agent checks on host "site2-inetrtr-1": connection restored
878:20210519:051328.884 sending configuration data to proxy "Proxy-aws-dub-1" at "10.3.99.5", datalen 11343
875:20210519:051543.903 sending configuration data to proxy "Proxy-site2-v6proxy-1" at "10.2.99.4", datalen 17043
876:20210519:051955.644 sending configuration data to proxy "Proxy-aws-sto-1" at "10.4.99.5", datalen 10385

(Note about the logs: these SNMP agent errors are expected because the device was being rebooted.)

No Zabbix agents or proxies were involved in this case.

Other information: at the same time (05:07:21) there were four other events triggered (due to a device being rebooted, causing events on that device and on nearby devices).
One of them also resulted in this "OK+OK" at the same second:

"Severity","Time","Recovery time","Status","Host","Problem","Duration"
"Average","2021-05-19 05:07:21","2021-05-19 05:12:21","RESOLVED","site1-corertr-2","BGP peer site2-corertr-1 VPN_C (10.1.29.6) has lost more than 20% of prefixes","5m"
"Average","2021-05-19 05:07:21","2021-05-19 05:07:21","RESOLVED","site1-corertr-2","BGP session with peer site2-corertr-1 (Inet) (10.1.22.18) has been restarted","0"
"High","2021-05-19 05:07:21","2021-05-19 05:12:19","RESOLVED","site1-corertr-2","BGP peer site2-inetrtr-1 (10.1.22.2) is DOWN","4m 58s"
"High","2021-05-19 05:07:21","2021-05-19 05:12:19","RESOLVED","site1-corertr-2","BGP peer site2-corertr-1 (Inet) (10.1.22.18) is DOWN","4m 58s"
"High","2021-05-19 05:07:21","2021-05-19 05:07:21","RESOLVED","site1-corertr-2","No prefixes received for peer site2-corertr-1 (Inet) (10.1.22.18)","0"

(= two of these five were "OK+OK" cases (the ones with a duration of 0), and three were normal "PROBLEM+OK" cases)

My notes/conclusions: I don't see anything strange in the server logs or in the environment behaviour in general. Later the same morning we got several other "OK+OK" cases as well, all from SNMP items (just like the ones I described when opening the ticket). The ping case we checked earlier was exceptional: I could not find any other "OK+OK" cases for icmpping items. We get these SNMP-related "OK+OK" cases a few times each month, basically whenever these BGP SNMP monitoring items cause events. They have been observed with Zabbix server versions 4.0.15, 4.4.4 (and later 4.4.x) and 5.0.x (currently 5.0.10). If I missed something specific, let me know.
Comment by Markku Leiniö [ 2021 May 19 ]

The current server is Debian Linux 10 (Buster), with almost the latest updates (= security updates installed automatically, other updates installed roughly monthly).

ii  zabbix-agent  1:5.0.10-1+buster  amd64  Zabbix network monitoring solution - agent
Comment by Markku Leiniö [ 2021 May 29 ]

To me it looks like the program logic has a problem in cases where one item of the problem expression gets a new value (which leads to the problem event being raised) AND another item in the same expression gets a new value right after the first one, still within the same second (which leads to the event being cleared immediately). Maybe the problem event is still queued somewhere in the application logic (waiting for the action to be executed) while the OK event has already been announced, and when the action executer picks up the original problem event it sets the TRIGGER.STATUS macro value according to the current state of the event, not the state that the event was queued with. Just guessing here. I have heard that triggers are evaluated immediately whenever even one item of the expression has a new value, so if the other item receives its new value just a split-second later, the problem is really short in duration.
Comment by Markku Leiniö [ 2021 May 30 ]
Yes, this is repeatable, even with a new Zabbix 5.4 installation (from repo.zabbix.com on Debian 10):
= The data is valid: first 0 (triggering the event) and then 1 (they can also be in the same second), but TRIGGER.STATUS shows an incorrect value ("OK") in the problem email. Using a shell script I could see that having a "sleep 2" between sending the 0 and 1 values still causes this OK+OK to happen, while "sleep 3" (a three-second delay) was required to get PROBLEM+OK. So this definitely looks like an action executer problem in the Zabbix server: the TRIGGER.STATUS value is assigned late, during the action execution, and not based on the actual event that is being handled. As a workaround, I see we need to change our message templates: instead of using TRIGGER.STATUS in the messages, we need to hard-code the text ("PROBLEM" or "OK") in the problem and recovery templates. Let me know if you still need more information about this problem.
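The shell-script reproduction described above might look roughly like the sketch below. This is only an illustration of the technique, not the reporter's actual script: the server address, host name, and trapper item key are placeholders, and it assumes a Zabbix trapper item whose trigger fires on value 0 and resolves on value 1.

```shell
#!/bin/sh
# Hedged sketch of the OK+OK reproduction. All names are placeholders:
# adjust ZABBIX_SERVER, HOST and KEY for your own environment.
ZABBIX_SERVER="127.0.0.1"
HOST="testhost"     # a host configured with a Zabbix trapper item
KEY="test.trap"     # trapper item key; assumed trigger: last() = 0 fires

# DRY_RUN=1 (the default here) only prints the commands instead of
# executing them, so the sketch can be read without a running server.
DRY_RUN="${DRY_RUN:-1}"

send() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "zabbix_sender -z $ZABBIX_SERVER -s $HOST -k $KEY -o $1"
    else
        zabbix_sender -z "$ZABBIX_SERVER" -s "$HOST" -k "$KEY" -o "$1"
    fi
}

send 0      # raises the problem event
sleep 2     # per the report, a 2-second gap still produced OK+OK
send 1      # recovers it; the problem mail's {TRIGGER.STATUS} reads "OK"
```

Per the observation above, changing `sleep 2` to `sleep 3` produced the expected PROBLEM+OK pair, which fits the theory that {TRIGGER.STATUS} is resolved at escalation time rather than at event time.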
Comment by Markku Leiniö [ 2021 May 30 ]

But I also have a question about the problem shown in the comment above (https://support.zabbix.com/browse/ZBX-17194?focusedCommentId=540636&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-540636): now that the values for the items mentioned in the trigger expression are evidently received and handled a split-second apart in the SNMP queries (and the trigger is evaluated between the values), how can we prevent these kinds of zero-second events from happening? Or is there no way to prevent zero-second problems when a trigger expression references two (or more) items that really change at the same time but Zabbix receives and handles them sequentially?
Comment by Dmitrijs Goloscapovs [ 2021 Aug 31 ]

Hello markkul, I would suggest using {EVENT.STATUS} instead of {TRIGGER.STATUS}: since it expands to the value of the event, it will prevent the OK+OK situation.
Comment by Mickael Martin [ 2021 Oct 11 ]

I have the same issue on 5.2.7.
Comment by Andris Zeila [ 2022 Jan 11 ]

The {TRIGGER.STATUS} macro returns the trigger status at the moment of escalation, not the trigger status at the event time. If a trigger fires and recovers shortly afterwards, it is quite possible that it has already recovered before the problem escalation is processed, and {TRIGGER.STATUS} will return the latest status - OK. To get the trigger status at the event time, please use the {EVENT.STATUS} macro, as dgoloscapov mentioned.
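Following that advice, a corrected action message template might look like the sketch below. This is an illustrative fragment, not the reporter's actual template; the field layout mirrors the messages quoted earlier in this ticket, and the macros shown are standard Zabbix macros (verify them against the documentation for your version).

```
Subject: {EVENT.STATUS}: {EVENT.NAME}

Trigger: {EVENT.NAME}
Trigger status: {EVENT.STATUS}    <- event-time value, unlike {TRIGGER.STATUS}
Trigger severity: {EVENT.SEVERITY}
Host IP: {HOST.IP}
Hostname: {HOST.NAME}
Event time: {EVENT.DATE} at {EVENT.TIME}
Original event ID: {EVENT.ID}
```

With {EVENT.STATUS}, the problem message always reads "PROBLEM" and the recovery message "RESOLVED", regardless of how quickly the trigger recovers after the event is created.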
Comment by Alexander Vladishev [ 2022 Jan 25 ]

Updated documentation:
Comment by Mickael Martin [ 2022 Jan 25 ]

A simple solution... Thank you very much!