[ZBX-17194] Trigger goes first OK and then recovers to OK, instead of PROBLEM and then OK Created: 2020 Jan 17  Updated: 2024 Apr 10  Resolved: 2022 Jan 25

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Server (S)
Affects Version/s: 4.0.15, 5.0.10
Fix Version/s: None

Type: Problem report Priority: Trivial
Reporter: Markku Leiniö Assignee: Zabbix Development Team
Resolution: Won't fix Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Debian 9 Stretch, 4 vCPU, 8 GB, 240 NVPS, database on a separate VM


Team: Team D
Sprint: Sprint 77 (Jun 2021), Sprint 78 (Jul 2021), Sprint 79 (Aug 2021), Sprint 80 (Sep 2021), Sprint 81 (Oct 2021), Sprint 82 (Nov 2021), Sprint 83 (Dec 2021), Sprint 84 (Jan 2022)
Story Points: 0.1

 Description   

In some cases a trigger produces no PROBLEM event; instead there is an OK event, followed by another OK event.

Configured action operations message:

Trigger: {EVENT.NAME}
Trigger status: {TRIGGER.STATUS}
Trigger severity: {TRIGGER.SEVERITY}
Host IP: {HOST.IP}
Hostname: {HOST.HOST}
Event time: {EVENT.DATE} at {EVENT.TIME}

Item values:
1. {ITEM.NAME1} ({HOST.NAME1}:{ITEM.KEY1}): {ITEM.VALUE1}
2. {ITEM.NAME2} ({HOST.NAME2}:{ITEM.KEY2}): {ITEM.VALUE2}
3. {ITEM.NAME3} ({HOST.NAME3}:{ITEM.KEY3}): {ITEM.VALUE3}

Original event ID: {EVENT.ID}

Configured action recovery operations message:

Trigger: {EVENT.NAME}
Trigger status: {TRIGGER.STATUS}
Trigger severity: {TRIGGER.SEVERITY}
Host IP: {HOST.IP}
Hostname: {HOST.HOST}
Event recovery time: {EVENT.RECOVERY.DATE} at {EVENT.RECOVERY.TIME}
Original event time: {EVENT.DATE} at {EVENT.TIME}

Item values:
1. {ITEM.NAME1} ({HOST.NAME1}:{ITEM.KEY1}): {ITEM.VALUE1}
2. {ITEM.NAME2} ({HOST.NAME2}:{ITEM.KEY2}): {ITEM.VALUE2}
3. {ITEM.NAME3} ({HOST.NAME3}:{ITEM.KEY3}): {ITEM.VALUE3}

Recovery event ID: {EVENT.RECOVERY.ID}
Original event ID: {EVENT.ID}

 

Trigger expression for the example below:

{corertr-1:ciscoBgpPeerState[10.22.44.50].last(0)}=6 and
{corertr-1:ciscoVPNv4cbgpPeerPrefixAccepted[10.22.44.50].last(0)}<0.8*{corertr-1:ciscoVPNv4cbgpPeerPrefixAccepted[10.22.44.50].avg(86400)}

(meaning: trigger if the BGP peer state is established AND the accepted prefixes are less than 80% of the average prefix count over the last day)
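
(Worked example with the values from the first comment below, shown only as an illustration: the one-day average was roughly 4 accepted prefixes, so the threshold is 0.8 * 4 = 3.2; the last value of 1 accepted prefix is below that while the peer state is still established (6), hence the trigger condition is met.)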

 

Actual first message received:

Trigger: BGP peer 10.22.44.50 has lost more than 20% of prefixes
Trigger status: OK
Trigger severity: Average
Host IP: 10.22.0.1
Hostname: corertr-1
Event time: 2020.01.16 at 22:10:39

Item values:
1. Operational status for peer 10.22.44.50 (corertr-1:ciscoBgpPeerState[10.22.44.50]): established (6)
2. Accepted prefixes for peer 10.22.44.50 (corertr-1:ciscoVPNv4cbgpPeerPrefixAccepted[10.22.44.50]): 1 Prefix
3. Accepted prefixes for peer 10.22.44.50 (corertr-1:ciscoVPNv4cbgpPeerPrefixAccepted[10.22.44.50]): 1 Prefix

Original event ID: 58648908

(Comment: This trigger is valid because the prefixes dropped from 4 to 1. But why is the state OK and not PROBLEM?)

Then the recovery message received:

Trigger: BGP peer 10.22.44.50 has lost more than 20% of prefixes
Trigger status: OK
Trigger severity: Average
Host IP: 10.22.0.1
Hostname: corertr-1
Event recovery time: 2020.01.16 at 22:10:39
Original event time: 2020.01.16 at 22:10:39

Item values:
1. Operational status for peer 10.22.44.50 (corertr-1:ciscoBgpPeerState[10.22.44.50]): established (6)
2. Accepted prefixes for peer 10.22.44.50 (corertr-1:ciscoVPNv4cbgpPeerPrefixAccepted[10.22.44.50]): 4 Prefix
3. Accepted prefixes for peer 10.22.44.50 (corertr-1:ciscoVPNv4cbgpPeerPrefixAccepted[10.22.44.50]): 4 Prefix

Recovery event ID: 58648909
Original event ID: 58648908

Observations:

  • The original event timestamp and the recovery event timestamp are the same.
  • There is no PROBLEM state at all: the first message says OK, and the recovery message says OK again.
  • (I believe those two observations are interconnected: the first state is OK only when the recovery happens at the same time.)
  • I believe the trigger event is valid as such: the number of prefixes did drop momentarily, according to the item values.

Questions:

  • How is it possible that the event is triggered and resolved at the same time?
  • Why is TRIGGER.STATUS OK in the first message, and not PROBLEM as usual?

I have similar observations with other metrics as well (non-BGP related). This happens every now and then, but I don't have other examples at hand right now.



 Comments   
Comment by Markku Leiniö [ 2020 Jan 17 ]

I need to give more information about this. These are the item values (5-minute interval):

2020-01-16 22:15:39 1
2020-01-16 22:10:39 1
2020-01-16 22:05:39 4
2020-01-16 22:00:39 4

According to that data, the recovery information above was simply incorrect: at 22:10:39 there was only 1 prefix, even though the recovery event said "4 Prefix".

 

What I missed is that I also got a second alert at 22:15:39:

 

Trigger: BGP peer 10.22.44.50 has lost more than 20% of prefixes
Trigger status: PROBLEM
Trigger severity: Average
Host IP: 10.22.0.1
Hostname: corertr-1
Event time: 2020.01.16 at 22:15:39
Item values:
1. Operational status for peer 10.22.44.50 (corertr-1:ciscoBgpPeerState[10.22.44.50]): established (6)
2. Accepted prefixes for peer 10.22.44.50 (corertr-1:ciscoVPNv4cbgpPeerPrefixAccepted[10.22.44.50]): 1 Prefix
3. Accepted prefixes for peer 10.22.44.50 (corertr-1:ciscoVPNv4cbgpPeerPrefixAccepted[10.22.44.50]): 1 Prefix
Original event ID: 58649658

 

See: there is the PROBLEM state now, with 1 prefix (correct according to the item data).

This recovered only at 22:55:39, when the item value got back to 4 again (also verified from the item values list), which is fine:

 

Trigger: BGP peer 10.22.44.50 has lost more than 20% of prefixes
Trigger status: OK
Trigger severity: Average
Host IP: 10.22.0.1
Hostname: corertr-1
Event recovery time: 2020.01.16 at 22:55:39
Original event time: 2020.01.16 at 22:15:39
Item values:
1. Operational status for peer 10.22.44.50 (corertr-1:ciscoBgpPeerState[10.22.44.50]): established (6)
2. Accepted prefixes for peer 10.22.44.50 (corertr-1:ciscoVPNv4cbgpPeerPrefixAccepted[10.22.44.50]): 4 Prefix
3. Accepted prefixes for peer 10.22.44.50 (corertr-1:ciscoVPNv4cbgpPeerPrefixAccepted[10.22.44.50]): 4 Prefix
Recovery event ID: 58654409
Original event ID: 58649658


Any ideas what could have caused the first event pair (OK+OK) and why it "recovered" with incorrect item data in the action message at 22:10:39?

Comment by Markku Leiniö [ 2020 Jan 26 ]

Due to the upcoming MariaDB upgrade I decided to go to Zabbix 4.4.4 to prevent the problems described in https://support.zabbix.com/browse/ZBX-16757. I'll let you know if this OK+OK problem continues.

Comment by Markku Leiniö [ 2020 Jan 30 ]

The same "OK+OK" problem occurs also on 4.4.4:

Trigger: No prefixes received for peer 10.22.17.86
Trigger status: OK
Trigger severity: High
Host IP: 10.22.0.2
Hostname: corertr-2
Event time: 2020.01.30 at 06:40:38
Item values:
1. Operational status for peer 10.22.17.86 (corertr-2:ciscoBgpPeerState[10.22.17.86]): established (6)
2. Accepted prefixes for peer 10.22.17.86 (corertr-2:ciscoVPNv4cbgpPeerPrefixAccepted[10.22.17.86]): 0 Prefix
Original event ID: 60901593

 

Trigger: No prefixes received for peer 10.22.17.86
Trigger status: OK
Trigger severity: High
Host IP: 10.22.0.2
Hostname: corertr-2
Event recovery time: 2020.01.30 at 06:40:38
Original event time: 2020.01.30 at 06:40:38
Item values:
1. Operational status for peer 10.22.17.86 (corertr-2:ciscoBgpPeerState[10.22.17.86]): idle (1)
2. Accepted prefixes for peer 10.22.17.86 (corertr-2:ciscoVPNv4cbgpPeerPrefixAccepted[10.22.17.86]): 0 Prefix
Recovery event ID: 60901597
Original event ID: 60901593

Trigger expression for this was:

{corertr-2:ciscoBgpPeerState[10.22.17.86].last(0)}=6 and
{corertr-2:ciscoVPNv4cbgpPeerPrefixAccepted[10.22.17.86].last(0)}=0

The values for the items were:

ciscoBgpPeerState:
2020-01-30 06:45:38	idle (1)
2020-01-30 06:40:38	idle (1)
2020-01-30 06:35:38	established (6)
2020-01-30 06:30:38	established (6)

ciscoVPNv4cbgpPeerPrefixAccepted:
2020-01-30 06:45:39	0
2020-01-30 06:40:38	0
2020-01-30 06:35:38	17
2020-01-30 06:30:38	17

Do note that the peer state value shown in the trigger activation message is incorrect: it shows established (6) even though the item values list shows idle (1) at 06:40:38.

Comment by Markku Leiniö [ 2021 Feb 22 ]

By the way, this is still a problem with 5.0.x.

Comment by Markku Leiniö [ 2021 Apr 06 ]

This is not an SNMP-specific problem: today I got similar OK+OK events from icmpping items, with 5.0.10.

Comment by Dmitry Krupornitsky [ 2021 May 17 ]

Hello, Markku

What is the current state of this issue for you? If it still occurs, something may be wrong with the trigger's recovery configuration. Let's focus on the icmpping case: could you please provide all the data you have on this issue?

Comment by Markku Leiniö [ 2021 May 17 ]

Hi, thanks for getting back on this. The state is that a problem event saying OK still occurs occasionally. Currently still running a 5.0.10 server.

For the ping case on 2021-04-06, these were the action results:

Problem starts:

Trigger: Host is not pinging
Trigger status: OK
Trigger severity: High
Host IP: 8.8.8.8
Hostname: Google-DNS-from-DUB
Event time: 2021.04.06 at 06:44:47

Item values:
1. Ping (Google-DNS-from-DUB:icmpping[,3,500,68,1000]): Down (0)
2. *UNKNOWN* (*UNKNOWN*:*UNKNOWN*): *UNKNOWN*
3. *UNKNOWN* (*UNKNOWN*:*UNKNOWN*): *UNKNOWN*

Original event ID: 79482048

(Trigger status = {TRIGGER.STATUS})

Recovery:

Trigger: Host is not pinging
Trigger status: OK
Trigger severity: High
Host IP: 8.8.8.8
Hostname: Google-DNS-from-DUB
Event recovery time: 2021.04.06 at 06:44:57
Original event time: 2021.04.06 at 06:44:47

Item values:
1. Ping (Google-DNS-from-DUB:icmpping[,3,500,68,1000]): Up (1)
2. *UNKNOWN* (*UNKNOWN*:*UNKNOWN*): *UNKNOWN*
3. *UNKNOWN* (*UNKNOWN*:*UNKNOWN*): *UNKNOWN*

Recovery event ID: 79482051
Original event ID: 79482048

There was a total of three similar OK+OK ping events at the same time (within three seconds), for three different hosts monitored by the same active proxy.

Trigger expression:

 

{Google-DNS-from-DUB:icmpping[,{$I_PING_PACKETS:PING},{$I_PING_MSECINTERVAL:PING},{$I_PING_SIZE:PING},{$I_PING_TIMEOUT:PING}].max({$T_PING_PING_TIME})}=0
Macros:
{$I_PING_PACKETS} = 3
{$I_PING_MSECINTERVAL} = 500
{$I_PING_SIZE} = 68
{$I_PING_TIMEOUT} = 1000
{$T_PING_PING_TIME} = #2

OK event generation = Expression

PROBLEM event generation mode = Single
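
With the macro values above substituted in, the expression resolves to the following (shown only as an illustration; it matches the item key in the messages above):

{Google-DNS-from-DUB:icmpping[,3,500,68,1000].max(#2)}=0

That is, the trigger fires when both of the last two ping results are Down (0).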

Logs from the server (5.0.10):

   874:20210406:064351.375 sending configuration data to proxy "Proxy-aws-dub-1" at "10.x.x.x", datalen 11343
   848:20210406:064527.580 SNMP agent item "ifOperStatus[tunnel.347]" on host "fw-1" failed: first network error, wait for 15 seconds
   863:20210406:064546.008 SNMP agent item "ifOutUcastPkts[tunnel.123]" on host "fw-1" failed: another network error, wait for 15 seconds
   863:20210406:064605.024 SNMP agent item "ifInUcastPkts[tunnel.157]" on host "fw-1" failed: another network error, wait for 15 seconds
   863:20210406:064624.039 temporarily disabling SNMP agent checks on host "fw-1": host unavailable
   863:20210406:064828.178 enabling SNMP agent checks on host "fw-1": host became available
   876:20210406:065851.682 sending configuration data to proxy "Proxy-aws-dub-1" at "10.x.x.x", datalen 11343

Logs from the proxy (5.0.10):

149745:20210406:064351.398 received configuration data from server at "172.x.x.x", datalen 11343
149745:20210406:065851.705 received configuration data from server at "172.x.x.x", datalen 11343

What I conclude from the logs (and from the fact that three separate hosts suffered the same OK+OK case within a couple of seconds) is that there was a network/firewall issue between the Zabbix server and the proxy: the SNMP agent checks (to the firewall that sits between the server and the proxy, and also between the proxy and the pinged hosts) were failing, according to the server logs, within a minute of the ping problems.

Most of the OK+OK cases still come from SNMP monitoring done by the Zabbix server; this ping case was very much an exception.

Anyway: Zabbix detected the problems, but the TRIGGER.STATUS macro said "OK" for some reason.

Comment by Dmitry Krupornitsky [ 2021 May 17 ]

What about the time: is it the same on all proxies/agents? Is it possible to get history data for 2021.04.06 06:44:40 - 06:48:00 for icmpping?

Comment by Markku Leiniö [ 2021 May 17 ]

Server time is NTP-synced on both the server and the proxy, to the local providers' official time sources, and by manual checks they look identical (if there is a difference, it is below 0.5 seconds).

Unfortunately there is no history data anymore for that April event (according to the Latest data values list).

Comment by Dmitry Krupornitsky [ 2021 May 17 ]

Could you please update this ticket with actual data as soon as it occurs next time? I will leave it open.

We may also need the following:

1) history data before and after event

2) trigger expression and screenshots from server config

3) logs from server/agent around the event

4) all other info you may find informative

My guess is that a quick change from the icmp fail state to the OK state brings the trigger into an inconsistent state, but if it is a bug, I can't clearly see how it occurs.

Comment by Markku Leiniö [ 2021 May 19 ]

Many "OK+OK" events this morning, SNMP only, here is one:

Start time: 2021-05-19 05:07:21
Recovery time: 2021-05-19 05:07:21
Host: site1-corertr-2
Trigger: BGP session with peer site2-corertr-1 (Inet) (10.1.22.18) has been restarted

Trigger expression:

{site1-corertr-2:BgpPeerEstablishedTime.VPN_A[10.1.22.18].change()}<0 and
{site1-corertr-2:BgpPeerState.VPN_A[10.1.22.18].last(0)}=6

Items:

BgpPeerEstablishedTime.VPN_A[10.1.22.18] = SNMPv2 1.3.6.1.2.1.15.3.1.16.10.1.22.18, numeric (unsigned), units = uptime, update interval = 5m, no preprocessing

BgpPeerState.VPN_A[10.1.22.18] = SNMPv2 .1.3.6.1.2.1.15.3.1.2.10.1.22.18, numeric (unsigned), update interval = 5m, preprocessing: Discard unchanged with heartbeat = 6h

History for BgpPeerEstablishedTime.VPN_A[10.1.22.18]:

2021-05-19 05:17:20	535
2021-05-19 05:12:19	234
2021-05-19 05:07:21	0
2021-05-19 05:02:20	6643828
2021-05-19 04:57:20	6643528
2021-05-19 04:52:20	6643228

History for BgpPeerState.VPN_A[10.1.22.18]:

2021-05-19 05:27:16	established (6)
2021-05-19 05:22:20	idle (1)
2021-05-19 05:12:19	established (6)
2021-05-19 05:07:21	idle (1)
2021-05-19 01:22:20	established (6)
2021-05-18 19:17:21	established (6)

Zabbix server logs at the time of the event:

   875:20210519:045828.590 sending configuration data to proxy "Proxy-aws-dub-1" at "10.3.99.5", datalen 11343
   877:20210519:050043.818 sending configuration data to proxy "Proxy-site2-v6proxy-1" at "10.2.99.4", datalen 17043
   851:20210519:050432.022 SNMP agent item "net.if.status[ifAdminStatus.Gi0]" on host "site2-inetrtr-1" failed: first network error, wait for 15 seconds
   871:20210519:050451.724 SNMP agent item "net.if.status[ifAdminStatus.Gi0]" on host "site2-inetrtr-1" failed: another network error, wait for 15 seconds
   878:20210519:050455.478 sending configuration data to proxy "Proxy-aws-sto-1" at "10.4.99.5", datalen 10385
   866:20210519:050510.730 SNMP agent item "net.if.subif.status[ifHCOutOctets.Po10.802]" on host "site2-inetrtr-1" failed: another network error, wait for 15 seconds
   871:20210519:050514.731 SNMP agent item "net.if.subif.status[ifOperStatus.Po10.802]" on host "site2-inetrtr-1" failed: another network error, wait for 15 seconds
   866:20210519:050533.738 temporarily disabling SNMP agent checks on host "site2-inetrtr-1": host unavailable
   864:20210519:050856.992 enabling SNMP agent checks on host "site2-inetrtr-1": host became available
   855:20210519:050901.951 SNMP agent item "net.if.subif.status[ifOperStatus.Po10.800]" on host "site2-inetrtr-1" failed: first network error, wait for 15 seconds
   865:20210519:050919.234 resuming SNMP agent checks on host "site2-inetrtr-1": connection restored
   865:20210519:050926.424 SNMP agent item "net.if.status[ifOperStatus.Gi0/0/1]" on host "site2-inetrtr-1" failed: first network error, wait for 15 seconds
   856:20210519:050943.641 resuming SNMP agent checks on host "site2-inetrtr-1": connection restored
   878:20210519:051328.884 sending configuration data to proxy "Proxy-aws-dub-1" at "10.3.99.5", datalen 11343
   875:20210519:051543.903 sending configuration data to proxy "Proxy-site2-v6proxy-1" at "10.2.99.4", datalen 17043
   876:20210519:051955.644 sending configuration data to proxy "Proxy-aws-sto-1" at "10.4.99.5", datalen 10385

(Note about the logs: these SNMP agent errors are expected because the device was being rebooted)

No Zabbix agents or proxies involved in this case.

Other information:

At the same time (05:07:21) four other events were triggered (due to a device being rebooted, causing events on that device and on nearby devices). One of them also resulted in this "OK+OK" within the same second:

"Severity","Time","Recovery time","Status","Host","Problem","Duration"
"Average","2021-05-19 05:07:21","2021-05-19 05:12:21","RESOLVED","site1-corertr-2","BGP peer site2-corertr-1 VPN_C (10.1.29.6) has lost more than 20% of prefixes","5m"
"Average","2021-05-19 05:07:21","2021-05-19 05:07:21","RESOLVED","site1-corertr-2","BGP session with peer site2-corertr-1 (Inet) (10.1.22.18) has been restarted","0"
"High","2021-05-19 05:07:21","2021-05-19 05:12:19","RESOLVED","site1-corertr-2","BGP peer site2-inetrtr-1 (10.1.22.2) is DOWN","4m 58s"
"High","2021-05-19 05:07:21","2021-05-19 05:12:19","RESOLVED","site1-corertr-2","BGP peer site2-corertr-1 (Inet) (10.1.22.18) is DOWN","4m 58s"
"High","2021-05-19 05:07:21","2021-05-19 05:07:21","RESOLVED","site1-corertr-2","No prefixes received for peer site2-corertr-1 (Inet) (10.1.22.18)","0"

(= two of these five were "OK+OK" cases (those with a duration of 0), and three were normal "PROBLEM+OK" cases)

My notes/conclusions:

I don't see anything strange in the server logs or in the environment behaviour in general.

Later that same morning we got several more of these "OK+OK" cases, all from SNMP items (just like the ones I described when opening the ticket).

The ping case we checked earlier was exceptional: I could not find any other "OK+OK" cases for icmpping items.

We get these SNMP-related "OK+OK" cases a few times each month, basically whenever these BGP SNMP monitoring items cause events.

These have been observed with Zabbix server versions 4.0.15, 4.4.4 (and later 4.4.x), and 5.0.x (currently 5.0.10). If I missed something specific, let me know.

Comment by Markku Leiniö [ 2021 May 19 ]

The current server is Debian Linux 10 (Buster) with nearly up-to-date packages (security updates are installed automatically, other updates roughly monthly).

ii zabbix-agent 1:5.0.10-1+buster amd64 Zabbix network monitoring solution - agent
ii zabbix-apache-conf 1:5.0.10-1+buster all Zabbix network monitoring solution - apache configuration for front-end
ii zabbix-frontend-php 1:5.0.10-1+buster all Zabbix network monitoring solution - PHP front-end
ii zabbix-release 1:5.0-1+buster all Zabbix official repository configuration
ii zabbix-server-mysql 1:5.0.10-1+buster amd64 Zabbix network monitoring solution - server (MySQL)

Comment by Markku Leiniö [ 2021 May 29 ]

To me it looks like the program logic has a problem in cases where one item of the trigger expression gets a new value (which raises the problem event) AND another item in the same expression gets a new value right after the first one, still within the same second (which clears the event immediately).

Maybe the problem event is still queued somewhere in the application logic (waiting for the action to be executed) while the OK event has already been announced, and when the action executor picks up the original problem event it sets the TRIGGER.STATUS macro value according to the current state of the trigger, not the state the event was queued with. Just guessing here.

I have heard that triggers are evaluated immediately whenever even one item in the expression gets a new value, so if the other item receives its new value just a split second later, the problem is really short in duration.

Comment by Markku Leiniö [ 2021 May 30 ]

Yes, this is repeatable, even with a new Zabbix 5.4 installation (from repo.zabbix.com on Debian 10):

  1. Configure media type Email to send the emails (using proper email server information in your network), add "Status: {TRIGGER.STATUS}" to message templates "Problem" and "Problem recovery"
  2. Configure user Admin to have your email address
  3. Enable action "Report problems to Zabbix administrators"
  4. Add host "Testhost"  (group "Test", doesn't matter)
  5. For Testhost add item "Testitem", type = Zabbix trapper, key = testitem
  6. For Testhost add trigger "Testitem is zero", expression = last(/Testhost/testitem)=0
  7. Refresh server config or wait two minutes for the server to catch up with the config change
  8. Install zabbix-sender
  9. Test the item: zabbix_sender -z 127.0.0.1 -s Testhost -k testitem -o 1
  10. Check that the data shows up in Latest data (troubleshoot and repeat if necessary)
  11. Test the trigger: zabbix_sender -z 127.0.0.1 -s Testhost -k testitem -o 0
  12. Make sure that you receive the alert email ("Problem started at ..."), containing "Status: PROBLEM" (troubleshoot email if necessary)
  13. Reset the trigger: zabbix_sender -z 127.0.0.1 -s Testhost -k testitem -o 1
  14. See that you receive the email ("Problem has been resolved at ...") containing "Status: OK"
  15. Now send these two values rapidly (for example from a shell script, as in the sketch below; the problem is also repeatable from the shell prompt if the commands are entered quickly):
    1. zabbix_sender -z 127.0.0.1 -s Testhost -k testitem -o 0
    2. zabbix_sender -z 127.0.0.1 -s Testhost -k testitem -o 1
  16. See that you receive two emails:
    1. Problem email: "Status: OK" (which is incorrect)
    2. Resolved in 0s (or 1s) email: "Status: OK"
  17. Check the Latest data for Testitem:
    2021-05-30 11:18:17  1
    2021-05-30 11:18:16  0

= The data is valid: first 0 (triggering the event) and then 1 (they can even be within the same second), but TRIGGER.STATUS shows an incorrect value ("OK") in the problem email.

Using a shell script I could see that a "sleep 2" between sending the 0 and the 1 values still causes this OK+OK to happen; a "sleep 3" (three-second delay) was required to get PROBLEM+OK.
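
A minimal reproduction sketch along these lines (assuming the Testhost/testitem trapper setup from the steps above and zabbix_sender available on the server; the DELAY argument reflects the sleep experiments mentioned above):

#!/bin/sh
# Reproduce the OK+OK case against the Testhost/testitem trapper item.
# With a DELAY of 0-2 seconds the problem email said "Status: OK";
# a DELAY of 3 seconds was needed to get the expected "Status: PROBLEM".
DELAY=${1:-0}
zabbix_sender -z 127.0.0.1 -s Testhost -k testitem -o 0   # raise the problem
sleep "$DELAY"
zabbix_sender -z 127.0.0.1 -s Testhost -k testitem -o 1   # resolve it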

So this definitely looks like an action executor problem in the Zabbix server: the TRIGGER.STATUS value is assigned late, during action execution, and not based on the actual event being handled.

As a workaround, I see we need to change our message templates: instead of using TRIGGER.STATUS in the messages we need to hard-code the text ("PROBLEM" or "OK") in the problem and recovery templates.

Let me know if you still need more information about this problem.

Comment by Markku Leiniö [ 2021 May 30 ]

But I also have a question about the problem shown in the comment above (https://support.zabbix.com/browse/ZBX-17194?focusedCommentId=540636&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-540636): now that the values for the items referenced in the trigger expression are evidently received and handled a split second apart in the SNMP queries (and the trigger fires between the two values), how can we prevent these kinds of zero-second events from happening?

Or is there no way to prevent zero-second problems when a trigger expression references two (or more) items that really change at the same time, but Zabbix receives and handles them sequentially?

Comment by Dmitrijs Goloscapovs [ 2021 Aug 31 ]

Hello markkul,

I would suggest using {EVENT.STATUS} instead of {TRIGGER.STATUS}: since it expands to the value of the event itself, it will prevent the OK+OK situation.
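
For illustration, only the status lines in the message templates from the description would need to change (a minimal sketch; everything else stays as-is):

Trigger status: {EVENT.STATUS}

In a problem message this macro resolves from the problem event itself, so it shows PROBLEM even if the trigger has already recovered by the time the action is executed.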

Comment by Mickael Martin [ 2021 Oct 11 ]

I have got the same issue on 5.2.7.

Comment by Andris Zeila [ 2022 Jan 11 ]

The {TRIGGER.STATUS} macro returns the trigger status at the moment of escalation, not the trigger status at the event time. If a trigger fires and recovers in short succession, it may well have recovered already before the problem escalation is processed, and {TRIGGER.STATUS} will then return the latest status: OK.

To get the trigger status at the event time, please use the {EVENT.STATUS} macro, as dgoloscapov mentioned.

Comment by Alexander Vladishev [ 2022 Jan 25 ]

Updated documentation:
 - Supported macros for 6.0, 5.4 and 5.0

Comment by Mickael Martin [ 2022 Jan 25 ]

A simple solution... Thank you very much!
