[ZBX-16628] "first network error" stops monitoring other items on the same host Created: 2019 Sep 12  Updated: 2024 Apr 10  Resolved: 2020 Feb 01

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: None
Affects Version/s: 4.0.11
Fix Version/s: 4.0.13rc1, 4.2.7rc1, 4.4.0beta1, 4.4 (plan)

Type: Problem report Priority: Trivial
Reporter: Kazuo Ito Assignee: Andrejs Tumilovics
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Latest_data_01.png     PNG File diagram.png     PNG File test1.png     PNG File test1_02.png     PNG File test2.png     PNG File test2_02.png     PNG File test3.png     PNG File test3_02.png     File zabbix_server.conf     Text File zabbix_server.log    
Team: Team C
Sprint: Sprint 56 (Sep 2019), Sprint 57 (Oct 2019), Sprint 58 (Nov 2019), Sprint 59 (Dec 2019), Sprint 60 (Jan 2020)
Story Points: 0.5

 Description   

1) Create 3 items

Name                : test1
key                 : system.cpu.util[,idle,avg5]
Type of information : Numeric(float)
Update interval     : 3m

Name                : test2
key                 : vm.memory.size[available]
Type of information : Numeric(unsigned)
Units               : B
Update interval     : 3m


Name                : test3
key                 : agent.ping
Type of information : Numeric(unsigned)
Update interval     : 4m

2) Confirm in "Latest data" that monitoring has started.
system.cpu.util[,idle,avg5]

  • 2019/09/12 15:23:53 98.8194
  • 2019/09/12 15:20:53 98.4406
  • 2019/09/12 15:17:53 96.221

vm.memory.size[available]

  • 2019/09/12 15:23:54 2402906112
  • 2019/09/12 15:20:54 2405601280
  • 2019/09/12 15:17:54 2403467264

agent.ping

  • 2019/09/12 15:20:55 1
  • 2019/09/12 15:16:55 1

3) Stop the Zabbix agent

4) Check the log with the tail command

]# tail -f /var/log/zabbix/zabbix_server.log 
  3421:20190912:152542.713 server #15 started [poller #3]
  3422:20190912:152542.724 server #16 started [poller #4]
  3423:20190912:152542.735 server #17 started [poller #5]
  3427:20190912:152542.845 server #20 started [alerter #1]
  3428:20190912:152542.847 server #21 started [preprocessing manager #1]
  3425:20190912:152542.849 server #18 started [unreachable poller #1]
  3426:20190912:152542.859 server #19 started [alert manager #1]
  3432:20190912:152543.061 server #24 started [preprocessing worker #3]
  3429:20190912:152543.079 server #22 started [preprocessing worker #1]
  3430:20190912:152543.080 server #23 started [preprocessing worker #2]
  3419:20190912:152653.108 Zabbix agent item "system.cpu.util[,idle,avg5]" on host "Zabbix server" failed: first network error, wait for 15 seconds

5) Start the Zabbix agent as soon as "first network error" is logged.

  3419:20190912:152653.108 Zabbix agent item "system.cpu.util[,idle,avg5]" on host "Zabbix server" failed: first network error, wait for 15 seconds
  3425:20190912:152953.309 resuming Zabbix agent checks on host "Zabbix server": connection restored

Why doesn't it work after 15 seconds?
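
For reference, the delay between the two log lines above can be measured directly from their timestamps. This is a minimal sketch that assumes the server log prefix has the form `PID:YYYYMMDD:HHMMSS.mmm`; the helper name `parse_log_time` is hypothetical.

```python
from datetime import datetime


def parse_log_time(line: str) -> datetime:
    """Extract the timestamp from a 'PID:YYYYMMDD:HHMMSS.mmm message' log line."""
    _pid, date_part, rest = line.strip().split(":", 2)
    time_part = rest.split(" ", 1)[0]
    return datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S.%f")


first_error = (
    '3419:20190912:152653.108 Zabbix agent item "system.cpu.util[,idle,avg5]" '
    'on host "Zabbix server" failed: first network error, wait for 15 seconds'
)
restored = (
    '3425:20190912:152953.309 resuming Zabbix agent checks on host '
    '"Zabbix server": connection restored'
)

gap = parse_log_time(restored) - parse_log_time(first_error)
print(gap.total_seconds())  # about 180.2 seconds, i.e. one 3m update interval rather than 15 seconds
```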

Check the latest data:

test2 (vm.memory.size[available]): there is no history between 15:23 and 15:32.

test3 (agent.ping): there is no history between 15:20 and 15:36.



 Comments   
Comment by richlv [ 2019 Sep 12 ]

In step 5, the first failure timestamp is 15:08:53, but screenshots and log in step 1 have the range starting at 15:26:53 - perhaps the snippet in step 5 is from another test?

Comment by Kazuo Ito [ 2019 Sep 12 ]

I'm sorry, I pasted the result of another test.
Here is the correct one.

  3419:20190912:152653.108 Zabbix agent item "system.cpu.util[,idle,avg5]" on host "Zabbix server" failed: first network error, wait for 15 seconds
  3425:20190912:152953.309 resuming Zabbix agent checks on host "Zabbix server": connection restored

Comment by Andris Zeila [ 2019 Sep 12 ]

To avoid polling items during 'disabled' periods (flexible/scheduled checks), unreachable host handling was changed in ZBX-13579. Instead of shifting all checks to the end of the host unreachability period, checks are polled according to their schedule and simply not processed if the host is unreachable.

That would explain the test1 polling results: its poll failed at 26:53 and succeeded at its next scheduled time, 29:53. test2 could not be polled at its scheduled time, 26:54, because the host was unreachable. However, it should have been polled at 29:54, when the host should have been reachable again after the successful test1 poll at 29:53. I'm not sure this can be explained by internal delays.

And currently I cannot explain why test3 was not polled at 24:55 and then at 32:55, that seems strange.

However, the unreachable host polling logic was changed in ZBX-16230. Now normal and flexible checks are shifted by the host unreachability period (unless the checks are disabled at that time, in which case they are shifted to the end of the disabled period). Scheduled checks are polled according to their schedule and not processed if the host is unreachable at that time (the current behaviour).
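
The difference between the two policies described in this comment can be sketched as follows. This is a simplified illustration, not Zabbix source code: the function names `next_check_skip` and `next_check_shift` are hypothetical, and the "shift" policy is modelled as moving a due check to the moment the host becomes reachable again, which is only one reading of the description.

```python
from datetime import datetime, timedelta


def next_check_skip(due: datetime, interval: timedelta, reachable_from: datetime) -> datetime:
    """Schedule is kept; polls that fall inside the unreachable window are dropped."""
    t = due
    while t < reachable_from:
        t += interval
    return t


def next_check_shift(due: datetime, reachable_from: datetime) -> datetime:
    """A check due while the host is unreachable is moved to the end of that window."""
    return max(due, reachable_from)


# test3 (agent.ping, 4m interval) was due at 15:28:55;
# the log reports the connection as restored at 15:29:53.
due = datetime(2019, 9, 12, 15, 28, 55)
reachable_from = datetime(2019, 9, 12, 15, 29, 53)

skip_result = next_check_skip(due, timedelta(minutes=4), reachable_from)
shift_result = next_check_shift(due, reachable_from)
print(skip_result)   # 2019-09-12 15:32:55
print(shift_result)  # 2019-09-12 15:29:53
```

Under the first policy a slow item can lose a whole update interval after a short outage, while under the second it is polled as soon as the host is reachable.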

Comment by Andris Zeila [ 2019 Sep 16 ]

As I tried to explain in my earlier comment, the behaviour of unreachable checks was changed in ZBX-13579, and then a compromise solution was implemented in ZBX-16230, which restores the old logic unless scheduled checks are involved.

Regarding Zabbix 2.2.19 - it's the same behaviour. When a host becomes unreachable, its items are not monitored for the next UnreachableDelay seconds (15 by default). Note that the next check was performed after this delay:

  32251:20170927:003218.347 Zabbix agent item "system.cpu.switches" on host "Zabbix server" failed: first network error, wait for 15 seconds
  32252:20170927:003233.596 Zabbix agent item "system.cpu.util[,idle]" on host "Zabbix server" failed: another network error, wait for 15 seconds
  32252:20170927:003248.600 resuming Zabbix agent checks on host "Zabbix server": connection restored

The server does not retry the same item after 15 seconds. On the contrary - if another item was scheduled to be checked during those 15 seconds, it takes priority over the failed item once the 15 seconds have passed.

Comment by Oleksii Zagorskyi [ 2019 Sep 16 ]

There was a related change in ZBX-4284 for 2.2.11.

Comment by Kazuo Ito [ 2019 Sep 17 ]

I remember ZBX-4284.
I remember that it was changed again with ZBX-10215.
But I didn't know about ZBX-14417.

Does the ZBX-13579 fix mean that an item is shifted to its next scheduled check when it hits a network error?

The result I confirmed on Zabbix 4.0.11 was as follows.

15:20:55 test3 monitored OK
15:23:53 test1 monitored OK
15:23:53 test1 "first network error"
15:23:54 test2 monitored OK  <- internal delays?
15:23:54 test3 do nothing / reschedule?
15:26:53 test1 do nothing / reschedule?
15:26:54 test2 do nothing / reschedule?
15:28:54 test3 do nothing / reschedule?
15:29:53 connection restored / test1 monitored OK
15:29:54 test2 do nothing   <- why?
15:32:53 test1 monitored OK
15:32:54 test2 monitored OK
15:32:54 test3 do nothing   <- why?
15:36:55 test3 monitored OK

I don't understand why the items were not monitored at 15:29:54 and 15:32:54.


With the change in ZBX-13579, the next check is no longer performed after 15 seconds.

The manual states that:

A host is treated as unreachable after a failed check (network error, timeout) by Zabbix, SNMP, IPMI or JMX agents. Note that Zabbix agent active checks do not influence host availability in any way.

From that moment UnreachableDelay defines how often a host is rechecked using one of the items (including LLD rules) in this unreachability situation and such rechecks will be performed already by unreachable pollers (or IPMI pollers for IPMI checks). By default it is 15 seconds before the next check.

This looks different from what the manual describes.

Comment by Kazuo Ito [ 2019 Sep 17 ]

I stopped the Zabbix agent and waited until "host unavailable" was logged.

  7556:20190917:170715.479 Starting Zabbix Server. Zabbix 4.0.11 (revision 53bb6bc0f0).

  7578:20190917:170715.660 server #13 started [poller #1]

  7586:20190917:170715.752 server #18 started [unreachable poller #1]

  7578:20190917:172659.449 Zabbix agent item "vm.memory.size[available]" on host "Zabbix server" failed: first network error, wait for 15 seconds
  7586:20190917:172959.624 temporarily disabling Zabbix agent checks on host "Zabbix server": host unavailable
  7586:20190917:173259.824 enabling Zabbix agent checks on host "Zabbix server": host became available

system.cpu.util[,idle,avg5] / Update interval 3 minutes

  • 2019/09/17 17:36:00 98.3635
  • 2019/09/17 17:24:00 99.0129

vm.memory.size[available] / Update interval 3 minutes

  • 2019/09/17 17:35:59 2395328512
  • 2019/09/17 17:32:59 2396860416
  • 2019/09/17 17:23:59 2396987392

agent.ping / Update interval 4 minutes

  • 2019/09/17 17:37:01 1
  • 2019/09/17 17:25:01 1

Sorted by time.

17:23:59 test2 monitored ok
17:24:00 test1 monitored ok
17:25:01 test3 monitored ok
17:26:59 test2 "first network error"
17:27:00 test1 nothing
17:29:01 test3 nothing
17:29:59 host unavailable
17:30:00 test1 nothing
17:32:59 host became available / test2 host became available
17:33:00 test1 nothing  <- why?
17:33:01 test3 nothing  <- why?
17:35:59 test2 monitored ok
17:36:00 test1 monitored ok
17:37:01 test3 monitored ok
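
The "nothing" rows in the list above can be derived mechanically from the Latest data values. The sketch below assumes each item is expected on a fixed grid starting at an observed value; the function name `missed_polls` is hypothetical and is not part of any Zabbix tool.

```python
from datetime import datetime, timedelta


def missed_polls(observed, interval):
    """Return expected poll times that fall strictly between consecutive observed values."""
    observed = sorted(observed)
    missed = []
    for prev, nxt in zip(observed, observed[1:]):
        t = prev + interval
        while t < nxt:
            missed.append(t)
            t += interval
    return missed


day = (2019, 9, 17)
test1 = [datetime(*day, 17, 24, 0), datetime(*day, 17, 36, 0)]
test2 = [datetime(*day, 17, 23, 59), datetime(*day, 17, 32, 59), datetime(*day, 17, 35, 59)]
test3 = [datetime(*day, 17, 25, 1), datetime(*day, 17, 37, 1)]

missed1 = missed_polls(test1, timedelta(minutes=3))
missed2 = missed_polls(test2, timedelta(minutes=3))
missed3 = missed_polls(test3, timedelta(minutes=4))

for name, missed in (("test1", missed1), ("test2", missed2), ("test3", missed3)):
    print(name, [t.strftime("%H:%M:%S") for t in missed])
```

For test1 this yields 17:27:00, 17:30:00 and 17:33:00; the last of these falls after the host was reported available at 17:32:59, which is the case marked "why?".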


I tried changing the "Update interval" of the item "test1" to 10 seconds.

  7842:20190917:175633.365 Starting Zabbix Server. Zabbix 4.0.11 (revision 53bb6bc0f0).

  7866:20190917:175633.552 server #13 started [poller #1]

  7871:20190917:175633.607 server #18 started [unreachable poller #1]

  7866:20190917:181440.103 Zabbix agent item "system.cpu.util[,idle,avg5]" on host "Zabbix server" failed: first network error, wait for 15 seconds
  7871:20190917:181500.141 Zabbix agent item "system.cpu.util[,idle,avg5]" on host "Zabbix server" failed: another network error, wait for 15 seconds
  7871:20190917:181520.164 Zabbix agent item "system.cpu.util[,idle,avg5]" on host "Zabbix server" failed: another network error, wait for 15 seconds
  7871:20190917:181540.213 temporarily disabling Zabbix agent checks on host "Zabbix server": host unavailable
  7871:20190917:181640.402 enabling Zabbix agent checks on host "Zabbix server": host became available

For some reason, the state changes every 20 seconds.
"first network error" -> "another network error" -> "another network error" -> "host unavailable"


system.cpu.util[,idle,avg5] / Update interval 10 seconds

vm.memory.size[available] / Update interval 3 minutes

agent.ping / Update interval 4 minutes

Sorted by time.

18:11:59 test2 monitored ok
18:13:01 test3 monitored ok
18:14:30 test1 monitored ok
18:14:40 test1 "first network error"
18:14:59 test2 nothing
18:15:00 test1 "another network error"
18:15:20 test1 "another network error"
18:15:40 host unavailable
18:16:40 host became available / test1 host became available
18:16:50 test1 monitored ok
18:17:00 test1 monitored ok
18:17:01 test3 monitored ok
18:17:59 test2 nothing  <- why?
18:20:59 test2 monitored ok
18:21:01 test3 monitored ok

It seems to me that when an item's scheduled check time falls while the host is unreachable, that check is not performed at all and the item waits for its next interval.
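
One toy model that reproduces the logged timestamps in this ticket is: the failed item is only retried at its own next scheduled time that lies at least the configured delay after the previous attempt. This is an assumption made for illustration, not something confirmed by the developers here. The delays used are the documented defaults of `UnreachableDelay`, `UnreachablePeriod` and `UnavailableDelay`; the names `next_slot` and `simulate` and the `agent_back` restart time are hypothetical.

```python
from datetime import datetime, timedelta

# Documented zabbix_server.conf defaults; the attached config may differ.
UNREACHABLE_DELAY = timedelta(seconds=15)
UNREACHABLE_PERIOD = timedelta(seconds=45)
UNAVAILABLE_DELAY = timedelta(seconds=60)


def next_slot(not_before, anchor, interval):
    """First time on the grid anchor + k*interval (k >= 0) that is >= not_before."""
    steps = -((anchor - not_before) // interval)  # ceiling division
    return anchor + max(steps, 0) * interval


def simulate(first_error, interval, agent_back):
    """Return (time, message) pairs for the failed item's rechecks (hypothetical model)."""
    events = [(first_error, "first network error")]
    t = first_error
    while True:  # unreachable phase
        t = next_slot(t + UNREACHABLE_DELAY, first_error, interval)
        if t >= agent_back:
            events.append((t, "connection restored"))
            return events
        if t - first_error >= UNREACHABLE_PERIOD:
            events.append((t, "host unavailable"))
            break
        events.append((t, "another network error"))
    while True:  # unavailable phase: failed rechecks are not logged in this model
        t = next_slot(t + UNAVAILABLE_DELAY, first_error, interval)
        if t >= agent_back:
            events.append((t, "host became available"))
            return events


# agent_back is an assumed restart time; the ticket does not state it exactly.
run_10s = simulate(
    first_error=datetime(2019, 9, 17, 18, 14, 40),
    interval=timedelta(seconds=10),
    agent_back=datetime(2019, 9, 17, 18, 16, 0),
)
for when, message in run_10s:
    print(when.time(), message)
```

With a 10-second interval the model steps through 18:14:40, 18:15:00, 18:15:20, 18:15:40 and 18:16:40, which would account for the 20-second spacing. With a 3-minute interval and a first error at 17:26:59 it yields "host unavailable" at 17:29:59 and "host became available" at 17:32:59. It does not explain why the other items skipped their first poll after recovery.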

Comment by Alexander Vladishev [ 2020 Feb 01 ]

Updated documentation:

  • "Unreachable/unavailable host settings" page updated in 4.0, 4.4, 5.0 versions