[ZBX-16230] Unreachable host behavior uses item update interval instead of settings for unreachable Created: 2019 Jun 07 Updated: 2024 Apr 10 Resolved: 2020 Jun 04 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Proxy (P), Server (S) |
Affects Version/s: | 4.0.9 |
Fix Version/s: | 4.0.13rc1, 4.2.7rc1, 4.4.0alpha3, 4.4 (plan) |
Type: | Problem report | Priority: | Blocker |
Reporter: | Edgars Melveris | Assignee: | Andrejs Kozlovs |
Resolution: | Fixed | Votes: | 2 |
Labels: | None | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified |
Attachments: | ZBX-16230-4_0.patch queue.PNG zabbix_proxy_ZBX-16230.log | ||||||||||||||||
Issue Links: |
|
||||||||||||||||
Team: | Team C | ||||||||||||||||
Team: | Team C | ||||||||||||||||
Sprint: | Sprint 56 (Sep 2019), Sprint 55 (Aug 2019), Sprint 53 (Jun 2019), Sprint 54 (Jul 2019), Sprint 64 (May 2020) | ||||||||||||||||
Story Points: | 2 |
Description |
I've set up 2 hosts, that monitor the same SNMP host, one directly form server, one from proxy. # UnreachablePeriod=45 # UnavailableDelay=60 # UnreachableDelay=15 Each host has 600 items with update interval 30 minutes. 23083:20190607:163217.557 SNMP agent item "snmp[snmp1]" on host "snmp-test-server" failed: first network error, wait for 15 seconds 23084:20190607:170217.810 temporarily disabling SNMP agent checks on host "snmp-test-server": host unavailable Notice, how host becomes unavailable exactly 30 minutes after the first network error. While the host is in this unknown state, other items are also not checked at all, but the host is shown as available. 1488:20190607:165950.398 Zabbix agent item "net.if.discovery" on host "test" failed: first network error, wait for 15 seconds 1490:20190607:170452.030 temporarily disabling Zabbix agent checks on host "test": host unavailable Another strange thing is, that the queue grows for the proxy, but not for the server. The host is monitored from the proxy and server, when it goes down, only the proxy reports delayed data: |
Comments |
Comment by Edgars Melveris [ 2019 Jun 07 ] |
It seems, that it's sometimes possible to kick the host out of the strange state by initiating "check now" function on any item. |
Comment by Edgars Melveris [ 2019 Jun 10 ] |
Attaching proxy log with debug level to trace for pollers: zabbix_proxy_ZBX-16230.log This was done on a host with 600 items with update interval 30 min and 1 item with 5 min. Some important parts from it: 14498:20190610:100258.433 __zbx_zbx_setproctitle() title:'poller #1 [got 1 values in 0.008875 sec, getting values]' 14498:20190610:100258.433 In get_values() 14498:20190610:100258.433 In DCconfig_get_poller_items() poller_type:0 14498:20190610:100258.433 End of DCconfig_get_poller_items():1 14498:20190610:100258.434 In substitute_key_macros() data:'snmp[snmp20]' 14498:20190610:100258.434 End of substitute_key_macros():SUCCEED data:'snmp[snmp20]' 14498:20190610:100258.434 In substitute_simple_macros() data:'161' 14498:20190610:100258.434 In substitute_simple_macros() data:'public' 14498:20190610:100258.434 In substitute_key_macros() data:'SNMPv2-MIB::sysName.0' 14498:20190610:100258.434 End of substitute_key_macros():SUCCEED data:'SNMPv2-MIB::sysName.0' 14498:20190610:100258.434 In get_values_snmp() host:'snmp-test-proxy' addr:'127.0.0.1' num:1 14498:20190610:100258.434 In zbx_snmp_open_session() 14498:20190610:100258.434 SNMP [public@127.0.0.1:161] 14498:20190610:100258.434 End of zbx_snmp_open_session() 14498:20190610:100258.434 In zbx_snmp_process_standard() 14498:20190610:100258.434 In zbx_snmp_translate() OID:'SNMPv2-MIB::sysName.0' 14498:20190610:100258.434 End of zbx_snmp_translate() oid_translated:'SNMPv2-MIB::sysName.0' 14498:20190610:100258.434 In zbx_snmp_get_values() num:1 level:0 14498:20190610:100306.442 zbx_snmp_get_values() snmp_synch_response() status:2 s_snmp_errno:-24 errstat:-1 mapping_num:1 14498:20190610:100306.442 End of zbx_snmp_get_values():NETWORK_ERROR 14498:20190610:100306.443 End of zbx_snmp_process_standard():NETWORK_ERROR 14498:20190610:100306.443 In zbx_snmp_close_session() 14498:20190610:100306.443 End of zbx_snmp_close_session() 14498:20190610:100306.443 getting SNMP values failed: Timeout while connecting to "127.0.0.1:161". 14498:20190610:100306.443 End of get_values_snmp() 14498:20190610:100306.444 In zbx_deactivate_item_host() hostid:10433 itemid:41578 type:4 14498:20190610:100306.444 query [txnlev:1] [begin;] 14498:20190610:100306.446 query [txnlev:1] [update hosts set snmp_errors_from=1560150186,snmp_disable_until=1560150201 where hostid=10433] 14498:20190610:100306.448 query [txnlev:1] [commit;] 14498:20190610:100306.451 SNMP agent item "snmp[snmp20]" on host "snmp-test-proxy" failed: first network error, wait for 15 seconds 14498:20190610:100306.451 zbx_deactivate_item_host() errors_from:1560150186 available:1 14498:20190610:100306.451 End of zbx_deactivate_item_host() 14498:20190610:100306.452 End of get_values():1 14498:20190610:100306.452 __zbx_zbx_setproctitle() title:'poller #1 [got 1 values in 8.018766 sec, getting values]' 14498:20190610:100306.452 In get_values() 14498:20190610:100306.452 In DCconfig_get_poller_items() poller_type:0 14498:20190610:100306.453 End of DCconfig_get_poller_items():0 14498:20190610:100306.453 In DCconfig_get_poller_nextcheck() poller_type:0 14498:20190610:100306.453 End of DCconfig_get_poller_nextcheck():1560150187 14498:20190610:100306.453 End of get_values():0 14498:20190610:100306.454 __zbx_zbx_setproctitle() title:'poller #1 [got 0 values in 0.001356 sec, idle 1 sec]' 14498:20190610:100307.454 __zbx_zbx_setproctitle() title:'poller #1 [got 0 values in 0.001356 sec, getting values]' 14498:20190610:100307.455 In get_values() 14498:20190610:100307.455 In DCconfig_get_poller_items() poller_type:0 14498:20190610:100307.455 End of DCconfig_get_poller_items():0 14498:20190610:100307.455 In DCconfig_get_poller_nextcheck() poller_type:0 14498:20190610:100307.455 End of DCconfig_get_poller_nextcheck():1560150188 14498:20190610:100307.455 End of get_values():0 14498:20190610:100307.456 __zbx_zbx_setproctitle() title:'poller #1 [got 0 values in 0.001426 sec, idle 1 sec]' 14498:20190610:100308.457 __zbx_zbx_setproctitle() title:'poller #1 [got 0 values in 0.001426 sec, getting values]' 14498:20190610:100308.457 In get_values() 14498:20190610:100308.457 In DCconfig_get_poller_items() poller_type:0 14498:20190610:100308.457 End of DCconfig_get_poller_items():0 14498:20190610:100308.458 In DCconfig_get_poller_nextcheck() poller_type:0 14498:20190610:100308.458 End of DCconfig_get_poller_nextcheck():1560150429 At first, the timings are set correctly "snmp_errors_from=1560150186,snmp_disable_until=1560150201" but then nextcheck changed to 1560150429 14498:20190610:100703.682 __zbx_zbx_setproctitle() title:'poller #1 [got 0 values in 0.000978 sec, getting values]' 14498:20190610:100703.682 In get_values() 14498:20190610:100703.682 In DCconfig_get_poller_items() poller_type:0 14498:20190610:100703.682 End of DCconfig_get_poller_items():0 14498:20190610:100703.682 In DCconfig_get_poller_nextcheck() poller_type:0 14498:20190610:100703.682 End of DCconfig_get_poller_nextcheck():1560150429 14498:20190610:100703.682 End of get_values():0 14498:20190610:100703.682 __zbx_zbx_setproctitle() title:'poller #1 [got 0 values in 0.000350 sec, idle 5 sec]' 14498:20190610:100708.684 __zbx_zbx_setproctitle() title:'poller #1 [got 0 values in 0.000350 sec, getting values]' 14498:20190610:100708.684 In get_values() 14498:20190610:100708.685 In DCconfig_get_poller_items() poller_type:0 14498:20190610:100708.685 End of DCconfig_get_poller_items():0 14498:20190610:100708.685 In DCconfig_get_poller_nextcheck() poller_type:0 14498:20190610:100708.685 End of DCconfig_get_poller_nextcheck():1560150429 14498:20190610:100708.685 End of get_values():0 14498:20190610:100708.685 __zbx_zbx_setproctitle() title:'poller #1 [got 0 values in 0.001038 sec, idle 1 sec]' 14498:20190610:100709.686 __zbx_zbx_setproctitle() title:'poller #1 [got 0 values in 0.001038 sec, getting values]' 14498:20190610:100709.686 In get_values() 14498:20190610:100709.686 In DCconfig_get_poller_items() poller_type:0 14498:20190610:100709.686 End of DCconfig_get_poller_items():0 14498:20190610:100709.686 In DCconfig_get_poller_nextcheck() poller_type:0 14498:20190610:100709.686 End of DCconfig_get_poller_nextcheck():1560150759 14498:20190610:100709.686 End of get_values():0 14498:20190610:100709.686 __zbx_zbx_setproctitle() title:'poller #1 [got 0 values in 0.000409 sec, idle 5 sec]' 14498:20190610:100714.689 __zbx_zbx_setproctitle() title:'poller #1 [got 0 values in 0.000409 sec, getting values]' 14498:20190610:100714.689 In get_values() 14498:20190610:100714.689 In DCconfig_get_poller_items() poller_type:0 14498:20190610:100714.690 End of DCconfig_get_poller_items():0 14498:20190610:100714.690 In DCconfig_get_poller_nextcheck() poller_type:0 14498:20190610:100714.690 End of DCconfig_get_poller_nextcheck():1560150759 14498:20190610:100714.690 End of get_values():0 14498:20190610:100714.690 __zbx_zbx_setproctitle() title:'poller #1 [got 0 values in 0.001412 sec, idle 5 sec]' 14498:20190610:101230.034 __zbx_zbx_setproctitle() title:'poller #1 [got 0 values in 0.000589 sec, getting values]' 14498:20190610:101230.034 In get_values() 14498:20190610:101230.034 In DCconfig_get_poller_items() poller_type:0 14498:20190610:101230.034 End of DCconfig_get_poller_items():0 14498:20190610:101230.034 In DCconfig_get_poller_nextcheck() poller_type:0 14498:20190610:101230.034 End of DCconfig_get_poller_nextcheck():1560150759 14498:20190610:101230.034 End of get_values():0 14498:20190610:101230.034 __zbx_zbx_setproctitle() title:'poller #1 [got 0 values in 0.000354 sec, idle 5 sec]' 14498:20190610:101235.043 __zbx_zbx_setproctitle() title:'poller #1 [got 0 values in 0.000354 sec, getting values]' 14498:20190610:101235.043 In get_values() 14498:20190610:101235.043 In DCconfig_get_poller_items() poller_type:0 14498:20190610:101235.043 End of DCconfig_get_poller_items():0 14498:20190610:101235.043 In DCconfig_get_poller_nextcheck() poller_type:0 14498:20190610:101235.043 End of DCconfig_get_poller_nextcheck():1560150759 14498:20190610:101235.043 End of get_values():0 14498:20190610:101235.043 __zbx_zbx_setproctitle() title:'poller #1 [got 0 values in 0.000380 sec, idle 4 sec]' 14498:20190610:101239.045 __zbx_zbx_setproctitle() title:'poller #1 [got 0 values in 0.000380 sec, getting values]' 14498:20190610:101239.045 In get_values() 14498:20190610:101239.045 In DCconfig_get_poller_items() poller_type:0 14498:20190610:101239.046 End of DCconfig_get_poller_items():0 14498:20190610:101239.046 In DCconfig_get_poller_nextcheck() poller_type:0 14498:20190610:101239.046 End of DCconfig_get_poller_nextcheck():1560150760 14498:20190610:101239.046 End of get_values():0 14498:20190610:101239.046 __zbx_zbx_setproctitle() title:'poller #1 [got 0 values in 0.000710 sec, idle 1 sec]' 14498:20190610:101240.047 __zbx_zbx_setproctitle() title:'poller #1 [got 0 values in 0.000710 sec, getting values]' 14498:20190610:101240.047 In get_values() 14498:20190610:101240.047 In DCconfig_get_poller_items() poller_type:0 14498:20190610:101240.047 End of DCconfig_get_poller_items():0 14498:20190610:101240.048 In DCconfig_get_poller_nextcheck() poller_type:0 14498:20190610:101240.048 End of DCconfig_get_poller_nextcheck():1560150761 14498:20190610:101240.048 End of get_values():0 14498:20190610:101240.048 __zbx_zbx_setproctitle() title:'poller #1 [got 0 values in 0.001332 sec, idle 1 sec]' 14498:20190610:101708.663 __zbx_zbx_setproctitle() title:'poller #1 [got 0 values in 0.001320 sec, getting values]' 14498:20190610:101708.663 In get_values() 14498:20190610:101708.664 In DCconfig_get_poller_items() poller_type:0 14498:20190610:101708.664 End of DCconfig_get_poller_items():0 14498:20190610:101708.664 In DCconfig_get_poller_nextcheck() poller_type:0 14498:20190610:101708.664 End of DCconfig_get_poller_nextcheck():1560151029 14498:20190610:101708.664 End of get_values():0 14498:20190610:101708.665 __zbx_zbx_setproctitle() title:'poller #1 [got 0 values in 0.001412 sec, idle 1 sec]' 14499:20190610:101709.612 enabling SNMP agent checks on host "snmp-test-proxy": host became available 14498:20190610:101709.665 __zbx_zbx_setproctitle() title:'poller #1 [got 0 values in 0.001412 sec, getting values]' 14498:20190610:101709.665 In get_values() 14498:20190610:101709.665 In DCconfig_get_poller_items() poller_type:0 14498:20190610:101709.666 End of DCconfig_get_poller_items():1 14498:20190610:101709.666 In substitute_key_macros() data:'snmp[snmp301]' 14498:20190610:101709.666 End of substitute_key_macros():SUCCEED data:'snmp[snmp301]' 14498:20190610:101709.666 In substitute_simple_macros() data:'161' 14498:20190610:101709.666 In substitute_simple_macros() data:'public' 14498:20190610:101709.666 In substitute_key_macros() data:'SNMPv2-MIB::sysName.0' 14498:20190610:101709.667 End of substitute_key_macros():SUCCEED data:'SNMPv2-MIB::sysName.0' 14498:20190610:101709.667 In get_values_snmp() host:'snmp-test-proxy' addr:'127.0.0.1' num:1 14498:20190610:101709.667 In zbx_snmp_open_session() 14498:20190610:101709.667 SNMP [public@127.0.0.1:161] 14498:20190610:101709.667 End of zbx_snmp_open_session() 14498:20190610:101709.667 In zbx_snmp_process_standard() 14498:20190610:101709.667 In zbx_snmp_translate() OID:'SNMPv2-MIB::sysName.0' 14498:20190610:101709.667 End of zbx_snmp_translate() oid_translated:'SNMPv2-MIB::sysName.0' 14498:20190610:101709.668 In zbx_snmp_get_values() num:1 level:0 14498:20190610:101709.671 zbx_snmp_get_values() snmp_synch_response() status:0 s_snmp_errno:0 errstat:0 mapping_num:1 14498:20190610:101709.671 In zbx_snmp_set_result() type:4 14498:20190610:101709.671 In zbx_snmp_get_octet_string() 14498:20190610:101709.671 zbx_snmp_get_octet_string() full value:'STRING: zabb40.localdomain' hint:'255a' 14498:20190610:101709.671 End of zbx_snmp_get_octet_string():'zabb40.localdomain' 14498:20190610:101709.671 End of zbx_snmp_set_result():SUCCEED 14498:20190610:101709.672 End of zbx_snmp_get_values():SUCCEED 14498:20190610:101709.672 End of zbx_snmp_process_standard():SUCCEED 14498:20190610:101709.672 In zbx_snmp_close_session() 14498:20190610:101709.672 End of zbx_snmp_close_session() 14498:20190610:101709.672 End of get_values_snmp() 14498:20190610:101709.672 In zbx_activate_item_host() hostid:10433 itemid:42429 type:4 14498:20190610:101709.672 End of zbx_activate_item_host() 14498:20190610:101709.672 End of get_values():1 14498:20190610:101709.672 __zbx_zbx_setproctitle() title:'poller #1 [got 1 values in 0.007457 sec, idle 1 sec]' |
Comment by Vladislavs Sokurenko [ 2019 Jun 10 ] |
Is 4.0.7 also affected or any other version ? Knowing affected versions could help to identify what caused the issue faster. |
Comment by Vladislavs Sokurenko [ 2019 Jun 10 ] |
Could be related to or caused by |
Comment by Edgars Melveris [ 2019 Jun 10 ] |
My tests where done on 4.0.9, I could downgrade and test on 4.0.7 |
Comment by Edgars Melveris [ 2019 Jun 10 ] |
Hello, even 4.0.0 seems to be affected. |
Comment by dimir [ 2019 Jun 10 ] |
The setup contains 1 host with 600 items. Item update interval is 30m. This gives 1 check every 3 seconds average. But as you can see above the function DCconfig_get_poller_nextcheck() called from get_values() returns +300 seconds as nextcheck at some point. In result - Unreachable* and Unavailable* settings are ignored and we wait for 5 minutes before setting host unavailable. Thanks to zux, changing update interval of 1 item to 2s, with Timeout=8, UnreachableDelay is respected (15s+8s=24s): 21003:20190610:172518.697 SNMP agent item "snmp[snmp900]" on host "snmp-test-server" failed: first network error, wait for 15 seconds 21004:20190610:172542.118 SNMP agent item "snmp[snmp900]" on host "snmp-test-server" failed: another network error, wait for 15 seconds 21004:20190610:172606.143 temporarily disabling SNMP agent checks on host "snmp-test-server": host unavailable Why does DCconfig_get_poller_nextcheck() return +300 seconds at some point? This is what we need to find out. |
Comment by Vladislavs Sokurenko [ 2019 Jun 11 ] |
Could you please provide refresh unsupported value |
Comment by dimir [ 2019 Jun 11 ] |
Looks like in case of first network error we must select MIN("update interval", "UnreachableDelay") for the nextcheck. Needs further investigation. |
Comment by Glebs Ivanovskis [ 2019 Jun 11 ] |
Hold on. We probably don't want to check items earlier than needed just because we want to figure out if the host is reachable...
Zabbix "seeds" checks based on itemids and whatnot (depending on item type). Moreover, if these are SNMP items and bulk polling is enabled it is theoretically possible that all host's items are polled at once. Since Unreachable* settings operate on the host level it could be that they don't have any effect until next scheduled item check and this is an expected (but undocumented?) behaviour. |
Comment by Edgars Melveris [ 2019 Jun 11 ] |
Refresh unsupported is set to 10 seconds. |
Comment by Glebs Ivanovskis [ 2019 Jun 11 ] |
What you call "strange logic" can be "by design" in Zabbix. Have you found which version of Zabbix is not affected by this bug? How do you expect Zabbix to behave in such case? |
Comment by Edgars Melveris [ 2019 Jun 11 ] |
I would expect it to work as documented:
https://www.zabbix.com/documentation/4.0/manual/appendix/items/unreachability I am yet to find a version, where this works as expected |
Comment by Glebs Ivanovskis [ 2019 Jun 11 ] |
Well, it leaves a lot of room for interpretation. As far as I remember, it looks a bit different from the code. If we ignore for a moment bulk polling of SNMP and JMX items, flexible and scheduling intervals, unsupported items, etc., it goes roughly like this. Zabbix server/proxy will schedule item to be checked at Unix timestamps equal to itemid % UpdateInterval + UpdateInterval × N, where N is a whole number. If there is a network error at timestamp T, all items belonging to the same host and scheduled to be checked within the range from T to T + UnreachableDelay are rescheduled to the next check (basically, substitute N → N + 1 in the first formula). So the code guarantees that host won't be touched for UnreachableDelay seconds, but there is no guarantee that it will be checked immediately after UnreachableDelay seconds will have passed. UnavailableDelay and UnreachablePeriod act in a similar fashion. You can think of them as "silence intervals" on top of regular item polling schedule. Since polling schedule is partially determined by itemids, there can be large gaps in regular polling schedule. Since host is considered unavailable only after a second network error, there are inconsistencies in how server and proxy see this situation. Proxy does not communicate to server that host is unreachable before second network happens, but reschedules some of the checks. Server considers host reachable and sees that some checks were missed by proxy. |
Comment by Edgars Melveris [ 2019 Jun 11 ] |
Ok, I've tested, that on version 3.4.7 it works as expected. 24373:20190611:225523.050 SNMP agent item "snmp[snmp127]" on host "snmp-test-server" failed: first network error, wait for 15 seconds 24377:20190611:225546.324 SNMP agent item "snmp[snmp9]" on host "snmp-test-server" failed: another network error, wait for 15 seconds 24377:20190611:225609.351 temporarily disabling SNMP agent checks on host "snmp-test-server": host unavailable 24377:20190611:225713.473 enabling SNMP agent checks on host "snmp-test-server": host became available I'm not sure why it's 23 seconds, because this instance has timeout set to 4 seconds, but I will look for a version, when this changed. Update 26361:20190611:235705.208 SNMP agent item "snmp[snmp109]" on host "snmp-test-server" failed: first network error, wait for 15 seconds 26366:20190611:235728.849 SNMP agent item "snmp[snmp114]" on host "snmp-test-server" failed: another network error, wait for 15 seconds 26366:20190611:235752.026 temporarily disabling SNMP agent checks on host "snmp-test-server": host unavailable 3.4.9: 26706:20190612:000909.472 SNMP agent item "snmp[snmp113]" on host "snmp-test-server" failed: first network error, wait for 15 seconds 26709:20190612:001109.521 temporarily disabling SNMP agent checks on host "snmp-test-server": host unavailable Notice how there is no "another network error" in the second, which probably means, that the UnreachablePeriod=45 is honored. |
Comment by Glebs Ivanovskis [ 2019 Jun 12 ] |
Great finding, zux! |
Comment by dimir [ 2019 Jun 13 ] |
Looks like we have 2 issues here:
Agree? |
Comment by Andrejs Kozlovs [ 2019 Aug 29 ] |
Fixed in:
Documentation updated: |