Problem report
Resolution: Cannot Reproduce
Trivial
None
5.2.4
None
Ubuntu 18.04 / 20.04
For the past couple of weeks, proxies have randomly stopped monitoring individual SNMP hosts for up to an hour.
In /etc/zabbix/zabbix_proxy.conf, we have the following settings:
Timeout=15
UnavailableDelay=300
UnreachableDelay=15
UnreachablePeriod=120
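As a rough illustration of how these settings are expected to interact, here is a minimal Python sketch. It assumes the documented meaning of the parameters: a host is treated as unavailable after UnreachablePeriod seconds of unreachability, and is then re-checked every UnavailableDelay seconds. The function names and the "delay plus timeout" bound are hypothetical simplifications for this report, not Zabbix internals.

```python
# Values copied from the zabbix_proxy.conf excerpt in this report (seconds).
SETTINGS = {
    "Timeout": 15,
    "UnavailableDelay": 300,
    "UnreachableDelay": 15,
    "UnreachablePeriod": 120,
}


def seconds_until_unavailable(settings):
    """Assumed model: the host is treated as unavailable after
    UnreachablePeriod seconds of continuous unreachability."""
    return settings["UnreachablePeriod"]


def expected_recheck_bound(settings):
    """Assumed model: while unavailable, the host is probed every
    UnavailableDelay seconds, and one probe may take up to Timeout seconds."""
    return settings["UnavailableDelay"] + settings["Timeout"]


if __name__ == "__main__":
    print("unavailable after (s):", seconds_until_unavailable(SETTINGS))
    print("expected gap between re-checks (s):", expected_recheck_bound(SETTINGS))
```

Under this simplified model a host that is unavailable should be probed again within roughly 315 seconds, so a pause of up to an hour is far outside what the configuration suggests.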
Here is an example in the log (with warning log level) where things work as expected:
123006.208 SNMP agent item "sensor.temperature" on host "pdu-302" failed: first network error, wait for 15 seconds
123051.023 SNMP agent item "phase.loadstate[3]" on host "pdu-302" failed: another network error, wait for 15 seconds
123206.062 temporarily disabling SNMP agent checks on host "pdu-302": host unavailable
123206.174 enabling SNMP agent checks on host "pdu-302": host became available
There was an issue getting SNMP data, the proxy tried again shortly after, marked the host as unavailable, and shortly afterwards marked it as available again.
Here is an example of unexpected behavior (log level 4 enabled after host was marked unavailable):
45296:20210317:130733.726 SNMP agent item "ilo.temperature[ambient]" on host "usvh016" failed: first network error, wait for 15 seconds
45679:20210317:130933.643 temporarily disabling SNMP agent checks on host "usvh016": host unavailable
45777:20210317:140933.116 enabling SNMP agent checks on host "usvh016": host became available
45383:20210317:141024.397 In get_values_snmp() host:'usvh016' addr:'usvh016-ilo' num:1
Zabbix reports only a single network error, then 2 minutes later marks the host as unavailable, and resumes monitoring only after 1 hour (tcpdump confirmed no SNMP traffic to the host in between). This happens with random hosts (different hosts each time) and only a couple of times per day.
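To find other occurrences in a proxy log, the disable/enable pairs can be matched per host and the gap between them measured. The following is a minimal Python sketch, assuming log lines in the `PID:YYYYMMDD:HHMMSS.mmm message` form shown in the second excerpt. The function name `unavailable_periods` and the 315-second threshold (UnavailableDelay plus Timeout from the settings above) are illustrative choices, not part of Zabbix.

```python
import re
from datetime import datetime

# Matches the "temporarily disabling" / "enabling" lines from the excerpt above.
LINE_RE = re.compile(
    r'^\d+:(\d{8}:\d{6}\.\d{3}) '
    r'(temporarily disabling|enabling) SNMP agent checks on host "([^"]+)"'
)


def unavailable_periods(lines):
    """Yield (host, disabled_at, enabled_at, seconds) for each
    disable/enable pair found in the given log lines."""
    disabled = {}  # host -> time at which checks were disabled
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        stamp, action, host = match.groups()
        when = datetime.strptime(stamp, "%Y%m%d:%H%M%S.%f")
        if action == "temporarily disabling":
            disabled[host] = when
        elif host in disabled:
            start = disabled.pop(host)
            yield host, start, when, (when - start).total_seconds()


# Log lines copied from this report.
SAMPLE = [
    '45296:20210317:130733.726 SNMP agent item "ilo.temperature[ambient]" on host "usvh016" failed: first network error, wait for 15 seconds',
    '45679:20210317:130933.643 temporarily disabling SNMP agent checks on host "usvh016": host unavailable',
    '45777:20210317:140933.116 enabling SNMP agent checks on host "usvh016": host became available',
]

if __name__ == "__main__":
    threshold = 315  # assumed bound: UnavailableDelay + Timeout
    for host, start, end, seconds in unavailable_periods(SAMPLE):
        flag = "UNEXPECTED" if seconds > threshold else "ok"
        print(f"{host}: {start} -> {end} ({seconds:.3f} s) {flag}")
```

For the sample lines, the gap between disabling and re-enabling checks on "usvh016" is just under 3600 seconds, which matches the one-hour pause described above.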
This behavior seems to have started randomly somewhere in Zabbix 5.0.x, and we still have it with Zabbix 5.2.4. Before that, everything was stable. Poller process usage is less than 25% at its peak, and unreachable poller usage is less than 4%. The host has plenty of resources. It happens.