Type: Incident report
Resolution: Unresolved
Priority: Trivial
Affects Version/s: None
Component/s: None
Environment: Rocky Linux 8.9
Zabbix server 6.0.29
Zabbix server runs on 2 nodes: one active, one passive
SNMP v3
I have an issue with Zabbix SNMP checks.
For example, I am using the "Generic by SNMP" template for Dell hardware on Nutanix.
It contains an item "Generic SNMP: SNMP agent availability", which is a Zabbix internal check:
zabbix[host,snmp,available]
This is just one example item from that template; basically all SNMP items on the host behave the same way.
Whenever I patch the software on that host, network errors are logged for it (probably because the host is unavailable for a brief period during patching).
That in itself is not the issue. The main issue is the following line, the last one in the log excerpt. Once it appears, the host never becomes available for monitoring again.
98218:20240603:184913.462 temporarily disabling SNMP agent checks on host "host1.domain.com": interface unavailable
I have confirmed that the zabbix_server.conf file contains
#UnavailableDelay=60
(commented out, so the default of 60 should apply), so checks should start working again 60 seconds after such a message.
In practice, the host never becomes available again (at least not until I restart the zabbix-server systemd service on the active Zabbix server node).
After that line, nothing more is written to zabbix_server.log for that host. Even when I manually initiate a check on that host, nothing appears in the log file for it.
The error I get for that host:
SNMP: Not available. Timeout while connecting to "ip:161"
However, I can confirm there is no actual timeout. Running snmpwalk from the Zabbix server command line retrieves the items just fine, and checking SNMP items via the Zabbix GUI also works without issue.
Result:
Once the last line, "temporarily disabling SNMP", is written to the log file, the host never becomes available again unless I restart the zabbix-server systemd service.
98217:20240603:184526.188 resuming SNMP agent checks on host "host1.domain.com": connection restored
98195:20240603:184827.582 SNMP agent item "citAvgLatencyUsecs[HYCU-cd12ff48-ecce-448f-9a57-f853483b9f7f.]" on host "host1.domain.com" failed: first network error, wait for 15 seconds
98215:20240603:184846.424 SNMP agent item "citAvgLatencyUsecs[HYCU-cd12ff48-ecce-448f-9a57-f853483b9f7f.]" on host "host1.domain.com" failed: another network error, wait for 15 seconds
98218:20240603:184850.429 SNMP agent item "dstIOBandwidth[17]" on host "host1.domain.com" failed: another network error, wait for 15 seconds
98215:20240603:184909.453 SNMP agent item "dstAverageLatency[7]" on host "host1.domain.com" failed: another network error, wait for 15 seconds
98218:20240603:184913.462 temporarily disabling SNMP agent checks on host "host1.domain.com": interface unavailable
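To find every host stuck in this state, the log can be scanned for "temporarily disabling" lines that are never followed by a "resuming" line. The following is only a rough sketch: the regular expression is derived from the two message formats in the excerpt above, and the function name and sample data are made up for illustration.

```python
import re
from datetime import datetime

# Matches the two state-change messages seen in the log excerpt:
#   "<pid>:<YYYYMMDD>:<HHMMSS.mmm> temporarily disabling SNMP agent checks on host "<host>": ..."
#   "<pid>:<YYYYMMDD>:<HHMMSS.mmm> resuming SNMP agent checks on host "<host>": ..."
STATE_RE = re.compile(
    r'^\s*\d+:(?P<ts>\d{8}:\d{6})\.\d{3} '
    r'(?P<action>temporarily disabling|resuming) SNMP agent checks '
    r'on host "(?P<host>[^"]+)"'
)


def hosts_still_disabled(lines):
    """Return {host: time of disable} for hosts whose most recent
    state change was a disable with no later resume (hypothetical helper)."""
    disabled = {}
    for line in lines:
        m = STATE_RE.match(line)
        if not m:
            continue  # item-level errors and unrelated lines are ignored
        ts = datetime.strptime(m.group("ts"), "%Y%m%d:%H%M%S")
        if m.group("action") == "resuming":
            disabled.pop(m.group("host"), None)
        else:
            disabled[m.group("host")] = ts
    return disabled


# Two lines taken from the excerpt in this report.
sample = [
    '98217:20240603:184526.188 resuming SNMP agent checks on host "host1.domain.com": connection restored',
    '98218:20240603:184913.462 temporarily disabling SNMP agent checks on host "host1.domain.com": interface unavailable',
]
for host, since in hosts_still_disabled(sample).items():
    print(f"{host}: SNMP checks disabled since {since}")
```

To run this against a real log, pass an open file object for zabbix_server.log (the path is whatever LogFile points to in zabbix_server.conf) instead of the sample list.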
Expected:
After the "temporarily disabling SNMP" line, there should be a 60-second gap (according to the UnavailableDelay=60 parameter), and then polling should start again.
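The expected timing can be worked out from the timestamps in the log excerpt. This is a small sketch that assumes the default UnavailableDelay of 60 seconds is in effect; the helper name is made up for illustration.

```python
from datetime import datetime, timedelta

# Assumption: the default applies because the setting is commented out in zabbix_server.conf.
UNAVAILABLE_DELAY = 60  # seconds


def parse_ts(log_prefix):
    """Parse a 'PID:YYYYMMDD:HHMMSS.mmm' log line prefix into a datetime (hypothetical helper)."""
    _pid, date, time = log_prefix.split(":")
    return datetime.strptime(f"{date} {time}", "%Y%m%d %H%M%S.%f")


first_error = parse_ts("98195:20240603:184827.582")  # "first network error" line
disabled_at = parse_ts("98218:20240603:184913.462")  # "temporarily disabling" line

time_to_disable = disabled_at - first_error
expected_retry = disabled_at + timedelta(seconds=UNAVAILABLE_DELAY)

print("time from first error to disable:", time_to_disable)
print("earliest expected retry:", expected_retry)
```

The gap between the first network error and the disable message is just under 46 seconds, which would be consistent with the documented default of UnreachablePeriod=45 (an assumption, since that setting is not quoted in this report). With UnavailableDelay=60, the first retry would be expected at about 18:50:13, yet no further log lines appear for the host.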