[ZBX-7881] Host not becoming available after UnavailablePeriod has passed and host is back up
Created: 2014 Feb 27  Updated: 2017 May 30  Resolved: 2014 Apr 15

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Agent (G)
Affects Version/s: 2.2.2
Fix Version/s: None

Type: Incident report
Priority: Critical
Reporter: Adrian Pinzari
Assignee: Unassigned
Resolution: Cannot Reproduce
Votes: 0
Labels: agent, snmp, unavailable
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

CentOS release 6.5 (Final) x64


Issue Links:
Duplicate
duplicates ZBX-5788 Agents becoming unreachable and never... Closed

 Description   

In our case, a host was marked as unavailable because its SNMP items failed to report data back. However, after the host was restarted and able to answer checks again, it was not marked as available, and as a result no data was collected until I restarted the zabbix_server process.

Here is the relevant output from zabbix_server.log:

3778:20140227:011532.839 SNMP agent item "ifInErrors[port4]" on host "XXXX" failed: first network error, wait for 15 seconds
3783:20140227:011550.394 SNMP agent item "ifInErrors[port4]" on host "XXXX" failed: another network error, wait for 15 seconds
3783:20140227:011608.512 SNMP agent item "ifOutOctets[port1]" on host "XXXX" failed: another network error, wait for 15 seconds
3783:20140227:011626.635 temporarily disabling SNMP agent checks on host "XXXX": host unavailable

The UnavailablePeriod is the default 60 seconds, and it is my impression that the UnavailablePoller was not even attempting to check.
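
To make the timing in the log excerpt above easier to read, here is a minimal Python sketch that parses the `PID:YYYYMMDD:HHMMSS.mmm` prefix of each line and prints the gap between consecutive entries. The helper names are hypothetical and the log lines are copied from the excerpt. The gaps come out at roughly 18 seconds, which would be consistent with the 15-second wait plus a few seconds for the check itself to time out (the timeout value is an assumption, it is not shown in the log).

```python
from datetime import datetime

# Log lines copied from the excerpt above.
LOG_LINES = [
    '3778:20140227:011532.839 SNMP agent item "ifInErrors[port4]" on host "XXXX" failed: first network error, wait for 15 seconds',
    '3783:20140227:011550.394 SNMP agent item "ifInErrors[port4]" on host "XXXX" failed: another network error, wait for 15 seconds',
    '3783:20140227:011608.512 SNMP agent item "ifOutOctets[port1]" on host "XXXX" failed: another network error, wait for 15 seconds',
    '3783:20140227:011626.635 temporarily disabling SNMP agent checks on host "XXXX": host unavailable',
]


def parse_timestamp(line):
    """Return the datetime encoded in the 'PID:YYYYMMDD:HHMMSS.mmm' prefix."""
    _pid, date, rest = line.split(":", 2)
    clock = rest.split(" ", 1)[0]
    return datetime.strptime(f"{date} {clock}", "%Y%m%d %H%M%S.%f")


def gaps_in_seconds(lines):
    """Return the time difference between each pair of consecutive lines."""
    stamps = [parse_timestamp(line) for line in lines]
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


gaps = gaps_in_seconds(LOG_LINES)
total = sum(gaps)
for gap in gaps:
    print(f"{gap:.3f} s")
print(f"first error to disabling: {total:.3f} s")
```

The same helper can be pointed at a longer slice of zabbix_server.log to check whether any further message for the host appears after the "temporarily disabling" line.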



 Comments   
Comment by Adrian Pinzari [ 2014 Feb 27 ]

I will add this as a follow-up, and this may turn into an RTFM case rather than a bug. In the config file, here is the section in question:

### Option: UnavailableDelay
#	How often host is checked for availability during the unavailability period, in seconds.
#
# Mandatory: no
# Range: 1-3600
# Default:
# UnavailableDelay=60

Does this mean that if UnavailableDelay is not explicitly set, then it will not take effect? i.e. I need to have

### Option: UnavailableDelay
#	How often host is checked for availability during the unavailability period, in seconds.
#
# Mandatory: no
# Range: 1-3600
# Default:
# UnavailableDelay=60
UnavailableDelay=60
Comment by Oleksii Zagorskyi [ 2014 Mar 05 ]

Could be related to ZBX-5788

Comment by richlv [ 2014 Mar 16 ]

no, default is set even if the corresponding line is commented out in the config file.

what's the busy rate for the unreachable pollers ? how many of them do you have ?
is this problem repeatable ? if it happens again, can you please try stracing unreachable poller and see whether it does anything at all.
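
As an illustration of the point about defaults, here is a minimal Python sketch of a config loader in which a commented-out line is simply ignored and the built-in default stays in effect. This is not the actual Zabbix parser; `DEFAULTS` and `load_config` are hypothetical names, and the only value taken from the thread is the default of 60 for `UnavailableDelay`.

```python
# Hypothetical built-in defaults; only the value 60 comes from the config excerpt.
DEFAULTS = {"UnavailableDelay": 60}


def load_config(text, defaults=DEFAULTS):
    """Start from the defaults and override only with uncommented key=value lines."""
    config = dict(defaults)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # comments and blank lines never change a value
        key, sep, value = line.partition("=")
        if sep:
            config[key.strip()] = int(value.strip())
    return config


commented_only = """\
### Option: UnavailableDelay
# Mandatory: no
# Range: 1-3600
# Default:
# UnavailableDelay=60
"""

explicit = commented_only + "UnavailableDelay=30\n"

print(load_config(commented_only))  # default still applies
print(load_config(explicit))        # explicit line overrides the default
```

Under this model, adding an explicit `UnavailableDelay=60` line changes nothing, because it sets the same value the default already provides.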

Comment by Adrian Pinzari [ 2014 Mar 25 ]

Hello,

The issue seems to have recurred:

1585:20140325:104311.079 SNMP agent item "system.cpu.load" on host "XXX" failed: first network error, wait for 15 seconds
1593:20140325:104328.623 SNMP agent item "system.cpu.load" on host "XXX" failed: another network error, wait for 15 seconds
1593:20140325:104346.861 SNMP agent item "ifOperStatus[port4]" on host "XXX" failed: another network error, wait for 15 seconds
1593:20140325:104405.067 temporarily disabling SNMP agent checks on host "XXX": host unavailable

My StartPollersUnreachable is set to 1 (default), and here is the busy rate data for the unreachable poller from around the time the host was disabled:
2014.Mar.25 11:04:50 10.5932
2014.Mar.25 11:03:50 10.5932
2014.Mar.25 11:02:50 10.5593
2014.Mar.25 11:01:50 10.4237
2014.Mar.25 11:00:50 5.3738
2014.Mar.25 10:59:50 9.3898
2014.Mar.25 10:58:50 10.7966
2014.Mar.25 10:57:51 11.8136
2014.Mar.25 10:56:50 10.7797
2014.Mar.25 10:55:50 11.6441
2014.Mar.25 10:54:50 10.9341
2014.Mar.25 10:53:50 10.6423
2014.Mar.25 10:52:50 10.6967
2014.Mar.25 10:51:50 10.661
2014.Mar.25 10:50:50 10.595
2014.Mar.25 10:49:50 10.5932
2014.Mar.25 10:48:50 6.9661
2014.Mar.25 10:47:50 9
2014.Mar.25 10:46:50 10.6967
2014.Mar.25 10:45:50 10.5103
2014.Mar.25 10:44:50 11.1695
2014.Mar.25 10:43:50 15.9661
2014.Mar.25 10:42:50 5.2712
2014.Mar.25 10:41:50 5.3051
2014.Mar.25 10:40:50 5.4915
2014.Mar.25 10:39:50 5.3738
2014.Mar.25 10:38:50 5.339
2014.Mar.25 10:37:51 5.9492
2014.Mar.25 10:36:51 6.304
2014.Mar.25 10:35:50 5.4068
2014.Mar.25 10:34:50 5.3559
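
A quick way to summarize the samples above is to compare the average before and after the first network error at 10:43:11. The Python sketch below does that with the values copied from the list; the split point and the helper names are assumptions made for illustration. A higher average afterwards would suggest the unreachable poller is doing some work, but it does not by itself show which host it is checking.

```python
from datetime import datetime
from statistics import mean

# Time of the first network error in the log excerpt (assumed split point).
FIRST_ERROR = datetime(2014, 3, 25, 10, 43, 11)

# Samples copied from the list above, newest first.
SAMPLES = """\
2014.Mar.25 11:04:50 10.5932
2014.Mar.25 11:03:50 10.5932
2014.Mar.25 11:02:50 10.5593
2014.Mar.25 11:01:50 10.4237
2014.Mar.25 11:00:50 5.3738
2014.Mar.25 10:59:50 9.3898
2014.Mar.25 10:58:50 10.7966
2014.Mar.25 10:57:51 11.8136
2014.Mar.25 10:56:50 10.7797
2014.Mar.25 10:55:50 11.6441
2014.Mar.25 10:54:50 10.9341
2014.Mar.25 10:53:50 10.6423
2014.Mar.25 10:52:50 10.6967
2014.Mar.25 10:51:50 10.661
2014.Mar.25 10:50:50 10.595
2014.Mar.25 10:49:50 10.5932
2014.Mar.25 10:48:50 6.9661
2014.Mar.25 10:47:50 9
2014.Mar.25 10:46:50 10.6967
2014.Mar.25 10:45:50 10.5103
2014.Mar.25 10:44:50 11.1695
2014.Mar.25 10:43:50 15.9661
2014.Mar.25 10:42:50 5.2712
2014.Mar.25 10:41:50 5.3051
2014.Mar.25 10:40:50 5.4915
2014.Mar.25 10:39:50 5.3738
2014.Mar.25 10:38:50 5.339
2014.Mar.25 10:37:51 5.9492
2014.Mar.25 10:36:51 6.304
2014.Mar.25 10:35:50 5.4068
2014.Mar.25 10:34:50 5.3559
"""


def parse_samples(text):
    """Turn 'YYYY.Mon.DD HH:MM:SS value' lines into (datetime, float) pairs."""
    rows = []
    for line in text.strip().splitlines():
        date, clock, value = line.split()
        stamp = datetime.strptime(f"{date} {clock}", "%Y.%b.%d %H:%M:%S")
        rows.append((stamp, float(value)))
    return rows


rows = parse_samples(SAMPLES)
before = mean([v for stamp, v in rows if stamp < FIRST_ERROR])
after = mean([v for stamp, v in rows if stamp >= FIRST_ERROR])
print(f"average busy rate before first error: {before:.2f}")
print(f"average busy rate after first error:  {after:.2f}")
```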

Comment by richlv [ 2014 Mar 25 ]

for those who see this problem, without restarting server, can you please do strace on unreachable poller and see what it is doing, if anything ?

you can see process type (and pid) in server startup messages, as well as in ps/top output since zabbix 2.2
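
A minimal Python sketch of the lookup described above: it scans `ps` output for a process whose title mentions "unreachable poller" and builds the matching strace command line. The process-title format in `SAMPLE_PS` is an assumption about what zabbix 2.2 shows, the PIDs are borrowed from the log excerpt purely for illustration, and the function names are hypothetical. Only the standard `strace -tt -p PID` options are used.

```python
import re

# Assumed title format: "<pid> ...zabbix_server: unreachable poller #N [...]"
POLLER_RE = re.compile(r"^\s*(\d+)\s+.*zabbix_server: unreachable poller #(\d+)")


def find_unreachable_pollers(ps_output):
    """Return (pid, poller_number) pairs found in 'ps -eo pid,args' style output."""
    found = []
    for line in ps_output.splitlines():
        match = POLLER_RE.match(line)
        if match:
            found.append((int(match.group(1)), int(match.group(2))))
    return found


def strace_command(pid):
    """Build an strace command that attaches to a running process with timestamps."""
    return ["strace", "-tt", "-p", str(pid)]


# Hypothetical sample of ps output, for illustration only.
SAMPLE_PS = """\
  PID COMMAND
 1585 /usr/sbin/zabbix_server: poller #1 [got 3 values in 0.000210 sec, idle 1 sec]
 1593 /usr/sbin/zabbix_server: unreachable poller #1 [got 0 values in 0.000008 sec, idle 5 sec]
"""

for pid, number in find_unreachable_pollers(SAMPLE_PS):
    print(f"unreachable poller #{number}:", " ".join(strace_command(pid)))
```

On a live server, the text passed to `find_unreachable_pollers` would come from running `ps -eo pid,args`, and the resulting command would normally need to be run as root or as the zabbix user.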

Comment by Juris Miščenko (Inactive) [ 2014 Apr 08 ]

Unfortunately, I couldn't reproduce this. Hosts become available very soon after connectivity is re-established. If you notice any special conditions in your setups that might influence this, please report them to us.
For the time being, I will leave this issue in the state of requiring information, but if this doesn't repeat itself in the near future, the issue will be closed as unreproducible.

Generated at Fri Apr 26 13:18:21 EEST 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.