[ZBX-7585] zabbix stopped gathering data Created: 2013 Dec 26  Updated: 2017 May 30  Resolved: 2016 Apr 28

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Server (S)
Affects Version/s: 2.2.1
Fix Version/s: None

Type: Incident report Priority: Minor
Reporter: sles Assignee: Unassigned
Resolution: Cannot Reproduce Votes: 1
Labels: timeout
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Centos 6/x86-64


Attachments: File zabbix_server.log.bz2     File zabbix_server.log.gz     File zabbix_server.log.gz    
Issue Links:
Duplicate
is duplicated by ZBX-7568 Zabbix stops to collect data after fi... Closed

 Description   

Hello!

I just noticed that there has been no data from one of the hosts for more than 10 minutes. I found this in the log:

24613:20131226:134300.808 Zabbix agent item "net.if.out[eth1,bytes]" on host "inetgw-nsk" failed: first network error, wait for 15 seconds

I noticed this at around 13:55, so I restarted the Zabbix server and immediately got:

28701:20131226:135742.513 resuming Zabbix agent checks on host "inetgw-nsk": connection restored

This is definitely a bug: even if there was a network problem, far more than 15 seconds had passed...



 Comments   
Comment by Marc [ 2013 Dec 26 ]

This does not sound like a bug report.
You may find community support on IRC and the forums, and commercial support here

Comment by richlv [ 2013 Dec 26 ]

this might be very hard to trace down - i don't see anything that could be done right now.

next time this happens, you could try stracing the poller processes and seeing whether any of them is stuck - and if so, on what.

Comment by sles [ 2013 Dec 27 ]

Marc, not a bug report?
I hit a bug and reported it.
Still not? :-D

richlv, I'll definitely trace it.

Comment by Marc [ 2013 Dec 27 ]

"Problem" seems more appropriate to me than "bug".

Comment by sles [ 2013 Dec 27 ]

I'm sure this is a bug. Period.

Comment by Oleksii Zagorskyi [ 2013 Dec 27 ]

Sles, it would have been better to provide the UnreachablePeriod and UnavailableDelay values from zabbix_server.conf from the very beginning.

Check this page: https://www.zabbix.com/documentation/2.2/manual/appendix/items/unreachability. You can find some useful points there.
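
To make the settings mentioned above concrete, here is a minimal sketch of the recheck schedule they imply. It is a simplified model, not the server's actual code: the function name `recheck_offsets` is hypothetical, the defaults (UnreachableDelay=15, UnreachablePeriod=45, UnavailableDelay=60) are the documented defaults as far as I know, and it assumes an unreachable poller is always free to run each recheck on time.

```python
def recheck_offsets(horizon, unreachable_period=45,
                    unreachable_delay=15, unavailable_delay=60):
    """Return the offsets (seconds after the first network error) at which
    a recheck would be expected, up to `horizon` seconds.

    Simplified model (an assumption, not Zabbix source code):
    - while the host is "unreachable", recheck every `unreachable_delay`;
    - once `unreachable_period` has elapsed, the host counts as
      "unavailable" and is rechecked every `unavailable_delay`.
    """
    offsets = []
    t = unreachable_delay
    while t <= horizon:
        offsets.append(t)
        if t < unreachable_period:
            t += unreachable_delay
        else:
            t += unavailable_delay
    return offsets


if __name__ == "__main__":
    # Expected rechecks during the first three minutes of an outage.
    print(recheck_offsets(180))
```

Under this model the longest pause between two rechecks is UnavailableDelay, so the settings alone would not account for a gap of many minutes unless rechecks were being delayed for some other reason.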

Comment by sles [ 2013 Dec 27 ]

Oleksiy Zagorskyi, they are at their defaults.
Anyway, there was no "temporarily disabling Zabbix agent checks on host [New host]: host unavailable" message in the log, only what I already provided.
There was also no indication in the web interface that the host was not being monitored.

Comment by sles [ 2013 Dec 31 ]

Hello!

I just upgraded the Zabbix agent on several hosts, and on all of them there was about 10 minutes of no data:

14580:20131231:075602.521 Zabbix agent item "net.if.in[eth0,bytes]" on host "inetgw-nsk" failed: first network error, wait for 15 seconds
14589:20131231:080515.459 resuming Zabbix agent checks on host "inetgw-nsk": connection restored

14571:20131231:080655.216 Zabbix agent item "vfs.fs.size[/,used]" on host "ast-ngdu2.xnet.belkam.com" failed: first network error, wait for 15 seconds
14592:20131231:081515.430 resuming Zabbix agent checks on host "ast-ngdu2.xnet.belkam.com": connection restored

etc.

Why is the timeout so long?

Thank you!
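
For measuring such gaps directly from zabbix_server.log, a minimal sketch follows. It assumes the line layout seen in the excerpts above (`PID:YYYYMMDD:HHMMSS.mmm message`) and pairs each "first network error" line with the next "connection restored" line for the same host. The function names are hypothetical, and other log messages are simply ignored.

```python
import re
from datetime import datetime

# Assumed layout, taken from the excerpts: PID:YYYYMMDD:HHMMSS.mmm message
LINE_RE = re.compile(r'^(\d+):(\d{8}):(\d{6}\.\d{3}) (.*)$')
HOST_RE = re.compile(r'on host "([^"]+)"')


def parse_line(line):
    """Return (timestamp, message) for a log line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    _pid, date, time, message = match.groups()
    return datetime.strptime(date + time, "%Y%m%d%H%M%S.%f"), message


def outage_gaps(lines):
    """Return (host, seconds) pairs: time from the first network error
    to the following 'connection restored' message for the same host."""
    first_error = {}
    gaps = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        stamp, message = parsed
        host = HOST_RE.search(message)
        if host is None:
            continue
        name = host.group(1)
        if "first network error" in message:
            # Keep the earliest error if several are logged before recovery.
            first_error.setdefault(name, stamp)
        elif "connection restored" in message and name in first_error:
            gaps.append((name, (stamp - first_error.pop(name)).total_seconds()))
    return gaps


if __name__ == "__main__":
    import sys
    # Usage: python gaps.py < zabbix_server.log
    for host, seconds in outage_gaps(sys.stdin):
        print(f"{host}: {seconds:.3f} s without agent data")
```

Applied to the two pairs of lines quoted above, this reports a gap of a little over nine minutes for "inetgw-nsk" and a little over eight minutes for "ast-ngdu2.xnet.belkam.com".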

Comment by Marc [ 2013 Dec 31 ]

Did you already:

Comment by sles [ 2014 Jan 02 ]

partially

1. yes
2. yes
3. not yet; I will turn debug on, otherwise I'll have no logs if the problem occurs again
4. no, because I still don't know how to reproduce the problem
5. no, because this is definitely not an agent problem

Comment by sles [ 2014 Jan 02 ]

OK, about 4: I just rebooted one of the servers.
There has been no Zabbix agent related traffic (only SNMP and ICMP) between the Zabbix server and the server with the agent for the last 5 minutes.
Zabbix shows the host status as "Monitored".

I will turn debug on later (quite busy right now) and try to reproduce it.

Now the connection is restored:

23283:20140102:075108.042 Zabbix agent item "net.if.in[eth0,bytes]" on host "asterisk.p98.belkam.com" failed: first network error, wait for 15 seconds
23328:20140102:075231.300 item [asterisk.p98.belkam.com:calls] became not supported: SNMP error: (noSuchName) There is no such variable name in this MIB.
23328:20140102:075631.078 item [asterisk.p98.belkam.com:calls] became supported
23296:20140102:075947.513 resuming Zabbix agent checks on host "asterisk.p98.belkam.com": connection restored

As you can see, SNMP checks resumed far earlier.

Thank you!

Comment by sles [ 2014 Jan 02 ]

full log with debug 4

Comment by sles [ 2014 Jan 02 ]

Sorry, I uploaded it several times; I got a JavaScript error for some reason.

Comment by richlv [ 2016 Apr 22 ]

and how many unreachable pollers were there? maybe this was a duplicate of ZBXNEXT-2359?

Comment by sles [ 2016 Apr 25 ]

Hello!

I don't remember how many pollers there were.

Now, on 3.0.2, we have
StartPollersUnreachable=20

There are no longer such events for agent-monitored devices, but I still get these errors for some very old devices using SNMP:

19155:20160425:102417.012 SNMP agent item "ifInOctets41" on host "p98a-cc6006-1.p98.belkam.com" failed: first network error, wait for 15 seconds
19180:20160425:102510.648 resuming SNMP agent checks on host "p98a-cc6006-1.p98.belkam.com": connection restored

I guess they do not respond fast enough.

I'm very sorry for asking this question, but are there any SNMP timeout options for the Zabbix server?

Comment by Aleksandrs Saveljevs [ 2016 Apr 25 ]

Yes, since ZBX-4393 the Timeout configuration option is respected for SNMP checks. Regarding a more fine-grained timeout and retry configuration for SNMP checks, please see ZBXNEXT-1096.
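
As an illustration only, the relevant lines in zabbix_server.conf might look like the fragment below. The Timeout value of 10 is an arbitrary example, not a recommendation; as far as I know the parameter defaults to 3 seconds and accepts 1 to 30. The StartPollersUnreachable value is the one quoted earlier in this thread.

```ini
# zabbix_server.conf (illustrative fragment; values are assumptions)

# Seconds to wait for agent, SNMP device or external check responses.
# Example value only; slow SNMP devices may need more than the default.
Timeout=10

# Number of pre-forked pollers dedicated to unreachable hosts
# (value taken from the 3.0.2 setup mentioned above).
StartPollersUnreachable=20
```

The server has to be restarted for changes to these parameters to take effect.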

Comment by richlv [ 2016 Apr 26 ]

also, i haven't seen any graphs of the busy pollers & unreachable pollers. were they all busy, maybe?
overall this seems to be more of a support case. i'd suggest closing it as a "can't reproduce" case and continuing this on irc and other channels.

Comment by sles [ 2016 Apr 28 ]

Aleksandrs, thank you for pointing to snmp timeout settings, I'll try this.
Anyway, I don't have these Zabbix agent related problems with the current 3.0.2 Zabbix version.
So it looks like the problem is fixed; please close this issue.

richlv, if you haven't seen any graphs, that is because you didn't ask for them.

Comment by Aleksandrs Saveljevs [ 2016 Apr 28 ]

Thanks for the info! Closing as "Cannot Reproduce" then.
