[ZBX-7426] snmp checks fail with failed: first network error, wait for 15 seconds
Created: 2013 Nov 22  Updated: 2022 Oct 08  Resolved: 2014 Aug 02

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Server (S)
Affects Version/s: 2.2.0
Fix Version/s: None

Type: Incident report
Priority: Trivial
Reporter: sles
Assignee: Unassigned
Resolution: Duplicate
Votes: 6
Labels: retry, snmp
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

centos 6.4


Attachments: JPEG File screenshot-1.jpg    
Issue Links:
Duplicate
duplicates ZBXNEXT-1096 Configurable Timeout per item (host i... Closed

 Description   

Hello!

There are messages in server log:

[root@zabbix zabbix]# grep "Khokhryaki ABK room 130 UPS 2200" zabbix_server.log 
  6840:20131122:054359.552 SNMP agent item [upsAdvBatteryCapacity] on host [Khokhryaki ABK room 130 UPS 2200] failed: first network error, wait for 15 seconds
  6937:20131122:054400.590 SNMP agent item [upsAdvInputLineFailCause] on host [Khokhryaki ABK room 130 UPS 2200] failed: first network error, wait for 15 seconds
  7060:20131122:054415.207 resuming SNMP agent checks on host [Khokhryaki ABK room 130 UPS 2200]: connection restored
  6827:20131122:063504.611 SNMP agent item [upsAdvOutputCurrent] on host [Khokhryaki ABK room 130 UPS 2200] failed: first network error, wait for 15 seconds
  6978:20131122:063505.612 SNMP agent item [upsAdvOutputLoad] on host [Khokhryaki ABK room 130 UPS 2200] failed: first network error, wait for 15 seconds
  7061:20131122:063520.581 resuming SNMP agent checks on host [Khokhryaki ABK room 130 UPS 2200]: connection restored
  6837:20131122:064537.468 SNMP agent item [upsBasicBatteryStatus] on host [Khokhryaki ABK room 130 UPS 2200] failed: first network error, wait for 15 seconds
  7055:20131122:064552.802 resuming SNMP agent checks on host [Khokhryaki ABK room 130 UPS 2200]: connection restored
  6844:20131122:070129.039 SNMP agent item [upsAdvBatteryCapacity] on host [Khokhryaki ABK room 130 UPS 2200] failed: first network error, wait for 15 seconds
  6796:20131122:070130.106 SNMP agent item [upsAdvInputLineFailCause] on host [Khokhryaki ABK room 130 UPS 2200] failed: first network error, wait for 15 seconds
  7053:20131122:070145.585 resuming SNMP agent checks on host [Khokhryaki ABK room 130 UPS 2200]: connection restored
  6837:20131122:072029.159 SNMP agent item [upsAdvBatteryCapacity] on host [Khokhryaki ABK room 130 UPS 2200] failed: first network error, wait for 15 seconds
  7055:20131122:072044.904 resuming SNMP agent checks on host [Khokhryaki ABK room 130 UPS 2200]: connection restored
  6907:20131122:072529.707 SNMP agent item [upsAdvBatteryCapacity] on host [Khokhryaki ABK room 130 UPS 2200] failed: first network error, wait for 15 seconds
  6923:20131122:072530.044 SNMP agent item [upsAdvInputLineFailCause] on host [Khokhryaki ABK room 130 UPS 2200] failed: first network error, wait for 15 seconds
  7061:20131122:072545.188 resuming SNMP agent checks on host [Khokhryaki ABK room 130 UPS 2200]: connection restored
  6894:20131122:075559.451 SNMP agent item [upsAdvBatteryCapacity] on host [Khokhryaki ABK room 130 UPS 2200] failed: first network error, wait for 15 seconds
  6869:20131122:075600.478 SNMP agent item [upsAdvInputLineFailCause] on host [Khokhryaki ABK room 130 UPS 2200] failed: first network error, wait for 15 seconds
  7061:20131122:075615.297 resuming SNMP agent checks on host [Khokhryaki ABK room 130 UPS 2200]: connection restored

and no data is retrieved for several hosts.

For one of them, I created a simple script on the same host that checks the same value every minute:

#!/bin/sh
date  >>/var/log/snmpgettest
snmpget -v1 -c public 192.168.46.42 1.3.6.1.4.1.318.1.1.1.3.2.5.0 >>/var/log/snmpgettest

It has not failed once in the last 24 hours, so this looks like a Zabbix bug...

Thank you!
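
To see which items hit the error most often in a log like the one above, the matching lines can be counted per item. This is only a sketch: the function name count_snmp_errors is made up for illustration, and the log path passed to it is an assumption that depends on the installation. It accepts both the bracketed ([item]) and quoted ("item") forms that appear in this thread.

```shell
#!/bin/sh
# Count "first network error" lines per SNMP item in a Zabbix log.
# count_snmp_errors is a hypothetical helper name, not part of Zabbix.
count_snmp_errors() {
    # $1: path to the log file (location is an assumption, adjust as needed)
    grep 'first network error' "$1" |
        # keep only the item name, whether it is written as [item] or "item"
        sed -e 's/.*SNMP agent item [["]\(.*\)[]"] on host .*/\1/' |
        sort | uniq -c | sort -rn
}
```

Usage would be something like `count_snmp_errors zabbix_server.log`, which prints one line per item with the most frequent first.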



 Comments   
Comment by Aleksandrs Saveljevs [ 2013 Nov 25 ]

I used to get the same kind of behavior when I was running a UDP-intensive network application alongside Zabbix server. Once I stopped that application, the error no longer appeared.

So Zabbix probably fails because the UDP request it sends is dropped along the path and, since ZBX-4393, it does not retry. This might be fixed later in ZBXNEXT-1096.

snmpget does not fail because it retries 5 times by default if getting a value fails. Try repeating the same test with the "-r 0" option added to the snmpget invocation.
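
A single-attempt version of the reporter's test could look roughly like the sketch below. The function name probe_once and the SNMPGET override variable are made up for illustration. The -r (retries) and -t (timeout in seconds) options are standard Net-SNMP command-line options, and the 1 second timeout here is an arbitrary choice, not a value taken from Zabbix.

```shell
#!/bin/sh
# Single-attempt SNMP probe: -r 0 disables snmpget's own retries, so one
# lost UDP packet shows up as a failure instead of being hidden.
# SNMPGET is a hypothetical override, useful for pointing at another binary.
SNMPGET=${SNMPGET:-snmpget}

probe_once() {
    # $1: target host, $2: OID, $3: log file to append to
    if out=$("$SNMPGET" -v1 -c public -r 0 -t 1 "$1" "$2" 2>&1); then
        echo "$(date) OK $out" >>"$3"
    else
        echo "$(date) FAIL $out" >>"$3"
        return 1
    fi
}
```

Called from cron every minute as `probe_once 192.168.46.42 1.3.6.1.4.1.318.1.1.1.3.2.5.0 /var/log/snmpgettest`, it logs each attempt as OK or FAIL, which makes a single dropped request visible.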

Comment by sles [ 2013 Nov 29 ]

Hello!

I'd like to add retries to Zabbix.
But it is absolutely unclear to me how I can do this, even after looking at ZBXNEXT-1096.
Could you tell me how?
Thank you!

Comment by Aleksandrs Saveljevs [ 2013 Nov 29 ]

Adding retries to Zabbix is trivial: you should wait until ZBXNEXT-1096 is implemented.

If you wish to patch Zabbix server as a workaround in the meantime, you can change "session.retries = 0" and "session.timeout = ..." in src/zabbix_server/poller/checks_snmp.c. I have tried that, though, and it did not help in my scenario.

Have you performed the test again with snmpget by adding "-r 0" to the command line?
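
For anyone attempting the patch, the two assignments mentioned above can be located in an unpacked source tree with grep. This is a sketch only: the function name show_snmp_session_settings is made up, the location of the source tree is an assumption, and the exact spacing of the assignments may differ between versions, which is why the pattern allows any whitespace before the "=".

```shell
#!/bin/sh
# Print the lines (with line numbers) where the SNMP session retries and
# timeout are assigned. show_snmp_session_settings is a hypothetical name.
show_snmp_session_settings() {
    # $1: top directory of the unpacked Zabbix sources (an assumption)
    grep -n -E 'session\.(retries|timeout)[[:space:]]*=' \
        "$1/src/zabbix_server/poller/checks_snmp.c"
}
```

After editing the values, the server has to be rebuilt and reinstalled for the change to take effect.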

Comment by sles [ 2013 Nov 29 ]

Thank you, I'll try to patch.

I have just added -r0 and will report the results in 1-2 days.
Anyway, I also ping these hosts and there is no ICMP loss, so I hope the same is true for UDP.

Comment by sles [ 2013 Nov 30 ]

Well, changing to session.retries = 5 doesn't help.
But snmpget -r0 always gets data.
This is strange, but it means the problem is not just in SNMP...

Comment by Aleksandrs Saveljevs [ 2013 Dec 02 ]

In my case, changing to "session.retries = 3" did not help either. When I investigated the problem a bit, tcpdump showed that Zabbix sends request packets, but there is no response. The log on the SNMP device shows that in those cases it sometimes drops outgoing UDP packets (i.e., IF-MIB::ifOutDiscards.1 increases), but sometimes it does not. So the problem might be on the device side, where it limits the request or response rate, but I have not found such a setting yet.

Comment by sles [ 2013 Dec 03 ]

Anyway, I get far fewer such errors in the log after increasing retries...

Comment by Cristian Vasquez Lucic [ 2013 Dec 03 ]

Hi guys, I'm getting the same issue with all my hosts. I have Cacti and Zabbix on the same machine. Can you point me to where I can find the "src/zabbix_server/poller/checks_snmp.c" file? I'm running Zabbix 2.2 on a CentOS 6.4 64-bit machine with MySQL and I can't find this file or the "session.retries = 0" anywhere.

The real issue is that all my Zabbix graphs are incomplete, but all my cacti graphs look fine.

Comment by Aleksandrs Saveljevs [ 2013 Dec 04 ]

You can download Zabbix sources from http://www.zabbix.com/download.php or check them out from Subversion repository at svn://svn.zabbix.com.

Comment by Przemek [ 2013 Dec 18 ]

Hi guys,
I have the same problem as you. There is one more thing that I would like to add. When I disable monitoring on devices on which I get an error, the same problem starts occurring on the others. Snmpget always works.
Thanks,

Comment by Vlad Ciobancai [ 2013 Dec 31 ]

I have the same problem on Zabbix 2.2.1. I would like to know if there will be a fix for this problem, because it is very annoying.

Comment by Corey Shaw [ 2014 Jan 16 ]

Just a thought, but a few of you may be seeing this error because your Zabbix poller processes are just busy and you simply need more of them. I'd suggest reading and implementing the advice here => http://blog.zabbix.com/monitoring-how-busy-zabbix-processes-are/457/ before blaming this on a bug (which it might legitimately be, but pollers should be checked first).

Comment by sles [ 2014 Jan 17 ]

They have been checked. Not all pollers are busy.
Thank you!

Comment by Ali HBB [ 2014 Mar 01 ]

Same problem here.
No busy pollers, and Cacti works well,
but Zabbix has gaps in graphs and there are errors in the Zabbix server log saying
first network error, wait ......

Comment by Vlad Ciobancai [ 2014 Mar 01 ]

Hey, for me the problem disappeared after snmpd was restarted on the application nodes (we use 6 application servers with RHEL 5.10).

Comment by diego serrano [ 2014 May 26 ]

Hello!
I have the same problem.
I'm running Zabbix 2.2.2 on RHEL 6.4, and when I monitor Windows servers through SNMP, my server randomly writes to the log:

5989:20140526:130943.953 SNMP agent item "hrStorageUsed[C:\ Label: Serial Number 60b133af]" on host "xxx" failed: first network error, wait for 15 seconds
5994:20140526:130958.335 resuming SNMP agent checks on host "xxx": connection restored

There is no network problem.
Thanks for the help.

Regards

Comment by Vlad Ciobancai [ 2014 May 26 ]

Please update the Zabbix agents and Zabbix server to the latest version, 2.2.3. This bug was fixed there: https://support.zabbix.com/browse/ZBXNEXT-98 and for me it works without any problems.
I set the Timeout value in the Zabbix agents' config file to 30.

Comment by jean-marc CHORIER [ 2014 Jul 04 ]

Hi, FYI:
I am on version 2.2.3 and I work only with SNMPv3 and network equipment.
I have a lot of error messages in my log.
The timeout is set to 30 in the server config file.
Regards

Comment by Vlad Ciobancai [ 2014 Jul 04 ]

Hi jean,

Can you paste the errors that you received in the zabbix_server log?

Comment by Raimonds Treimanis [ 2014 Jul 22 ]

I'm getting the same error on a regular basis. Most of my items are SNMPv2 (monitoring Cisco routers).
49214:20140722:061438.566 SNMP agent item "cbQosCMDropPkt64[12370,65536]" on host "xxx" failed: first network error, wait for 15 seconds
49326:20140722:061453.126 resuming SNMP agent checks on host "xxx": connection restored
49185:20140722:061453.857 received configuration data from server, datalen 3124127
49199:20140722:061538.747 SNMP agent item "ifInErrorsVlan[3369]" on host "xxx" failed: first network error, wait for 15 seconds
49332:20140722:061553.094 resuming SNMP agent checks on host "xxx": connection restored
Monitoring is done through proxies. Zabbix 2.2.4
If I move any host from one proxy to the other (both are identical VMs with identical configs), everything works fine for some time (really random, I couldn't find any dependencies) and then the same errors start again.
It doesn't depend on load. I tried several things:
1. Increased the number of pollers to 100, 200, even 1000
2. Moved all hosts to proxy 1 and left only one host on proxy 2
Nothing changed. The proxies have plenty of CPU and RAM available.

I also noticed that the number of busy pollers is much higher than it should be during those error periods, although far from 100%. You can see it in the attached graph. Note that nothing in the config was changed. At 22:20 the errors suddenly started and kept appearing until 10:45, when I restarted zabbix-proxy. After a restart they just disappear, only to return after some random period of time.

Comment by Oleksii Zagorskyi [ 2014 Aug 02 ]

After investigating SNMP a lot recently and rereading this thread, I can say that it is all about network errors and lost UDP packets.
As Aleksandrs stated in the first comment, starting from 2.2 Zabbix no longer does 5 retries (at the library level) for every SNMP get.

Note that starting from 2.2.3 it can look a bit different, and Raimonds' last comment absolutely confirms my issue report ZBX-8528.

I don't think we need to continue the discussion here; the problem is clear.
It is better to vote for ZBXNEXT-1096.

Comment by Oleksii Zagorskyi [ 2014 Aug 02 ]

Well, I'm closing this issue as a duplicate.
Feel free to reopen it if you are sure I did the wrong thing.
