|
I also tried disabling the bulk option for all devices, but no luck.
Network traffic of zabbix-proxy on eth0: less than 1 Mbps.
tcpdump on eth0: no ICMP forbidden, no ICMP unreachable, and no other martians.
|
|
This is a bug tracker, so only bugs should be reported here. For community support and troubleshooting, please refer to https://www.zabbix.org/wiki/Getting_help .
In general, since version 2.2 Zabbix only sends a single UDP packet for SNMP. If that gets lost for some reason, you will see what you observe. We have improved it in ZBX-8538, which was just merged. This should help you with the warnings you are getting, because ZBX-8538 makes Zabbix more resilient to network errors.
|
|
Thank you, but the reason I posted it here (and not in community support) is that the errors occur even for a device directly connected by a 1 Gbps link, and I don't see any issues in my network. I have made many attempts to narrow down the scope of the problem, but all trails lead me to believe this is a bug.
For example:
SNMP agent item "ifAdminStatus[FastEthernet0/18]" on host "lipeck-1-sw1" failed: first network error, wait for 15 seconds
The item waits for UnreachableDelay and is retried until UnreachablePeriod is reached, right? I tried setting UnreachablePeriod to large values, so that many retries can happen. How can it be that the same item cannot get data after many, many tries, while at the same time I successfully get that data with snmpget?
By the way, data for all ports other than "ifAdminStatus[FastEthernet0/18]" was received successfully, but it looks like Zabbix gets stuck on this item (for example) and cannot get its data for hours?
|
|
Compare with a Zabbix agent item:
1860:20141028:144116.063 Zabbix agent item "agent.ping" on host "zabbix-proxy" failed: first network error, wait for 15 seconds
2709:20141028:144131.008 Zabbix agent item "agent.ping" on host "zabbix-proxy" failed: another network error, wait for 15 seconds
2679:20141028:144146.017 Zabbix agent item "agent.ping" on host "zabbix-proxy" failed: another network error, wait for 15 seconds
2681:20141028:144201.027 temporarily disabling Zabbix agent checks on host "zabbix-proxy": host unavailable
Everything looks fine here: 3*15=45, UnreachablePeriod=45.
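The 3*15=45 arithmetic can be sketched as a small timeline model. This is only a simplified illustration of how UnreachableDelay and UnreachablePeriod interact, not the actual Zabbix implementation; the function name `unreachable_schedule` and its parameters are made up for this example, and the host is assumed to never answer.

```python
from datetime import datetime, timedelta


def unreachable_schedule(first_error, delay_s=15, period_s=45):
    """Return (retry_times, disable_time) for a host that never answers.

    Simplified model (an assumption, not Zabbix source code): after the
    first error the check is retried every `delay_s` seconds; the first
    attempt at or after `first_error + period_s` marks the host unavailable.
    """
    retries = []
    deadline = first_error + timedelta(seconds=period_s)
    t = first_error + timedelta(seconds=delay_s)
    while t < deadline:
        retries.append(t)
        t += timedelta(seconds=delay_s)
    return retries, t


# First "agent.ping" error from the log above, sub-second part dropped.
first = datetime(2014, 10, 28, 14, 41, 16)
retries, disabled = unreachable_schedule(first)
for r in retries:
    print("retry   ", r.time())
print("disabled", disabled.time())
```

With the values from the agent log (delay 15, period 45) the model gives two retries followed by the host being disabled 45 seconds after the first error, which is the pattern the agent.ping lines show.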
But here is the problem with SNMP items:
2271:20141028:144120.876 enabling SNMP agent checks on host "suschevka-1-sw2": host became available
2595:20141028:144330.281 SNMP agent item "locIfInOverrun[Vlan20]" on host "suschevka-1-sw2" failed: first network error, wait for 15 seconds
2710:20141028:144345.142 resuming SNMP agent checks on host "suschevka-1-sw2": connection restored
2441:20141028:144430.466 SNMP agent item "locIfCollisions[FastEthernet0/44]" on host "suschevka-1-sw2" failed: first network error, wait for 15 seconds
2655:20141028:144445.234 resuming SNMP agent checks on host "suschevka-1-sw2": connection restored
2431:20141028:144530.544 SNMP agent item "ifInErrors[FastEthernet0/45]" on host "suschevka-1-sw2" failed: first network error, wait for 15 seconds
2696:20141028:144545.332 resuming SNMP agent checks on host "suschevka-1-sw2": connection restored
1973:20141028:144630.446 SNMP agent item "ifInErrors[Vlan1]" on host "suschevka-1-sw2" failed: first network error, wait for 15 seconds
2746:20141028:144645.426 resuming SNMP agent checks on host "suschevka-1-sw2": connection restored
2197:20141028:144731.872 SNMP agent item "locIfInOverrun[GigabitEthernet0/1]" on host "suschevka-1-sw2" failed: first network error, wait for 15 seconds
2765:20141028:144746.581 resuming SNMP agent checks on host "suschevka-1-sw2": connection restored
2565:20141028:144831.886 SNMP agent item "locIfInCRC[FastEthernet0/14]" on host "suschevka-1-sw2" failed: first network error, wait for 15 seconds
2787:20141028:144846.630 resuming SNMP agent checks on host "suschevka-1-sw2": connection restored
2548:20141028:144930.968 SNMP agent item "locIfInGiants[FastEthernet0/27]" on host "suschevka-1-sw2" failed: first network error, wait for 15 seconds
2656:20141028:144945.722 resuming SNMP agent checks on host "suschevka-1-sw2": connection restored
2607:20141028:145031.838 SNMP agent item "ifInErrors[FastEthernet0/34]" on host "suschevka-1-sw2" failed: first network error, wait for 15 seconds
2719:20141028:145046.779 resuming SNMP agent checks on host "suschevka-1-sw2": connection restored
It looks like the connection is restored, but some of the data is missing or lost (the queue on the Zabbix server keeps growing for this host).
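To check how long each SNMP "first network error" lasts before "connection restored", the log lines can be paired per host. The sketch below assumes the `PID:YYYYMMDD:HHMMSS.mmm message` layout seen in the log above; the helper names `parse_line` and `recovery_delays` are hypothetical, and the sample input is copied from the first four SNMP lines of that log.

```python
import re
from datetime import datetime

# Assumed layout: "<pid>:<YYYYMMDD>:<HHMMSS.mmm> <message>"
LINE_RE = re.compile(r"^\s*\d+:(\d{8}):(\d{6}\.\d{3}) (.*)$")
HOST_RE = re.compile(r'on host "([^"]+)"')


def parse_line(line):
    """Return (timestamp, host, message), or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    host = HOST_RE.search(m.group(3))
    if host is None:
        return None
    ts = datetime.strptime(m.group(1) + m.group(2), "%Y%m%d%H%M%S.%f")
    return ts, host.group(1), m.group(3)


def recovery_delays(lines):
    """Pair each 'first network error' with the next 'connection restored'."""
    pending = {}  # host -> timestamp of the unanswered first error
    delays = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, host, msg = parsed
        if "first network error" in msg:
            pending[host] = ts
        elif "connection restored" in msg and host in pending:
            delays.append((host, (ts - pending.pop(host)).total_seconds()))
    return delays


SAMPLE_LOG = """
2595:20141028:144330.281 SNMP agent item "locIfInOverrun[Vlan20]" on host "suschevka-1-sw2" failed: first network error, wait for 15 seconds
2710:20141028:144345.142 resuming SNMP agent checks on host "suschevka-1-sw2": connection restored
2441:20141028:144430.466 SNMP agent item "locIfCollisions[FastEthernet0/44]" on host "suschevka-1-sw2" failed: first network error, wait for 15 seconds
2655:20141028:144445.234 resuming SNMP agent checks on host "suschevka-1-sw2": connection restored
"""

delays = recovery_delays(SAMPLE_LOG.splitlines())
for host, seconds in delays:
    print(host, round(seconds, 3))
```

For these sample lines both gaps come out just under 15 seconds, so each error is followed by a single successful recheck after the delay, with no "another network error" in between.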
|
|
It looks like this needs more investigation, not just closing the issue by adding more retries.
|
|
Could you please post tcpdump between Zabbix proxy and "suschevka-1-sw2" host, together with log that would show correspondence between captured traffic and log messages?
|
|
Attached a pcap and a screenshot of the queue increasing while the dump was being captured.
zabbix-proxy.log messages while the dump was being captured:
2329:20141028:152831.657 SNMP agent item "ifInErrors[FastEthernet0/35]" on host "suschevka-1-sw2" failed: first network error, wait for 15 seconds
2752:20141028:152846.809 resuming SNMP agent checks on host "suschevka-1-sw2": connection restored
2139:20141028:152933.840 SNMP agent item "locIfInOverrun[GigabitEthernet0/4]" on host "suschevka-1-sw2" failed: first network error, wait for 15 seconds
2678:20141028:152948.920 resuming SNMP agent checks on host "suschevka-1-sw2": connection restored
2128:20141028:153033.612 SNMP agent item "ifInErrors[FastEthernet0/25]" on host "suschevka-1-sw2" failed: first network error, wait for 15 seconds
2753:20141028:153048.061 resuming SNMP agent checks on host "suschevka-1-sw2": connection restored
2208:20141028:153130.951 SNMP agent item "locIfInOverrun[FastEthernet0/22]" on host "suschevka-1-sw2" failed: first network error, wait for 15 seconds
2684:20141028:153145.043 resuming SNMP agent checks on host "suschevka-1-sw2": connection restored
1868:20141028:153230.872 SNMP agent item "locIfInRunts[FastEthernet0/28]" on host "suschevka-1-sw2" failed: first network error, wait for 15 seconds
2715:20141028:153245.139 resuming SNMP agent checks on host "suschevka-1-sw2": connection restored
2450:20141028:153331.158 SNMP agent item "ifOutErrors[FastEthernet0/10]" on host "suschevka-1-sw2" failed: first network error, wait for 15 seconds
2677:20141028:153346.253 resuming SNMP agent checks on host "suschevka-1-sw2": connection restored
2284:20141028:153430.059 SNMP agent item "locIfInIgnored[FastEthernet0/31]" on host "suschevka-1-sw2" failed: first network error, wait for 15 seconds
2729:20141028:153445.443 resuming SNMP agent checks on host "suschevka-1-sw2": connection restored
2285:20141028:153530.185 SNMP agent item "ifOutErrors[FastEthernet0/39]" on host "suschevka-1-sw2" failed: first network error, wait for 15 seconds
2657:20141028:153545.504 resuming SNMP agent checks on host "suschevka-1-sw2": connection restored
|
|
#31162
It looks like there is only one try per item.
|
while true; do snmpget -r0 -t1 -v2c 10.0.2.163 -On -c hidden 1.3.6.1.2.1.2.2.1.5.10026; done
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
.1.3.6.1.2.1.2.2.1.5.10026 = Gauge32: 100000000
no errors....
|
|
By the way, look at picture #31162.
Why are there no retries within UnreachablePeriod? There is just one shot.
|
|
Upgraded to 2.4.2rc1.
The same errors keep occurring.
32188:20141028:175922.361 resuming SNMP agent checks on host "voronezh-1-sw1": connection restored
32172:20141028:175924.368 resuming SNMP agent checks on host "rostovdon-1-sw1": connection restored
32191:20141028:175926.384 resuming SNMP agent checks on host "ekaterinburg-1-sw1": connection restored
31947:20141028:175929.840 SNMP agent item "locIfInFrame[GigabitEthernet1/0/4]" on host "n.novgorod-cc-c3750" failed: first network error, wait for 15 seconds
31498:20141028:175929.979 SNMP agent item "locIfInIgnored[FastEthernet0/22]" on host "saratov-1-sw1" failed: first network error, wait for 15 seconds
32180:20141028:175944.362 resuming SNMP agent checks on host "n.novgorod-cc-c3750": connection restored
32069:20141028:175944.390 resuming SNMP agent checks on host "saratov-1-sw1": connection restored
31564:20141028:175948.983 SNMP agent item "IfHCOutOctets[GigabitEthernet0/2]" on host "novosibirsk-1-sw1" failed: first network error, wait for 15 seconds
31514:20141028:175957.865 SNMP agent item "locIfCollisions[FastEthernet0/23]" on host "mck1-bnd-s2960" failed: first network error, wait for 15 seconds
32153:20141028:180003.436 resuming SNMP agent checks on host "novosibirsk-1-sw1": connection restored
31457:20141028:180011.908 SNMP agent item "ifHCInOctets[GigabitEthernet2/0/10]" on host "ostrovnaya-1-c3750" failed: first network error, wait for 15 seconds
31334:20141028:180012.023 SNMP agent item "locIfInFrame[FastEthernet0/2]" on host "ekaterinburg-1-sw1" failed: first network error, wait for 15 seconds
32054:20141028:180012.366 resuming SNMP agent checks on host "mck1-bnd-s2960": connection restored
32147:20141028:180026.374 resuming SNMP agent checks on host "ostrovnaya-1-c3750": connection restored
32111:20141028:180027.399 resuming SNMP agent checks on host "ekaterinburg-1-sw1": connection restored
31741:20141028:180029.874 SNMP agent item "ifSpeed[FastEthernet0/18]" on host "saratov-1-sw1" failed: first network error, wait for 15 seconds
32120:20141028:180044.409 resuming SNMP agent checks on host "saratov-1-sw1": connection restored
|
|
Why does the SNMP check not retry getting the data?
The message "failed: first network error, wait for 15 seconds" confused me, because in reality there are no further checks or retries for SNMP, no unreachable poller, etc.
|
|
Based on "suschevka-1-sw2.pcap" and the log you have provided, most of your items are queried once every 60 seconds. Since Zabbix 2.2.3 (ZBXNEXT-98), all items on the same interface are queried at the same time, provided they have the same connection parameters. In your case, this is around 750 individual SNMP requests every minute (bulk has probably degraded to 1 item per request due to ZBX-8528). It is no surprise that some of these 750 requests get lost. Since Zabbix 2.2 (ZBX-4393), the proxy does not perform any retries, which is why you see network errors in the Zabbix log.
After these network errors, it can be seen that the host is recovered after 15 seconds, as promised:
Issue ZBXNEXT-2200 addresses the spike-like load on large SNMP devices and should help with these network errors. Issue ZBX-8538 adds retry of 1, which should help you, too.
Please let us know whether there is still anything that is unclear.
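As a rough illustration of why 750 single-packet requests per minute are likely to produce at least one network error, and why a single retry helps so much, here is a back-of-the-envelope estimate. The loss rate `P_LOSS` is a made-up value, not something measured in this issue, and losses are assumed to be independent, which real bursts of traffic may not satisfy.

```python
def p_any_failure(n_requests, p_loss, retries=0):
    """Probability that at least one of n_requests fails after all retries.

    Assumes every attempt is lost independently with probability p_loss,
    so a request fails only if all (retries + 1) attempts are lost.
    """
    p_request_fails = p_loss ** (retries + 1)
    return 1.0 - (1.0 - p_request_fails) ** n_requests


N_REQUESTS = 750   # requests per minute, from the analysis above
P_LOSS = 0.001     # hypothetical loss rate per request/response exchange

no_retry = p_any_failure(N_REQUESTS, P_LOSS, retries=0)
one_retry = p_any_failure(N_REQUESTS, P_LOSS, retries=1)
print(f"no retry : {no_retry:.4f}")
print(f"one retry: {one_retry:.6f}")
```

Under these assumptions, even a 0.1% loss rate gives roughly a 53% chance of at least one failed request per minute without retries, and well under 0.1% with one retry.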
|
|
Thank you, now everything is clear.
OK, so if some of the items that are queried once per hour get lost, I will always have a triggered event about "queue over 100 in 10m".
By the way, I updated to 2.4.2rc1, but bulk is still degraded to 1 item.
|
|
Version 2.4.2rc1 was not released yet. Did you take the code from svn? If so, which revision?
|
|
Zabbix 2.4.2rc1 (revision 50194)
was downloaded from http://www.zabbix.com/developers.php
|
|
ZBX-8538 was merged in r50194, so your copy should have that fix included. Do you have bulk requests enabled for device (the checkbox in SNMP interface configuration)? Could you please post tcpdump for this device (beginning right after proxy restart), so that we can see the request pattern that the proxy sends?
|
|
Here is the dump.
After a full day of running Zabbix 2.4.2rc1, I agree that the queue of missed items is gone.
|
|
Attached "suschevka-1-sw2-after-restart-proxy.pcap" shows no sign of SNMP bulk degrading to 1 variable per request. Rather, it has stabilized on 62 variables. Sometimes, those requests get lost and proxy retries with 31 variables, but the general kind of requests is with 62 variable bindings.
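To put the 62 and 31 variable bindings in perspective, the number of SNMP requests per polling round can be estimated as below. The figure of 750 items is the approximation from the earlier analysis of this device, and the sketch assumes every item is eligible for bulk processing; `requests_needed` is a hypothetical helper name.

```python
import math


def requests_needed(n_items, vars_per_request):
    """Number of SNMP requests needed to fetch n_items values."""
    return math.ceil(n_items / vars_per_request)


N_ITEMS = 750  # approximate item count on this device, per the earlier analysis

for vars_per_request in (1, 31, 62):
    print(vars_per_request, requests_needed(N_ITEMS, vars_per_request))
```

Going from 1 to 62 variables per request cuts the packet count per round from 750 to 13, so far fewer packets are exposed to loss in each burst.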
If you have any other issues to troubleshoot, using https://www.zabbix.org/wiki/Getting_help is appreciated.
|
Generated at Wed Apr 01 10:38:53 EEST 2026 using Jira 10.3.13#10030013-sha1:56dd970ae30ebfeda3a697d25be1f6388b68a422.