[ZBX-8610] Zabbix poller bulk SNMP error Created: 2014 Aug 12  Updated: 2017 May 30  Resolved: 2014 Aug 19

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Proxy (P), Server (S)
Affects Version/s: 2.2.5
Fix Version/s: None

Type: Incident report
Priority: Critical
Reporter: Raimonds Treimanis
Assignee: Unassigned
Resolution: Duplicate
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: HTML File dumpx     HTML File dumpx2     PNG File icmp1-no-response.png     PNG File icmp1.png     PNG File icmp2.png     XML File zbx_export_templates (1).xml    
Issue Links:
Duplicate
duplicates ZBX-8528 random lost UDP packets lead to not b... Closed

 Description   

This is probably connected with ZBX-8528 somehow. The algorithm by which the SNMP poller decides how many OIDs to get per request seems to be buggy.
I have noticed that on some hosts I start to get "First network error" in the log right after a proxy restart, so I decided to sniff some traffic.
From the dumps I see that right after the restart the poller starts to increase the number of OIDs per request, as intended. Then an ICMP ping check with 5 packets to the same host suddenly kicks in, and right after that the SNMP poller drops the number of OIDs per request to 1-3.
Perhaps it takes those ICMP checks into account and decides that only 5 OIDs were successful?



 Comments   
Comment by Aleksandrs Saveljevs [ 2014 Aug 12 ]

It might be that there is a bug, but there was certainly no conscious decision to have ICMP checks affect SNMP and vice versa - the code for the two check types is completely independent.

How many SNMP items a poller takes for processing depends on many factors. In order for a poller to take a bunch of SNMP items to be queried simultaneously, they should all have the same type and the same connection parameters, be either ordinary items or items with dynamic indices, not be a discovery rule, not use macros, etc. Could you please describe whether all your SNMP items are regular ones and of the same type? Are you using SNMP items with dynamic indices? Low-level discovery? Macros in OIDs?
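
To make the batching conditions above concrete, here is a minimal Python sketch. It is only an illustration of the rules as listed in this comment, not Zabbix's actual code: the `SnmpItem` record, its field names, and `group_for_bulk` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnmpItem:
    # Hypothetical item record; these field names are illustrative only.
    oid: str
    interface: str          # e.g. "10.0.0.1:161"
    snmp_version: str       # item type, e.g. "v2c"
    community: str          # stands in for "connection parameters"
    dynamic_index: bool = False
    discovery_rule: bool = False
    has_macro: bool = False


def group_for_bulk(items):
    """Split items into batches that may share a request, plus leftovers."""
    batches = {}
    singles = []
    for item in items:
        # Discovery rules and OIDs with macros are assumed not to be batched.
        if item.discovery_rule or item.has_macro:
            singles.append(item)
            continue
        # Same type, same connection parameters, and the same kind
        # (ordinary vs. dynamic index) end up in the same batch.
        key = (item.interface, item.snmp_version, item.community,
               item.dynamic_index)
        batches.setdefault(key, []).append(item)
    return list(batches.values()), singles


items = [
    SnmpItem("1.3.6.1.2.1.1.3.0", "10.0.0.1:161", "v2c", "public"),
    SnmpItem("1.3.6.1.2.1.1.5.0", "10.0.0.1:161", "v2c", "public"),
    SnmpItem("1.3.6.1.2.1.2.2.1.2", "10.0.0.1:161", "v2c", "public",
             discovery_rule=True),
]
batches, singles = group_for_bulk(items)
```

With this sample input the two ordinary items land in one batch and the discovery rule is kept apart.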

If you could obfuscate your *.pcap file and attach here, or send it to me without obfuscation, I would be happy to take a look at it.

Comment by Raimonds Treimanis [ 2014 Aug 13 ]

Most of the SNMP items are LLD and use dynamic indexes.
And Zabbix can indeed bunch them up, because I can see it doing so in my dumps, with up to 50-70 OIDs per request.
I will look at the pcaps later and try to remove sensitive info.

Comment by Aleksandrs Saveljevs [ 2014 Aug 13 ]

While you are working on the *.pcap files, could you please attach the template you are using for this host (or host configuration itself, if there are host-specific items)?

Comment by Raimonds Treimanis [ 2014 Aug 13 ]

Attached.

Comment by Aleksandrs Saveljevs [ 2014 Aug 13 ]

A quick look at the attached template showed that SNMP items are all SNMPv2, seem to have the same connection parameters, no user macros are used in OIDs, and there are no items with dynamic indices (i.e., those of the form IF-MIB::ifPhysAddress["index","IF-MIB::ifDescr","Ethernet"]).

Comment by Raimonds Treimanis [ 2014 Aug 14 ]

I can't find any tool capable of removing the SNMP community from a pcap. Any suggestions?

Comment by Aleksandrs Saveljevs [ 2014 Aug 14 ]

Community strings seem to be stored in plain text in *.pcap files, so something like the following might work:

$ sed -i 's/public/hidden/g' capture.pcap
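
The same idea can be sketched in Python with an explicit length check. The replacement should have the same length as the original (as "public" and "hidden" do), because both the pcap record headers and the BER-encoded SNMP message carry length fields that a longer or shorter string would invalidate. The function names `mask_community` and `mask_file` are hypothetical, and, like the sed command, this replaces every occurrence of the byte string, not only community fields.

```python
def mask_community(data: bytes, old: bytes, new: bytes) -> bytes:
    """Replace every occurrence of `old` with `new`, keeping lengths intact."""
    if len(old) != len(new):
        # A different length would break pcap and BER length fields.
        raise ValueError("replacement must have the same length as the original")
    return data.replace(old, new)


def mask_file(src_path: str, dst_path: str, old: bytes, new: bytes) -> None:
    """Read a capture file, mask the community string, write a new file."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(mask_community(data, old, new))


# Example on an in-memory fragment: OCTET STRING tag, length 6, "public".
sample = b"\x04\x06public\xa0"
masked = mask_community(sample, b"public", b"hidden")
```

Writing to a separate output file, rather than editing in place, keeps the original capture available in case the result needs to be checked in Wireshark.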

Comment by Raimonds Treimanis [ 2014 Aug 15 ]

OK, I have attached dumps for both screenshots.

Comment by Aleksandrs Saveljevs [ 2014 Aug 15 ]

I took a look at the first dump file, "dumpx", which corresponds to "icmp1.png", and here are the results of my investigation.

First of all, let's note that there are a lot of items that are checked at 14:54:37. There are no SNMP packets half a minute prior to that and no SNMP packets half a minute after that. These items were all scheduled to be checked at the same time. However, since there were many of them, they were processed by multiple pollers. If we observe traffic at 14:54:37, we will see that packets start at 1 OID per request and successively grow up to 63 items per request. This is all good and expected.

Now, let's note that some of the packets at 14:54:37 do not have a response packet. In order to see these packets in Wireshark, select a packet, find the "request-id" field in it, right-click it and select "Apply as Column". Then, add the filter "frame.number >= 2904 && frame.number <= 3222" - this will filter packets at 14:54:37. Finally, sort packets by the "request-id" column and see which request packets do not have a corresponding response packet. This result is partially shown in "icmp1-no-response.png" - I have marked packets which do not have a response with black. They all have either 3 or 4 OIDs in the request. My guess is that the network or the device could not cope with that many packets and dropped some of them.
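
The manual matching described above can also be sketched in Python, assuming the packets have already been exported as (frame number, request-id, is-response) tuples. The function `unanswered_requests` and the sample request-id values are hypothetical and are not taken from the attached dumps.

```python
def unanswered_requests(packets):
    """Return (frame, request_id) pairs for requests with no response.

    `packets` is an iterable of (frame_number, request_id, is_response).
    """
    requests = {}     # request_id -> frame number of the first request seen
    answered = set()  # request_ids for which a response was seen
    for frame, request_id, is_response in packets:
        if is_response:
            answered.add(request_id)
        else:
            requests.setdefault(request_id, frame)
    return sorted(
        (frame, request_id)
        for request_id, frame in requests.items()
        if request_id not in answered
    )


# Made-up sample: request-id 101 never gets a response.
sample_packets = [
    (2904, 100, False),
    (2905, 100, True),
    (2951, 101, False),
    (2960, 102, False),
    (2961, 102, True),
]
missing = unanswered_requests(sample_packets)
```

For the sample data, only the request in frame 2951 is reported as unanswered.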

Based on the investigation, my guess is that "Timeout" configuration parameter is 30 seconds. Consequently, after going up to 63 items per request at 14:54:37 and getting a timeout on these other marked packets 30 seconds later, Zabbix set 3 as the minimum size of the failed request. So, it halved the number of OIDs in the request and at 14:55:07 asked for "1.3.6.1.4.1.9.9.166.1.15.1.1.17.1029739156.65536" (the first OID in packet #2951) in packet #3231 (this is seen in "icmp1.png"). It got the response in packet #3233, and asked for "1.3.6.1.4.1.9.9.166.1.15.1.1.17.1029739172.65536" and "1.3.6.1.4.1.9.9.166.1.15.1.1.17.1029739220.196608" (the other two OIDs in packet #2951) in packet #3235. This is the correct retry mechanism described in https://www.zabbix.com/documentation/2.2/manual/config/items/itemtypes/snmp#internal_workings_of_bulk_processing .
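
The retry seen in the dump (a failed 3-OID request followed by a 1-OID request and then a 2-OID request) can be illustrated with a small Python sketch. It models only the behaviour observed in this capture, under the assumption that the first retry takes half of the OIDs rounded down; the function `split_failed_request` is hypothetical and is not Zabbix's implementation.

```python
def split_failed_request(oids):
    """Split the OIDs of a failed request into two smaller retry requests."""
    # Assumption: the first retry carries half of the OIDs, rounded down,
    # but never fewer than one.
    half = max(1, len(oids) // 2)
    return oids[:half], oids[half:]


# The three OIDs of packet #2951, as quoted in the comment above.
failed = [
    "1.3.6.1.4.1.9.9.166.1.15.1.1.17.1029739156.65536",
    "1.3.6.1.4.1.9.9.166.1.15.1.1.17.1029739172.65536",
    "1.3.6.1.4.1.9.9.166.1.15.1.1.17.1029739220.196608",
]
first_retry, second_retry = split_failed_request(failed)
```

Here `first_retry` corresponds to the single OID requested in packet #3231 and `second_retry` to the two OIDs requested in packet #3235.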

The above scenario is the reason Zabbix dropped the number of OIDs in the request to 2. I have not investigated further, but there is probably another scenario a bit later that caused Zabbix to drop the number of OIDs in the request to 1. Thus, it works as designed and I propose to consider this issue as a duplicate of ZBX-8528, with a potential temporary solution in the form of ZBX-8538.

Generated at Fri Apr 04 14:37:28 EEST 2025 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.