[ZBX-8610] Zabbix poller bulk SNMP error

Created: 2014 Aug 12
Updated: 2017 May 30
Resolved: 2014 Aug 19
Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Proxy (P), Server (S)
Affects Version/s: 2.2.5
Fix Version/s: None
Type: Incident report
Priority: Critical
Reporter: Raimonds Treimanis
Assignee: Unassigned
Resolution: Duplicate
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Attachments:
Issue Links:
Description

Probably it's somehow connected with

Comments
Comment by Aleksandrs Saveljevs [ 2014 Aug 12 ] |
It might be that there is a bug, but there was certainly no conscious decision to have ICMP checks affect SNMP and vice versa - the code for the two check types is completely independent.

How many SNMP items a poller takes for processing depends on many factors. For a poller to take a bunch of SNMP items to be queried simultaneously, they must all have the same type and the same connection parameters, be ordinary items or items with dynamic indices, not be a discovery rule, not use macros, etc.

Could you please describe whether all your SNMP items are regular ones of the same type? Are you using SNMP items with dynamic indices? Low-level discovery? Macros in OIDs? If you could obfuscate your *.pcap file and attach it here, or send it to me without obfuscation, I would be happy to take a look at it.
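The grouping rules above can be sketched as a predicate. This is a hypothetical illustration, not Zabbix's actual C code, and the field names (`type`, `interface`, `is_discovery_rule`, `oid`) are assumptions made for the example:

```python
# Hypothetical sketch of the bundling rules described above: a poller only
# puts two SNMP items into one bulk request when they share a type and
# connection parameters and need no special handling (discovery rules,
# macros in OIDs, etc.). Field names are invented for illustration.

def can_bundle(a, b):
    """Return True if items a and b may share one bulk SNMP request."""
    return (a["type"] == b["type"]                # same item type
            and a["interface"] == b["interface"]  # same host/port/community
            and not a["is_discovery_rule"]        # discovery rules go alone
            and not b["is_discovery_rule"]
            and "{" not in a["oid"]               # no user macros in the OID
            and "{" not in b["oid"])

item1 = {"type": "SNMPv2", "interface": ("10.0.0.1", 161, "public"),
         "is_discovery_rule": False, "oid": "IF-MIB::ifInOctets.1"}
item2 = {"type": "SNMPv2", "interface": ("10.0.0.1", 161, "public"),
         "is_discovery_rule": False, "oid": "IF-MIB::ifInOctets.2"}
print(can_bundle(item1, item2))  # True
```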
Comment by Raimonds Treimanis [ 2014 Aug 13 ] |
Most of the SNMP items come from LLD and use dynamic indexes.
Comment by Aleksandrs Saveljevs [ 2014 Aug 13 ] |
While you are working on the *.pcap files, could you please attach the template you are using for this host (or host configuration itself, if there are host-specific items)? |
Comment by Raimonds Treimanis [ 2014 Aug 13 ] |
Attached. |
Comment by Aleksandrs Saveljevs [ 2014 Aug 13 ] |
A quick look at the attached template shows that the SNMP items are all SNMPv2 and seem to have the same connection parameters; no user macros are used in OIDs, and there are no items with dynamic indices (i.e., those of the form IF-MIB::ifPhysAddress["index","IF-MIB::ifDescr","Ethernet"]).
Comment by Raimonds Treimanis [ 2014 Aug 14 ] |
Can't find any tool capable of removing the SNMP community from a pcap. Any suggestions?
Comment by Aleksandrs Saveljevs [ 2014 Aug 14 ] |
Community strings seem to be stored in plain text in *.pcap files, so something like the following might work:

$ sed -i 's/public/hidden/g' capture.pcap
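Note that this trick relies on "public" and "hidden" having the same byte length: pcap record headers store packet lengths, so a substitution that changes the length would corrupt the capture. A minimal Python sketch of the same idea, with a length guard (file names are placeholders):

```python
# Hedged sketch: byte-level community-string obfuscation in a pcap file.
# Safe only when the replacement keeps the same byte length, because pcap
# record headers encode the captured packet lengths.

def obfuscate(data: bytes, old: bytes, new: bytes) -> bytes:
    if len(old) != len(new):
        raise ValueError("replacement must keep the same byte length")
    return data.replace(old, new)

# usage (placeholder file names):
# with open("capture.pcap", "rb") as f:
#     data = f.read()
# with open("capture-hidden.pcap", "wb") as f:
#     f.write(obfuscate(data, b"public", b"hidden"))
```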
Comment by Raimonds Treimanis [ 2014 Aug 15 ] |
OK, attached the dumps for both screenshots.
Comment by Aleksandrs Saveljevs [ 2014 Aug 15 ] |
I took a look at the first dump file, "dumpx", which corresponds to "icmp1.png". Here are the results of my investigation.

First, note that a lot of items are checked at 14:54:37: there are no SNMP packets half a minute before that and none half a minute after. These items were all scheduled to be checked at the same time, but since there were many of them, they were processed by multiple pollers. Observing the traffic at 14:54:37, we see that requests start at 1 OID and successively grow up to 63 OIDs per request. This is all good and expected.

Now, note that some of the packets at 14:54:37 do not have a response packet. To see them in Wireshark: select a packet, find its "request-id" field, right-click and choose "Apply as Column"; add the filter "frame.number >= 2904 && frame.number <= 3222" to isolate the packets at 14:54:37; finally, sort by the "request-id" column and see which request packets have no corresponding response. The result is partially shown in "icmp1-no-response.png", where I have marked the packets without a response in black. They all have either 3 or 4 OIDs in the request. My guess is that the network or the device could not cope with that many packets and dropped some of them.

Based on the investigation, my guess is that the "Timeout" configuration parameter is 30 seconds. Consequently, after going up to 63 OIDs per request at 14:54:37 and getting a timeout on the marked packets 30 seconds later, Zabbix recorded 3 as the minimum size of the failed request. It then halved the number of OIDs in the request, and at 14:55:07 asked for "1.3.6.1.4.1.9.9.166.1.15.1.1.17.1029739156.65536" (the first OID in packet #2951) in packet #3231 (seen in "icmp1.png"). It got the response in packet #3233, then asked for "1.3.6.1.4.1.9.9.166.1.15.1.1.17.1029739172.65536" and "1.3.6.1.4.1.9.9.166.1.15.1.1.17.1029739220.196608" (the other two OIDs in packet #2951) in packet #3235. This is the correct retry mechanism described in https://www.zabbix.com/documentation/2.2/manual/config/items/itemtypes/snmp#internal_workings_of_bulk_processing .

The above scenario is why Zabbix dropped the number of OIDs in the request to 2. I have not investigated further, but there is probably another scenario a bit later that caused Zabbix to drop it to 1. Thus, it works as designed, and I propose to consider this issue a duplicate of
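The retry behaviour analysed above can be sketched roughly as follows. This is an illustrative Python model, not the actual server C code; the function and parameter names are invented, and it simplifies the documented algorithm to its core idea (on timeout, halve the request size and cap future requests):

```python
# Illustrative model of the bulk SNMP retry logic described in the linked
# documentation: the poller sends up to max_vars OIDs per request; on a
# timeout it halves the failed request size, remembers that as the new
# upper bound, and retries the same OIDs with smaller requests.

def query_bulk(oids, max_vars, send):
    """Query `oids`; `send(chunk)` returns a result dict or None on timeout.
    Returns (results, final max_vars)."""
    results = {}
    while oids:
        n = min(len(oids), max_vars)
        chunk, rest = oids[:n], oids[n:]
        resp = send(chunk)
        if resp is None:                  # timeout on this request
            if n == 1:
                raise TimeoutError("no response even to a single-OID request")
            max_vars = max(n // 2, 1)     # halve and cap future request sizes
            continue                      # retry the same OIDs, smaller chunks
        results.update(resp)
        oids = rest
    return results, max_vars
```

For example, a device that silently drops requests carrying more than 3 OIDs forces the model down from 8 OIDs per request to 2, after which all items are still collected, just in smaller batches.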