[ZBXNEXT-98] Use SNMP getbulk for OID retrieval Created: 2009 Oct 06  Updated: 2014 Apr 30  Resolved: 2014 Mar 13

Status: Closed
Project: ZABBIX FEATURE REQUESTS
Component/s: Proxy (P), Server (S)
Affects Version/s: None
Fix Version/s: 2.2.3, 2.3.0

Type: Change Request Priority: Major
Reporter: Gergely Czuczy Assignee: Unassigned
Resolution: Fixed Votes: 59
Labels: performance, snmp
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

SNMP


Issue Links:
Duplicate
is duplicated by ZBXNEXT-1210 Combining SNMP gets Closed
is duplicated by ZBXNEXT-456 Support for SNMP ifTable Closed
is duplicated by ZBX-6842 SNMP host not monitored Closed

 Description   

The basic idea is that the poller process could use SNMP getbulk for OID retrieval. Polling items individually can cost a lot more resources than doing them in batches, especially over SNMP. Instead of issuing a separate request for every OID, getbulk requests could be used.

The way I could imagine this is teaching the polling scheduler to take a look at the queue for the given host; if more SNMP items are waiting, they could be fetched using one or more getbulk requests instead of individually. This would make polling somewhat easier and faster.
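
For illustration with the Net-SNMP command-line tools (hypothetical host "myswitch", community "public"): today each OID costs its own request-response round trip,

> snmpget -v2c -c public myswitch IF-MIB::ifInOctets.1
> snmpget -v2c -c public myswitch IF-MIB::ifOutOctets.1
> snmpget -v2c -c public myswitch IF-MIB::ifOperStatus.1

while a single getbulk request (-Cn sets non-repeaters, -Cr sets max-repetitions) retrieves many values in one round trip:

> snmpbulkget -v2c -c public -Cn0 -Cr10 myswitch IF-MIB::ifInOctets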



 Comments   
Comment by Christophe Prevotaux [ 2009 Oct 06 ]

Another reason why this has to be implemented is that two closely related values will not be sampled at the same time in the current scheme of things; with snmpbulkget, polling related values will be an atomic operation.

For example, the SNMP values ifOutOctets and ifInOctets for a particular interface are related, and so is any value tied to that interface index, OID-wise.
All these values should be polled at the same time; otherwise the polled data are meaningless for time-critical comparisons (think of it as a data batch: the data belonging to one batch should be polled together).

Comment by richlv [ 2009 Oct 06 ]

i have a suspicion that coupling requests for a particular interface might be a slightly different issue.
the first suggestion only deals with checks that are closely scheduled, the second actually implies scheduling changes - i'd suggest filing that as a separate issue

Comment by Aleksandrs Saveljevs [ 2010 Jun 10 ]

Related issue: ZBXNEXT-391.

Comment by fmrapid [ 2010 Nov 30 ]

All SNMP queries to the same host with the same polling period should be executed together. This covers Christophe Prevotaux's comment as well as this ZBXNEXT-98. SNMP polling efficiency is a very big deal in terms of data accuracy, data retrieval delay and server processing. This will also improve dealing with timeouts.

Comment by Raymond Kuiper [ 2011 Apr 13 ]

I've been longing for SNMP bulk requests for a long time; it will relieve firewall state tables a lot when this is implemented.
I suggest grouping items with the same polling interval for a certain host together to achieve this.

Comment by Attilla de Groot [ 2012 Apr 28 ]

Any updates on this one?

Comment by Raymond Kuiper [ 2012 Aug 30 ]

I'm also still waiting for this... it would increase Zabbix SNMP performance a lot and decrease the load on monitored devices at the same time.
With LLD production ready, Zabbix is steadily becoming a better SNMP monitoring solution, and this feature will help Zabbix mature further in this field.

Comment by Andre Sachs [ 2012 Aug 30 ]

This would be a big win for high latency environments, specifically satellite based networks.

Comment by Florian Koch [ 2012 Oct 23 ]

This would be a huge improvement over the current situation.

Comment by Oleksii Zagorskyi [ 2013 Jan 09 ]

I hope it could help solve ZBX-5028.

Comment by Raymond Kuiper [ 2013 Jan 10 ]

It probably would.

Comment by Aleksandrs Saveljevs [ 2013 Nov 15 ]

Link to a discussion on net-snmp-users mailing list about querying multiple OIDs in a single GET request:

http://sourceforge.net/mailarchive/forum.php?thread_name=5285E4F3.4000400%40zabbix.com&forum_name=net-snmp-users

Comment by Dimitri Bellini [ 2013 Nov 15 ]

Hi, I have tested the problem posted on the net-snmp forum on some Brocade switches, and here is my report:


> head -90 walk.txt | cut -d' ' -f1 | xargs snmpget -d -r 0 -v3 -u admin -n VF:3 myswitch
> Sending 1909 bytes to UDP


With 90 values everything works without problems...

If I try with 100 values, the switch response is:


> head -100 walk.txt | cut -d' ' -f1 | xargs snmpget -d -r 0 -v3 -u admin -n VF:3 myswitch
> Received 111 bytes from UDP: [myipaddress]:161
0000: 30 6D 02 01 03 30 10 02 04 11 E2 CD 5D 02 02 08 0m...0......]...
0016: 00 04 01 00 02 01 03 04 1F 30 1D 04 0D 80 00 06 .........0......
0032: 34 B2 10 00 00 05 33 98 A2 00 02 01 17 02 03 28 4.....3........(
0048: 7B D3 04 00 04 00 04 00 30 35 04 0D 80 00 06 34 {.......05.....4
0064: B2 10 00 00 05 33 98 A2 00 04 00 A8 22 02 04 6F .....3......"..o
0080: BB 09 2B 02 01 00 02 01 00 30 14 30 12 06 0A 2B ........0.0...
0096: 06 01 06 03 0F 01 01 04 00 41 04 02 40 D1 67 .........A..@.g

snmpget: Too long


If someone needs other tests, please ask me.
Thanks

Comment by Aleksandrs Saveljevs [ 2013 Nov 20 ]

While I was thinking about the implementation for getting multiple SNMP values at once, I did some cleaning up and refactoring of our current SNMP code. This does not change the functionality in any way (except possibly fixing some bugs no one ever encountered), but it seems to me the code is now more pleasant to work with.

Improvements are available in development branch svn://svn.zabbix.com/branches/dev/ZBXNEXT-98 and it would be nice to merge them into the 2.2 branch before continuing with further development. sasha, please take a look at them.

sasha Successfully REVIEWED and TESTED.

Please review my changes in r40394.

asaveljevs Wonderful. Thank you! CLOSED.

asaveljevs Code improvements merged into pre-2.2.1 r40401 and pre-2.3.0 (trunk) r40402.

Comment by Raymond Kuiper [ 2013 Dec 06 ]

I'm very happy with seeing progress on this issue, thank you!
I hope it will make it into the 2.2.2 release.

Comment by Dimitri Bellini [ 2014 Jan 31 ]

Please add this feature as soon as possible; my customer's switches are flooded with Zabbix SNMP requests and sometimes the network stack stops responding to network connections.
Thanks

Comment by Aleksandrs Saveljevs [ 2014 Feb 12 ]

The feature is available in development branch svn://svn.zabbix.com/branches/dev/ZBXNEXT-98 .

I shall describe the general approach in the coming comments. Any documentation updates will be made after the testing is complete.

There are still some open questions I would like to discuss with the reviewer, such as (1) below, so the implementation still requires a bit of work. Other than that, the branch is ready for testing.

Comment by Aleksandrs Saveljevs [ 2014 Feb 12 ]

(1) There are two places where getting multiple values in a single request was implemented.

The first one is getting regular items or verifying indices for the dynamic SNMP item cache. This is done in function zbx_snmp_get_values() in src/zabbix_server/poller/checks_snmp.c. There, a GetRequest-PDU is used with multiple variable bindings.

The second one is walking the OID trees for dynamic SNMP item caching purposes and for SNMP discovery. This is done in function zbx_snmp_walk() in the same file. There, a GetNextRequest-PDU is used for SNMPv1 (same as before) and a GetBulkRequest-PDU is used for SNMPv2 and SNMPv3.
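
For reference, this is roughly how a GetBulkRequest-PDU is built and sent with the Net-SNMP API - a minimal standalone sketch against a hypothetical SNMPv2c agent "myswitch", not the Zabbix code (which also covers SNMPv1/v3, the dynamic cache and the error handling discussed below):

#include <string.h>

#include <net-snmp/net-snmp-config.h>
#include <net-snmp/net-snmp-includes.h>

int	main(void)
{
	netsnmp_session		session, *ss;
	netsnmp_pdu		*pdu, *response = NULL;
	netsnmp_variable_list	*var;
	oid			root[MAX_OID_LEN];
	size_t			root_len = MAX_OID_LEN;

	init_snmp("bulk-sketch");

	snmp_sess_init(&session);
	session.peername = "myswitch";			/* hypothetical target */
	session.version = SNMP_VERSION_2c;
	session.community = (u_char *)"public";
	session.community_len = strlen("public");

	if (NULL == (ss = snmp_open(&session)))
		return 1;

	pdu = snmp_pdu_create(SNMP_MSG_GETBULK);	/* GetBulkRequest-PDU */
	pdu->non_repeaters = 0;
	pdu->max_repetitions = 10;			/* the field discussed below */

	if (NULL == snmp_parse_oid("IF-MIB::ifInOctets", root, &root_len))
		return 1;

	snmp_add_null_var(pdu, root, root_len);

	/* up to max_repetitions successors of the OID arrive in one response */
	if (STAT_SUCCESS == snmp_synch_response(ss, pdu, &response) &&
			SNMP_ERR_NOERROR == response->errstat)
	{
		for (var = response->variables; NULL != var; var = var->next_variable)
			print_variable(var->name, var->name_length, var);
	}

	if (NULL != response)
		snmp_free_pdu(response);

	snmp_close(ss);

	return 0;
}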

Let's start with the first case. As mentioned on the Net-SNMP mailing list above, if we query a device for multiple values, it can either (a) return a proper response, (b) return a "tooBig(1)" error, or (c) not respond at all, yielding a timeout.

So we have to find the optimal number of values to query for each device. As currently implemented in the development branch, we start cautiously with 1 value per query. If that succeeds, the next time we query 2 values at once; if that succeeds, we query 3, and so on, up to 128 values in a single request.

If we query N values and the request fails (we either get a "tooBig(1)" error or a timeout), the general approach is to retry with N/2 values and remember that we should not try requesting N values again, but let N-1 be our maximum.

The approach is summarized more precisely in the following comment in the source code:

/* Since we are trying to obtain multiple values from the SNMP agent, the response that it has to  */
/* generate might be too big. It seems to be required by the SNMP standard that in such cases the  */
/* error status should be set to "tooBig(1)". However, some devices simply do not respond to such  */
/* queries and we get a timeout. Moreover, some devices exhibit both behaviors - they either send  */
/* "tooBig(1)" or do not respond at all. So what we do is halve the number of variables to query - */
/* it should work in the vast majority of cases, because, since we are now querying "num" values,  */
/* we know that querying "num/2" values succeeded previously. The case where it can still fail due */
/* to exceeded maximum response size is if we are now querying values that are unusually large. So */
/* if querying with half the number of the last values does not work either, we resort to querying */
/* values one by one, and the next time configuration cache gives us items to query, it will give  */
/* us less. */

So the expected server behavior is that it starts cautiously, with a higher load; then the load is gradually (but rather quickly) reduced as more and more values are queried in a single request, and it finally stabilizes once the limit for each device is determined.

Now, this is good, but would it be possible to come up with a better approach? The question is also relevant for GetBulkRequest-PDU, because some devices time out if the max-repetitions field is too high. If we implement the same "start with 1 and increase" strategy for walking, the problem is that the bulk effect will not be observed as soon as with regular polling, because items like discovery rules are meant to run less frequently. So, if we do, the limit counter should probably be shared with the one for GetRequest-PDU.

Currently, for GetBulkRequest-PDU, the max-repetitions field is hardcoded to 10 and this is something I would like to discuss with the reviewer and improve.

wiper I think we can safely use the same logic for the max-repetitions increase. While the bulk effect might build up more slowly than with regular polling, it also means there are fewer requests - so less load on the target system.

And, as discussed, instead of increasing the number of values per request by 1, we might multiply it by 1.5 to hit the limit faster.

asaveljevs I have implemented the strategy of multiplying by 1.5 and applied that to "max-repetitions", too, in r43428, r43429 and r43432. RESOLVED.

wiper A nice optimization would be to multiply the number by 1.5 until the first failure, then drop back to the last number that succeeded and continue incrementing it by 1 until it fails.
REOPENED

asaveljevs RESOLVED in r43461.

wiper CLOSED
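
To summarize the resulting strategy in code, here is a minimal sketch (illustrative only, not the actual Zabbix implementation): grow the batch by roughly a factor of 1.5 until the first failure, drop back to the last size that worked and probe upward by 1, and on any later "tooBig(1)" or timeout halve the batch while capping the maximum below the failing size.

#define SKETCH_MAX_SNMP_ITEMS	128	/* upper bound from the discussion above */

#define SKETCH_MIN(a, b)	((a) < (b) ? (a) : (b))
#define SKETCH_MAX(a, b)	((a) > (b) ? (a) : (b))

typedef struct
{
	int	max_vars;	/* number of values to put in the next request */
	int	last_ok;	/* last batch size that succeeded */
	int	limit;		/* learned cap; never request this many values again */
	int	probing;	/* 1 while still growing by 1.5, 0 after the first failure */
}
sketch_bulk_state_t;

static void	sketch_init(sketch_bulk_state_t *s)
{
	s->max_vars = 1;	/* start cautiously with 1 value per query */
	s->last_ok = 1;
	s->limit = SKETCH_MAX_SNMP_ITEMS;
	s->probing = 1;
}

static void	sketch_on_success(sketch_bulk_state_t *s)
{
	s->last_ok = s->max_vars;

	if (s->max_vars >= s->limit)
		return;

	if (1 == s->probing)	/* multiply by 1.5 (+1 so it cannot get stuck at 1) */
		s->max_vars = SKETCH_MIN(s->max_vars * 3 / 2 + 1, s->limit);
	else			/* after the first failure, increment by 1 */
		s->max_vars++;
}

static void	sketch_on_failure(sketch_bulk_state_t *s)	/* "tooBig(1)" or timeout */
{
	s->limit = SKETCH_MAX(s->max_vars - 1, 1);	/* do not try this size again */

	if (1 == s->probing)
	{
		s->probing = 0;
		s->max_vars = SKETCH_MAX(s->last_ok, 1);	/* drop back to the last size that worked */
	}
	else
		s->max_vars = SKETCH_MAX(s->max_vars / 2, 1);	/* retry with half the values */
}

As agreed above, the same counter would also drive the max-repetitions field of GetBulkRequest-PDUs.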

Comment by Aleksandrs Saveljevs [ 2014 Feb 12 ]

(2) Another issue to discuss is item scheduling. Up until now we used two strategies for spreading items in time: based on item ID and based on interface ID (for JMX items and, since ZBX-7649, for ICMP pings). In the development branch the seed for scheduling SNMP items was changed from item ID to interface ID, so all items on the interface will be queried at the same time.

This is good and should work in most cases. However, with large switches like the Cisco Nexus 9000, with hundreds or thousands of ports, there can be hundreds of thousands of items on a single host. Querying all of them at the same time is not ideal either.

So scheduling should be improved, and so far two approaches have been thought of (a bit).

One is to schedule items based on "itemid - itemid % modulo", where "modulo" is small for interfaces with few items and large for interfaces with a large number of items.

Another is to specify in the server configuration file how many pollers can process items for a single interface. For instance, at most 5 pollers per interface, so that hundreds of pollers do not assault a single device.

This issue is likely to be split into a separate ZBX or ZBXNEXT.

wiper Regarding the first idea - we could take interfaceid and itemid % modulo (or itemid & bitmask), then calculate a 32-bit checksum which would be used as the seed for the nextcheck calculation. And, as proposed, the modulo value would depend on the number of items on the interface.
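
A hedged sketch of that seed calculation (the checksum mix and the modulo policy here are illustrative assumptions, not the function Zabbix ended up using):

#include <stdint.h>

/* illustrative policy only: a small modulo keeps a small interface in one   */
/* chunk, a larger modulo gives a big interface bigger (but bounded) chunks, */
/* so its items are spread across several poll slots                         */
static uint64_t	sketch_modulo(int items_on_interface)
{
	if (items_on_interface <= 16)
		return 16;

	return 256;
}

/* mix interfaceid and the item's bucket into a 32-bit seed for nextcheck */
static uint32_t	sketch_poll_seed(uint64_t interfaceid, uint64_t itemid, int items_on_interface)
{
	uint64_t	modulo = sketch_modulo(items_on_interface);
	uint64_t	bucket = itemid - itemid % modulo;
	uint32_t	seed;

	seed = (uint32_t)(interfaceid ^ (interfaceid >> 32));
	seed ^= (uint32_t)(bucket ^ (bucket >> 32));

	return seed * 2654435761u;	/* Knuth's multiplicative hash constant */
}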

asaveljevs Issue split out to ZBXNEXT-2200. CLOSED.

Comment by Dimitri Bellini [ 2014 Feb 12 ]

I'm glad to see improvement on SNMP bulk, thanks so much!

PS: I understand the problems related to massive polling of a single host (single IP interface for polling), because we have the same problem on Brocade switches with nearly 300 ports (10 items per port).
I agree with your suggestion of a Zabbix configuration parameter to limit the pollers for a single interface.

Comment by richlv [ 2014 Mar 03 ]

(3) as for the docs - whatsnew of course; also the technical details from the description above must be added in some appropriate place, they are very useful

asaveljevs Updated documentation at the following locations:

asaveljevs RESOLVED.

wiper reviewed, CLOSED

Comment by Andris Zeila [ 2014 Mar 06 ]

(4) Processing of a single SNMP (also Java) type item differs slightly from bulk processing. In the first case the call chain looks like get_value() -> get_value_snmp() -> get_values_snmp(), while in the second case get_values_snmp() is called directly from get_values().

We could call get_values_snmp() directly also when a single SNMP item is retrieved by the poller, making the process flow easier to understand.

In this case we might also consider removing SNMP (and likewise Java) item processing from get_value().

asaveljevs RESOLVED in r43387. I have left get_value_snmp() and get_value_java(), because they are useful convenience functions - the first one is used in network discovery and the second in internal checks.
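
For illustration, such a convenience wrapper can be as thin as the following sketch (the parameter types and the exact get_values_snmp() signature are assumptions based on the function names above):

/* single-item entry point implemented on top of the bulk one, so that */
/* network discovery and the poller share the same code path           */
static int	sketch_get_value_snmp(const DC_ITEM *item, AGENT_RESULT *result)
{
	int	errcode = SUCCEED;

	/* the bulk function takes parallel arrays of items, results */
	/* and error codes; here each array has length 1             */
	get_values_snmp(item, result, &errcode, 1);

	return errcode;
}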

wiper CLOSED

Comment by Andris Zeila [ 2014 Mar 06 ]

(5) dbconfig.c:3098

	return memcmp(&s1->snmp_community, &s2->snmp_community, 5 * ZBX_PTR_SIZE + 4 * sizeof(unsigned char));

It might be better to perform separate checks for each structure member to avoid any problems with member padding and future changes in structure members.
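
A sketch of the member-by-member version (the structure and member names are assumptions chosen to match "5 * ZBX_PTR_SIZE + 4 * sizeof(unsigned char)"; pointer comparison is assumed to be sufficient because the strings are interned in the configuration cache, which is what memcmp() over the pointer members already relied on):

typedef struct
{
	const char	*snmp_community;
	const char	*snmpv3_securityname;
	const char	*snmpv3_authpassphrase;
	const char	*snmpv3_privpassphrase;
	const char	*snmpv3_contextname;
	unsigned char	snmpv3_securitylevel;
	unsigned char	snmpv3_authprotocol;
	unsigned char	snmpv3_privprotocol;
	unsigned char	snmp_bulk;
}
sketch_snmp_settings_t;

/* compare members one by one instead of memcmp() over a raw byte range, */
/* so struct padding and future member changes cannot break the result   */
static int	sketch_snmp_settings_equal(const sketch_snmp_settings_t *s1, const sketch_snmp_settings_t *s2)
{
	return s1->snmp_community == s2->snmp_community &&
			s1->snmpv3_securityname == s2->snmpv3_securityname &&
			s1->snmpv3_authpassphrase == s2->snmpv3_authpassphrase &&
			s1->snmpv3_privpassphrase == s2->snmpv3_privpassphrase &&
			s1->snmpv3_contextname == s2->snmpv3_contextname &&
			s1->snmpv3_securitylevel == s2->snmpv3_securitylevel &&
			s1->snmpv3_authprotocol == s2->snmpv3_authprotocol &&
			s1->snmpv3_privprotocol == s2->snmpv3_privprotocol &&
			s1->snmp_bulk == s2->snmp_bulk;
}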

asaveljevs RESOLVED in r43391 and r43392.

wiper CLOSED

Comment by Andris Zeila [ 2014 Mar 13 ]

(6) Instead of adding a lock parameter to the DCconfig_get_interface_snmp_stats() function, it would be better to move its implementation, without locking, to a static function and call that directly from DCconfig_get_poller_items(), while leaving DCconfig_get_interface_snmp_stats() as a public (and locking) wrapper around it.
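
As a sketch of that shape (a plain pthread mutex stands in here for the configuration cache lock; names follow the comment above, bodies are illustrative):

#include <stdint.h>
#include <pthread.h>

static pthread_mutex_t	config_lock = PTHREAD_MUTEX_INITIALIZER;

/* all the real work, without locking: callers that already hold the */
/* lock, like DCconfig_get_poller_items(), call this one directly    */
static void	dc_get_interface_snmp_stats(uint64_t interfaceid, int *max_snmp_succeed, int *min_snmp_fail)
{
	/* ... read the per-interface statistics from the configuration cache ... */
}

/* the public function remains a locking wrapper around the helper */
void	DCconfig_get_interface_snmp_stats(uint64_t interfaceid, int *max_snmp_succeed, int *min_snmp_fail)
{
	pthread_mutex_lock(&config_lock);
	dc_get_interface_snmp_stats(interfaceid, max_snmp_succeed, min_snmp_fail);
	pthread_mutex_unlock(&config_lock);
}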

asaveljevs RESOLVED in r43464.

wiper CLOSED

Comment by Andris Zeila [ 2014 Mar 14 ]

Looks good, tested.

Comment by Aleksandrs Saveljevs [ 2014 Mar 14 ]

Available in pre-2.2.3 r43468 and pre-2.3.0 (trunk) r43469.

Comment by Raymond Kuiper [ 2014 Apr 15 ]

Just wanted to mention that this has reduced the number of entries in our firewall session tables dramatically; thank you very much for implementing this feature!!

Comment by jpka [ 2014 Apr 18 ]

Please make this feature optional.
Related reading:
https://www.zabbix.com/forum/showthread.php?t=45001

Comment by Lane Bryson [ 2014 Apr 29 ]

I echo the sentiment of jpka... it's killing my NetApps, to the point that I can't monitor them with Zabbix. jpka, is there a ticket for your issue? I'll vote on it.

Comment by Aleksandrs Saveljevs [ 2014 Apr 30 ]

Lane, jpka's issue is ZBX-8145, but it is not related to NetApp. I would propose investigating this issue in a forum thread and opening a ZBX issue once we find out the cause.

This thread might be most suitable: https://www.zabbix.com/forum/showthread.php?t=45200 .
