[ZBXNEXT-98] Use SNMP getbulk for OID retrieval Created: 2009 Oct 06  Updated: 2014 Apr 30  Resolved: 2014 Mar 13

Status: Closed
Project: ZABBIX FEATURE REQUESTS
Component/s: Proxy (P), Server (S)
Affects Version/s: None
Fix Version/s: 2.2.3, 2.3.0

Type: Change Request Priority: Major
Reporter: Gergely Czuczy Assignee: Unassigned
Resolution: Fixed Votes: 59
Labels: performance, snmp
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

SNMP


Issue Links:
Duplicate
is duplicated by ZBXNEXT-1210 Combining SNMP gets Closed
is duplicated by ZBXNEXT-456 Support for SNMP ifTable Closed
is duplicated by ZBX-6842 SNMP host not monitored Closed

 Description   

The basic idea is that the poller process could use SNMP getbulk for OID retrieval. Polling items individually can cost a lot more resources than doing them in batches, especially over SNMP. Instead of issuing a separate request for every OID, getbulk requests could be used.

The way I could imagine this is teaching the polling scheduler to take a look at the queue for the given host; if more SNMP items are waiting, they could be fetched using one or more getbulk requests instead of individually. This would make polling somewhat easier and faster.
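
For illustration with the Net-SNMP command-line tools (hypothetical host "myswitch", community "public"): today each OID costs its own request-response round trip,

> snmpget -v2c -c public myswitch IF-MIB::ifInOctets.1
> snmpget -v2c -c public myswitch IF-MIB::ifOutOctets.1
> snmpget -v2c -c public myswitch IF-MIB::ifOperStatus.1

while a single getbulk request (-Cn sets non-repeaters, -Cr sets max-repetitions) retrieves many values in one round trip:

> snmpbulkget -v2c -c public -Cn0 -Cr10 myswitch IF-MIB::ifInOctets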



 Comments   
Comment by Christophe Prevotaux [ 2009 Oct 06 ]

Another reason why this has to be implemented is that two closely related values will not be sampled at the same time in the current scheme of things; with snmpbulkget, polling related values will be an atomic operation.

For example, the SNMP values ifOutOctets and ifInOctets for a particular interface are related, and so is any value tied to that interface index, OID-wise.
All these values should be polled at the same time; otherwise the polled data are meaningless for time-critical comparisons (think of it as a data batch: the data belonging to one batch should be polled together).

Comment by richlv [ 2009 Oct 06 ]

i have a suspicion that coupling requests for a particular interface might be a slightly different issue.
the first suggestion only deals with checks that are closely scheduled, the second actually implies scheduling changes - i'd suggest filing that as a separate issue

Comment by Aleksandrs Saveljevs [ 2010 Jun 10 ]

Related issue: ZBXNEXT-391.

Comment by fmrapid [ 2010 Nov 30 ]

All SNMP queries to the same host with the same polling period should be executed together. This covers Christophe Prevotaux's comment as well as this ZBXNEXT-98. SNMP polling efficiency is a very big deal in terms of data accuracy, data retrieval delay and server processing. This will also improve dealing with timeouts.

Comment by Raymond Kuiper [ 2011 Apr 13 ]

I've been longing for SNMP bulk requests for a long time; it will relieve firewall state tables a lot when this is implemented.
I suggest grouping items with the same polling interval for a certain host together to achieve this.

Comment by Attilla de Groot [ 2012 Apr 28 ]

Any updates on this one?

Comment by Raymond Kuiper [ 2012 Aug 30 ]

I'm also still waiting for this... it would increase Zabbix SNMP performance a lot and decrease the load on monitored devices at the same time.
With LLD production ready, Zabbix is steadily becoming a better SNMP monitoring solution, and this feature will help Zabbix mature further in this field.

Comment by Andre Sachs [ 2012 Aug 30 ]

This would be a big win for high latency environments, specifically satellite based networks.

Comment by Florian Koch [ 2012 Oct 23 ]

This would be a huge improvement over the current situation.

Comment by Oleksii Zagorskyi [ 2013 Jan 09 ]

I hope it could help solve ZBX-5028.

Comment by Raymond Kuiper [ 2013 Jan 10 ]

It probably would.

Comment by Aleksandrs Saveljevs [ 2013 Nov 15 ]

Link to a discussion on net-snmp-users mailing list about querying multiple OIDs in a single GET request:

http://sourceforge.net/mailarchive/forum.php?thread_name=5285E4F3.4000400%40zabbix.com&forum_name=net-snmp-users

Comment by Dimitri Bellini [ 2013 Nov 15 ]

Hi, I have tested the problem posted on the net-snmp forum on some Brocade switches, and here is my report:


> head -90 walk.txt | cut -d' ' -f1 | xargs snmpget -d -r 0 -v3 -u admin -n VF:3 myswitch
> Sending 1909 bytes to UDP


With 90 values everything works without problems...

If I try with 100 values, the switch response is:


> head -100 walk.txt | cut -d' ' -f1 | xargs snmpget -d -r 0 -v3 -u admin -n VF:3 myswitch
> Received 111 bytes from UDP: [myipaddress]:161
0000: 30 6D 02 01 03 30 10 02 04 11 E2 CD 5D 02 02 08 0m...0......]...
0016: 00 04 01 00 02 01 03 04 1F 30 1D 04 0D 80 00 06 .........0......
0032: 34 B2 10 00 00 05 33 98 A2 00 02 01 17 02 03 28 4.....3........(
0048: 7B D3 04 00 04 00 04 00 30 35 04 0D 80 00 06 34 {.......05.....4
0064: B2 10 00 00 05 33 98 A2 00 04 00 A8 22 02 04 6F .....3......"..o
0080: BB 09 2B 02 01 00 02 01 00 30 14 30 12 06 0A 2B ........0.0...
0096: 06 01 06 03 0F 01 01 04 00 41 04 02 40 D1 67 .........A..@.g

snmpget: Too long


If someone needs other tests, please ask me.
Thanks

Comment by Aleksandrs Saveljevs [ 2013 Nov 20 ]

While I was thinking about the implementation for getting multiple SNMP values at once, I did some cleaning up and refactoring of our current SNMP code. This does not change the functionality in any way (except possibly fixing some bugs no one ever encountered), but it seems to me the code is now more pleasant to work with.

Improvements are available in development branch svn://svn.zabbix.com/branches/dev/ZBXNEXT-98 and it would be nice to merge them into the 2.2 branch before continuing with further development. sasha, please take a look at them.

sasha Successfully REVIEWED and TESTED.

Please review my changes in r40394.

asaveljevs Wonderful. Thank you! CLOSED.

asaveljevs Code improvements merged into pre-2.2.1 r40401 and pre-2.3.0 (trunk) r40402.

Comment by Raymond Kuiper [ 2013 Dec 06 ]

I'm very happy with seeing progress on this issue, thank you!
I hope it will make it into the 2.2.2 release.

Comment by Dimitri Bellini [ 2014 Jan 31 ]

Please add this feature as soon as possible; my customer's switches are flooded with Zabbix SNMP requests and sometimes the network stack stops responding to network connections.
Thanks

Comment by Aleksandrs Saveljevs [ 2014 Feb 12 ]

The feature is available in development branch svn://svn.zabbix.com/branches/dev/ZBXNEXT-98 .

I shall describe the general approach in the coming comments. Any documentation updates will be made after the testing is complete.

There are still some open questions I would like to discuss with the reviewer, such as (1) below, so the implementation still requires a bit of work. Other than that, the branch is ready for testing.

Comment by Aleksandrs Saveljevs [ 2014 Feb 12 ]

(1) There are two places where getting multiple values in a single request was implemented.

The first one is getting regular items or verifying indices for the dynamic SNMP item cache. This is done in function zbx_snmp_get_values() in src/zabbix_server/poller/checks_snmp.c. There, a GetRequest-PDU is used with multiple variable bindings.

The second one is walking the OID trees for dynamic SNMP item caching purposes and for SNMP discovery. This is done in function zbx_snmp_walk() in the same file. There, a GetNextRequest-PDU is used for SNMPv1 (same as before) and a GetBulkRequest-PDU is used for SNMPv2 and SNMPv3.
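
For reference, this is roughly how a GetBulkRequest-PDU is built and sent with the Net-SNMP API - a minimal standalone sketch against a hypothetical SNMPv2c agent "myswitch", not the Zabbix code (which also covers SNMPv1/v3, the dynamic cache and the error handling discussed below):

#include <string.h>

#include <net-snmp/net-snmp-config.h>
#include <net-snmp/net-snmp-includes.h>

int	main(void)
{
	netsnmp_session		session, *ss;
	netsnmp_pdu		*pdu, *response = NULL;
	netsnmp_variable_list	*var;
	oid			root[MAX_OID_LEN];
	size_t			root_len = MAX_OID_LEN;

	init_snmp("bulk-sketch");

	snmp_sess_init(&session);
	session.peername = "myswitch";			/* hypothetical target */
	session.version = SNMP_VERSION_2c;
	session.community = (u_char *)"public";
	session.community_len = strlen("public");

	if (NULL == (ss = snmp_open(&session)))
		return 1;

	pdu = snmp_pdu_create(SNMP_MSG_GETBULK);	/* GetBulkRequest-PDU */
	pdu->non_repeaters = 0;
	pdu->max_repetitions = 10;			/* the field discussed below */

	if (NULL == snmp_parse_oid("IF-MIB::ifInOctets", root, &root_len))
		return 1;

	snmp_add_null_var(pdu, root, root_len);

	/* up to max_repetitions successors of the OID arrive in one response */
	if (STAT_SUCCESS == snmp_synch_response(ss, pdu, &response) &&
			SNMP_ERR_NOERROR == response->errstat)
	{
		for (var = response->variables; NULL != var; var = var->next_variable)
			print_variable(var->name, var->name_length, var);
	}

	if (NULL != response)
		snmp_free_pdu(response);

	snmp_close(ss);

	return 0;
}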

Let's start with the first case. As mentioned on the Net-SNMP mailing list above, if we query a device for multiple values, it can either (a) return a proper response, (b) return a "tooBig(1)" error, or (c) not respond at all, yielding a timeout.

So we have to find the optimal number of values to query for each device. As currently implemented in the development branch, we start cautiously with 1 value per query. If that succeeds, the next time we query 2 values at once; if that succeeds, we query 3, and so on, up to 128 values in a single request.

If we query N values and the request fails (we either get a "tooBig(1)" error or a timeout), the general approach is to retry with N/2 values and remember that we should not try requesting N values again, but let N-1 be our maximum.

The approach is summarized more precisely in the following comment in the source code:

/* Since we are trying to obtain multiple values from the SNMP agent, the response that it has to  */
/* generate might be too big. It seems to be required by the SNMP standard that in such cases the  */
/* error status should be set to "tooBig(1)". However, some devices simply do not respond to such  */
/* queries and we get a timeout. Moreover, some devices exhibit both behaviors - they either send  */
/* "tooBig(1)" or do not respond at all. So what we do is halve the number of variables to query - */
/* it should work in the vast majority of cases, because, since we are now querying "num" values,  */
/* we know that querying "num/2" values succeeded previously. The case where it can still fail due */
/* to exceeded maximum response size is if we are now querying values that are unusually large. So */
/* if querying with half the number of the last values does not work either, we resort to querying */
/* values one by one, and the next time configuration cache gives us items to query, it will give  */
/* us less. */

So the expected server behavior is that it starts cautiously, with a higher load; then the load is gradually (but rather quickly) reduced as more and more values are queried in a single request, and it finally stabilizes once the limit for each device is determined.

Now, this is good, but would it be possible to come up with a better approach? The question is also relevant for GetBulkRequest-PDU, because some devices time out if the max-repetitions field is too high. If we implement the same "start with 1 and increase" strategy for walking, the problem is that the bulk effect will not be observed as soon as with regular polling, because items like discovery rules are meant to run less frequently. So, if we do, the limit counter should probably be shared with the one for GetRequest-PDU.

Currently, for GetBulkRequest-PDU, the max-repetitions field is hardcoded to 10 and this is something I would like to discuss with the reviewer and improve.

wiper I think we can safely use the same logic for the max-repetitions increase. While the bulk effect might build up more slowly than with regular polling, it also means there are fewer requests - so less load on the target system.

And, as discussed, instead of increasing the number of values per request by 1, we might multiply it by 1.5 to hit the limit faster.

asaveljevs I have implemented the strategy of multiplying by 1.5 and applied that to "max-repetitions", too, in r43428, r43429 and r43432. RESOLVED.

wiper A nice optimization would be to multiply the number by 1.5 until the first failure, then drop back to the last number that succeeded and continue incrementing it by 1 until it fails.
REOPENED

asaveljevs RESOLVED in r43461.

wiper CLOSED
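
To summarize the resulting strategy in code, here is a minimal sketch (illustrative only, not the actual Zabbix implementation): grow the batch by roughly a factor of 1.5 until the first failure, drop back to the last size that worked and probe upward by 1, and on any later "tooBig(1)" or timeout halve the batch while capping the maximum below the failing size.

#define SKETCH_MAX_SNMP_ITEMS	128	/* upper bound from the discussion above */

#define SKETCH_MIN(a, b)	((a) < (b) ? (a) : (b))
#define SKETCH_MAX(a, b)	((a) > (b) ? (a) : (b))

typedef struct
{
	int	max_vars;	/* number of values to put in the next request */
	int	last_ok;	/* last batch size that succeeded */
	int	limit;		/* learned cap; never request this many values again */
	int	probing;	/* 1 while still growing by 1.5, 0 after the first failure */
}
sketch_bulk_state_t;

static void	sketch_init(sketch_bulk_state_t *s)
{
	s->max_vars = 1;	/* start cautiously with 1 value per query */
	s->last_ok = 1;
	s->limit = SKETCH_MAX_SNMP_ITEMS;
	s->probing = 1;
}

static void	sketch_on_success(sketch_bulk_state_t *s)
{
	s->last_ok = s->max_vars;

	if (s->max_vars >= s->limit)
		return;

	if (1 == s->probing)	/* multiply by 1.5 (+1 so it cannot get stuck at 1) */
		s->max_vars = SKETCH_MIN(s->max_vars * 3 / 2 + 1, s->limit);
	else			/* after the first failure, increment by 1 */
		s->max_vars++;
}

static void	sketch_on_failure(sketch_bulk_state_t *s)	/* "tooBig(1)" or timeout */
{
	s->limit = SKETCH_MAX(s->max_vars - 1, 1);	/* do not try this size again */

	if (1 == s->probing)
	{
		s->probing = 0;
		s->max_vars = SKETCH_MAX(s->last_ok, 1);	/* drop back to the last size that worked */
	}
	else
		s->max_vars = SKETCH_MAX(s->max_vars / 2, 1);	/* retry with half the values */
}

As agreed above, the same counter would also drive the max-repetitions field of GetBulkRequest-PDUs.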

Comment by Aleksandrs Saveljevs [ 2014 Feb 12 ]

(2) Another issue to discuss is item scheduling. Up until now we used two strategies for spreading items in time: based on item ID and based on interface ID (for JMX items and, since ZBX-7649, for ICMP pings). In the development branch the seed for scheduling SNMP items was changed from item ID to interface ID, so all items on the interface will be queried at the same time.

This is good and should work in most cases. However, with large switches like the Cisco Nexus 9000, with hundreds or thousands of ports, there can be hundreds of thousands of items on a single host. Querying all of them at the same time is not ideal either.

So scheduling should be improved, and so far two approaches have been thought of (a bit).

One is to schedule items based on "itemid - itemid % modulo", where "modulo" is small for interfaces with few items and large for interfaces with a large number of items.

Another is to specify in the server configuration file how many pollers can process items for a single interface. For instance, at most 5 pollers per interface, so that hundreds of pollers do not assault a single device.

This issue is likely to be split into a separate ZBX or ZBXNEXT.

wiper Regarding the first idea - we could take interfaceid and itemid % modulo (or itemid & bitmask), then calculate a 32-bit checksum which would be used as the seed for the nextcheck calculation. And, as proposed, the modulo value would depend on the number of items on the interface.
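
A hedged sketch of that seed calculation (the checksum mix and the modulo policy here are illustrative assumptions, not the function Zabbix ended up using):

#include <stdint.h>

/* illustrative policy only: a small modulo keeps a small interface in one   */
/* chunk, a larger modulo gives a big interface bigger (but bounded) chunks, */
/* so its items are spread across several poll slots                         */
static uint64_t	sketch_modulo(int items_on_interface)
{
	if (items_on_interface <= 16)
		return 16;

	return 256;
}

/* mix interfaceid and the item's bucket into a 32-bit seed for nextcheck */
static uint32_t	sketch_poll_seed(uint64_t interfaceid, uint64_t itemid, int items_on_interface)
{
	uint64_t	modulo = sketch_modulo(items_on_interface);
	uint64_t	bucket = itemid - itemid % modulo;
	uint32_t	seed;

	seed = (uint32_t)(interfaceid ^ (interfaceid >> 32));
	seed ^= (uint32_t)(bucket ^ (bucket >> 32));

	return seed * 2654435761u;	/* Knuth's multiplicative hash constant */
}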

asaveljevs Issue split out to ZBXNEXT-2200. CLOSED.

Comment by Dimitri Bellini [ 2014 Feb 12 ]

I'm glad to see improvement on SNMP bulk, thanks so much!

PS: I understand the problems related to massive polling of a single host (single IP interface for polling), because we have the same problem on Brocade switches with nearly 300 ports (10 items per port).
I agree with your suggestion of a Zabbix configuration parameter to limit the pollers for a single interface.

Comment by richlv [ 2014 Mar 03 ]

(3) as for the docs - whatsnew of course; also the technical details from the description above must be added in some appropriate place, they are very useful

asaveljevs Updated documentation at the following locations:

asaveljevs RESOLVED.

wiper reviewed, CLOSED

Comment by Andris Zeila [ 2014 Mar 06 ]

(4) Processing of a single SNMP (also Java) type item differs slightly from bulk processing. In the first case the call chain looks like get_value() -> get_value_snmp() -> get_values_snmp(), while in the second case get_values_snmp() is called directly from get_values().

We could call get_values_snmp() directly also when a single SNMP item is retrieved by the poller, making the process flow easier to understand.

In this case we might also consider removing SNMP (and likewise Java) item processing from get_value().

asaveljevs RESOLVED in r43387. I have left get_value_snmp() and get_value_java(), because they are useful convenience functions - the first one is used in network discovery and the second in internal checks.
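
For illustration, such a convenience wrapper can be as thin as the following sketch (the parameter types and the exact get_values_snmp() signature are assumptions based on the function names above):

/* single-item entry point implemented on top of the bulk one, so that */
/* network discovery and the poller share the same code path           */
static int	sketch_get_value_snmp(const DC_ITEM *item, AGENT_RESULT *result)
{
	int	errcode = SUCCEED;

	/* the bulk function takes parallel arrays of items, results */
	/* and error codes; here each array has length 1             */
	get_values_snmp(item, result, &errcode, 1);

	return errcode;
}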

wiper CLOSED

Comment by Andris Zeila [ 2014 Mar 06 ]

(5) dbconfig.c:3098

	return memcmp(&s1->snmp_community, &s2->snmp_community, 5 * ZBX_PTR_SIZE + 4 * sizeof(unsigned char));

It might be better to perform separate checks for each structure member to avoid any problems with member padding and future changes in structure members.
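
A sketch of the member-by-member version (the structure and member names are assumptions chosen to match "5 * ZBX_PTR_SIZE + 4 * sizeof(unsigned char)"; pointer comparison is assumed to be sufficient because the strings are interned in the configuration cache, which is what memcmp() over the pointer members already relied on):

typedef struct
{
	const char	*snmp_community;
	const char	*snmpv3_securityname;
	const char	*snmpv3_authpassphrase;
	const char	*snmpv3_privpassphrase;
	const char	*snmpv3_contextname;
	unsigned char	snmpv3_securitylevel;
	unsigned char	snmpv3_authprotocol;
	unsigned char	snmpv3_privprotocol;
	unsigned char	snmp_bulk;
}
sketch_snmp_settings_t;

/* compare members one by one instead of memcmp() over a raw byte range, */
/* so struct padding and future member changes cannot break the result   */
static int	sketch_snmp_settings_equal(const sketch_snmp_settings_t *s1, const sketch_snmp_settings_t *s2)
{
	return s1->snmp_community == s2->snmp_community &&
			s1->snmpv3_securityname == s2->snmpv3_securityname &&
			s1->snmpv3_authpassphrase == s2->snmpv3_authpassphrase &&
			s1->snmpv3_privpassphrase == s2->snmpv3_privpassphrase &&
			s1->snmpv3_contextname == s2->snmpv3_contextname &&
			s1->snmpv3_securitylevel == s2->snmpv3_securitylevel &&
			s1->snmpv3_authprotocol == s2->snmpv3_authprotocol &&
			s1->snmpv3_privprotocol == s2->snmpv3_privprotocol &&
			s1->snmp_bulk == s2->snmp_bulk;
}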

asaveljevs RESOLVED in r43391 and r43392.

wiper CLOSED

Comment by Andris Zeila [ 2014 Mar 13 ]

(6) Instead of adding a lock parameter to the DCconfig_get_interface_snmp_stats() function, it would be better to move its implementation, without locking, to a static function and call that directly from DCconfig_get_poller_items(), while leaving DCconfig_get_interface_snmp_stats() as a public (and locking) wrapper around it.
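
As a sketch of that shape (a plain pthread mutex stands in here for the configuration cache lock; names follow the comment above, bodies are illustrative):

#include <stdint.h>
#include <pthread.h>

static pthread_mutex_t	config_lock = PTHREAD_MUTEX_INITIALIZER;

/* all the real work, without locking: callers that already hold the */
/* lock, like DCconfig_get_poller_items(), call this one directly    */
static void	dc_get_interface_snmp_stats(uint64_t interfaceid, int *max_snmp_succeed, int *min_snmp_fail)
{
	/* ... read the per-interface statistics from the configuration cache ... */
}

/* the public function remains a locking wrapper around the helper */
void	DCconfig_get_interface_snmp_stats(uint64_t interfaceid, int *max_snmp_succeed, int *min_snmp_fail)
{
	pthread_mutex_lock(&config_lock);
	dc_get_interface_snmp_stats(interfaceid, max_snmp_succeed, min_snmp_fail);
	pthread_mutex_unlock(&config_lock);
}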

asaveljevs RESOLVED in r43464.

wiper CLOSED

Comment by Andris Zeila [ 2014 Mar 14 ]

Looks good, tested.

Comment by Aleksandrs Saveljevs [ 2014 Mar 14 ]

Available in pre-2.2.3 r43468 and pre-2.3.0 (trunk) r43469.

Comment by Raymond Kuiper [ 2014 Apr 15 ]

Just wanted to mention that this has reduced the number of entries in our firewall session tables dramatically; thank you very much for implementing this feature!!

Comment by jpka [ 2014 Apr 18 ]

Please make this feature optional.
Related reading:
https://www.zabbix.com/forum/showthread.php?t=45001

Comment by Lane Bryson [ 2014 Apr 29 ]

I echo the sentiment of jpka... it's killing my NetApps, to the point that I can't monitor them with Zabbix. jpka, is there a ticket for your issue? I'll vote on it.

Comment by Aleksandrs Saveljevs [ 2014 Apr 30 ]

Lane, jpka's issue is ZBX-8145, but it is not related to NetApp. I would propose investigating this issue in a forum thread and opening a ZBX issue once we find out the cause.

This thread might be most suitable: https://www.zabbix.com/forum/showthread.php?t=45200 .
