[ZBX-12956] queue calculation may give false positives for not fast *bulk* snmp operations Created: 2017 Oct 27  Updated: 2025 Jun 11  Resolved: 2017 Nov 13

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: None
Affects Version/s: 3.4.3
Fix Version/s: None

Type: Problem report Priority: Major
Reporter: Oleksii Zagorskyi Assignee: Unassigned
Resolution: Duplicate Votes: 1
Labels: bulk, queue, scheduling, snmpbulk
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File pollers.png, PNG File queue.png
Issue Links:
    Duplicate
    Sub-task
        part of ZBXNEXT-4103 provide a way to work in BULK mode fo... (Need info)
Team: Team C
Sprint: Sprint 20, Sprint 21

 Description   

Not a bug, but...

Just FYI: I'm using a patch from ZBXNEXT-4103 to give my SNMPv3 device a chance to work in bulk mode, but that is not directly related to this case.

But when I use it (bulk really works, with caveats), I see unexpected queue behavior - the queue keeps jumping.

When I attempted to capture SNMP traffic, I could see that the maximum number of OIDs my device can handle per request is close to ~60, and as far as I could figure out, it depends on the size of the data the device has to reply with.

Zabbix nicely splits the request into two parts and repeats it within the same "snmp session" when it gets "error-status: tooBig".
That, plus the fact that my device is probably not that fast, means that polling the whole host may take up to 10-20 seconds.
The host has 1000 items with a 60-second update interval.
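To illustrate why this adds up, here is a minimal sketch (not the actual Zabbix source; the device limit and the counting are my assumptions) of the splitting behaviour described above: when the device answers tooBig, the request is halved and retried, so one host's worth of bulk items can cost many round trips to a slow device.

#include <stdio.h>

/* Sketch only: count how many request/response round trips it takes to poll
 * total_vars variables when the device rejects anything above device_limit
 * variables with tooBig and the poller halves the batch on each rejection. */
static int count_round_trips(int total_vars, int device_limit)
{
    int round_trips = 0, batch = total_vars;

    while (0 < total_vars)
    {
        if (batch > total_vars)
            batch = total_vars;

        if (batch > device_limit)
        {
            round_trips++;      /* the oversized attempt still costs a round trip */
            batch /= 2;         /* split in half and retry, as described above */
            continue;
        }

        round_trips++;          /* this portion succeeded */
        total_vars -= batch;
    }

    return round_trips;
}

int main(void)
{
    /* assumed numbers from the description: 1000 items on one interface,
     * device tops out at roughly 60 variables per reply */
    printf("round trips: %d\n", count_round_trips(1000, 60));
    return 0;
}

Multiply the resulting number of round trips by the per-request latency of a slow device and the 10-20 second polling window reported above is easy to reach.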

Internal monitoring: if the update interval of the queue measurement item is set to 3 seconds, those spikes can be seen clearly. See the attached graph.

Yes, I know that scheduling of such items depends on the interfaceID, so the scheduled time is the same for all 1000 items.

With 4 such hosts and many pollers it does not get any better.
And I need to monitor many more such SNMP hosts, with many more items per host.

The request is to reconsider this behavior and possibly improve some part of it (scheduling or queue calculation).

I have a bunch of SNMP traffic captures with different numbers of pollers and with 1 or 4 hosts, captured from server start through the following 3-4 polling batches. They can be provided on request.



 Comments   
Comment by Andris Zeila [ 2017 Oct 30 ]

Ideally we should take into account dc_interface->max_snmp_succeed when calculating the seed for nextchecks in the get_item_nextcheck_seed() function. Instead of returning interfaceid when bulk is enabled, we should return some hash based on the interfaceid, the number of items using this interface and the max_snmp_succeed value. We do keep lists of SNMP items by interface, so it should be possible. Something like interfaceid * itemid % (number of items / max_snmp_succeed) (when number of items > max_snmp_succeed).
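A rough sketch of that idea (hypothetical names, not a patch; it only transcribes the formula above):

#include <stdint.h>

/* Sketch: for bulk SNMP items, instead of returning the bare interfaceid as
 * the nextcheck seed, spread the interface's items over
 * items_num / max_snmp_succeed buckets so that they are not all scheduled
 * for the same second. */
static uint64_t bulk_nextcheck_seed(uint64_t interfaceid, uint64_t itemid,
        int items_num, int max_snmp_succeed)
{
    int buckets;

    /* everything fits into a single bulk request - keep the current behaviour */
    if (0 >= max_snmp_succeed || items_num <= max_snmp_succeed)
        return interfaceid;

    buckets = items_num / max_snmp_succeed;

    /* the literal formula suggested above */
    return interfaceid * itemid % (uint64_t)buckets;
}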

Comment by Glebs Ivanovskis (Inactive) [ 2017 Nov 03 ]

Can someone tell me if this is somehow related to ZBXNEXT-3988?

zalex_ua I'd say ZBXNEXT-3988 has a "top level" influence on this case as well, but the issue described here is an independent, specific one.

glebs.ivanovskis Thank you!

Comment by Glebs Ivanovskis (Inactive) [ 2017 Nov 08 ]

As a workaround, there is an optional <from> parameter in zabbix[queue,<from>,<to>]; it can be increased (the default value is 6 seconds) to make queue readings more stable. This will introduce some latency in detecting delayed checks, but <from> does not need to be very high to account for slow SNMP polling, so the latency will be tolerable.
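For example (the values are only an illustration, and if I am not mistaken time suffixes are accepted for these parameters):

    zabbix[queue]           - items delayed by at least 6 seconds (the default)
    zabbix[queue,30s]       - items delayed by at least 30 seconds
    zabbix[queue,30s,10m]   - items delayed by more than 30 seconds but less than 10 minutes

Raising <from> above the worst-case duration of one polling batch should hide the spikes described in this issue.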

Comment by Rostislav Palivoda (Inactive) [ 2017 Nov 13 ]

Continues under ZBXNEXT-4103

Comment by Oleksii Zagorskyi [ 2018 Jul 12 ]

Just want to leave a note here regarding my statement about "data size the device should reply by".

Here is example output of the "show snmp" management command on a Cisco switch:

Router# show snmp

Chassis: 01234567
37 SNMP packets input
    0 Bad SNMP version errors
    4 Unknown community name
    0 Illegal operation for community name supplied
    0 Encoding errors
    24 Number of requested variables
    0 Number of altered variables
    0 Get-request PDUs
    28 Get-next PDUs
    0 Set-request PDUs
78 SNMP packets output
    0 Too big errors (Maximum packet size 1500)
    0 No such name errors
    0 Bad values errors
    0 General errors
    24 Response PDUs
    13 Trap PDUs

SNMP logging: enabled
    Logging to 192.168.1.1.162, 0/10, 13 sent, 0 dropped.
SNMP Manager-role output packets
    4 Get-request PDUs
    4 Get-next PDUs
    6 Get-bulk PDUs
    4 Set-request PDUs
    23 Inform-request PDUs
    30 Timeouts
    0 Drops
SNMP Manager-role input packets
    0 Inform response PDUs
    2 Trap PDUs
    7 Response PDUs
    1 Responses with errors

SNMP informs: enabled
    Informs in flight 0/25 (current/max)
    Logging to 192.168.1.1.162
        4 sent, 0 in-flight, 1 retries, 0 failed, 0 dropped
    Logging to 192.168.1.1.162
        0 sent, 0 in-flight, 0 retries, 0 failed, 0 dropped

Please note this part:

78 SNMP packets output
    0 Too big errors (Maximum packet size 1500)

It looks like this confirms what I supposed initially: if the prepared reply would be larger than 1500 bytes (which depends on the values, whose lengths are variable/unpredictable), the SNMP device replies with a tooBig error.
Not because of the number of OIDs, but because of the reply size in bytes!
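A rough back-of-the-envelope check (the per-varbind size is my guess, not a measured value): with IP/UDP/SNMP headers taking on the order of 100 bytes and each encoded varbind (OID + value) averaging roughly 20-25 bytes, about (1500 - 100) / 24 ≈ 58 variables fit into a 1500-byte reply, which is suspiciously close to the ~60 OID ceiling I observed.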
