[ZBXNEXT-2200] approach to spreading items in time should be improved Created: 2014 Mar 12 Updated: 2019 Jun 22 |
|
Status: | Open |
Project: | ZABBIX FEATURE REQUESTS |
Component/s: | Proxy (P), Server (S) |
Affects Version/s: | 2.2.2 |
Fix Version/s: | None |
Type: | Change Request | Priority: | Minor |
Reporter: | Aleksandrs Saveljevs | Assignee: | Unassigned |
Resolution: | Unresolved | Votes: | 7 |
Labels: | items, pollers, scheduling | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified |
Attachments: |
Issue Links: |
|
Description |
This task is a continuation of an earlier issue. We currently use two strategies for spreading items in time: one based on item ID and one based on interface ID (used for JMX items).

This works well in most cases. However, with SNMP and large switches like the Cisco Nexus 9000, which have hundreds or thousands of ports, a single host can end up with hundreds of thousands of items. Querying all of these items at the same time is not ideal, so scheduling should be improved. Two approaches have been considered so far.

One is to schedule items based on "itemid - itemid % modulo", where "modulo" is small for interfaces with few items and large for interfaces with many items.

The other is to specify in the server configuration file how many pollers may process items for a single interface: for instance, at most 5 pollers per interface, so that hundreds of pollers do not assault a single device. |
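The first approach could look roughly like the sketch below. This is not Zabbix source code; the tier thresholds and the `pick_modulo`/`schedule_group` names are hypothetical, chosen only to illustrate "itemid - itemid % modulo" with a modulo that grows with the item count on the interface.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tiering: interfaces with few items get a small modulo,
 * interfaces with many items get a large one. Thresholds are made up. */
static uint64_t pick_modulo(int items_on_interface)
{
    if (items_on_interface < 100)
        return 1;       /* effectively no grouping */
    if (items_on_interface < 10000)
        return 16;
    return 256;         /* huge switches: large scheduling groups */
}

/* Items whose IDs fall into the same block of size "modulo" share the
 * same scheduling base ("itemid - itemid % modulo"), which can then be
 * used as the seed for nextcheck calculation. */
static uint64_t schedule_group(uint64_t itemid, int items_on_interface)
{
    uint64_t modulo = pick_modulo(items_on_interface);

    return itemid - itemid % modulo;
}
```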
Comments |
Comment by Aleksandrs Saveljevs [ 2014 Mar 12 ] |
Regarding the first idea, wiper considers that "we could take interfaceid and itemid % modulo (or itemid & bitmask), then calculate 32 bit checksum which would be used as the seed for nextcheck calculation. And like it was proposed the modulo value would depend on the number of items on the interface." |
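One way to read wiper's suggestion is sketched below. The FNV-1a hash is an arbitrary stand-in for "32 bit checksum" (the comment does not name one), and the nextcheck formula simply mirrors the classic fixed-offset-within-delay-interval scheme; none of this is taken from Zabbix source.

```c
#include <stdint.h>
#include <time.h>

/* 32-bit FNV-1a over interfaceid and the item's slot (itemid % modulo).
 * Any 32-bit checksum would do; FNV-1a is just an illustrative choice. */
static uint32_t checksum32(uint64_t interfaceid, uint64_t item_slot)
{
    uint32_t h = 2166136261u;
    uint8_t buf[16];
    int i;

    for (i = 0; i < 8; i++)
        buf[i] = (uint8_t)(interfaceid >> (8 * i));
    for (i = 0; i < 8; i++)
        buf[8 + i] = (uint8_t)(item_slot >> (8 * i));

    for (i = 0; i < 16; i++)
    {
        h ^= buf[i];
        h *= 16777619u;
    }

    return h;
}

/* The checksum seeds a fixed offset within each delay interval, so items
 * of one interface are spread over "modulo" distinct time slots instead
 * of all firing at the interface's single slot. */
static time_t calculate_nextcheck(time_t now, int delay,
        uint64_t interfaceid, uint64_t itemid, uint64_t modulo)
{
    uint32_t seed = checksum32(interfaceid, itemid % modulo);
    time_t nextcheck = delay * (now / delay) + seed % (uint32_t)delay;

    if (nextcheck <= now)
        nextcheck += delay;

    return nextcheck;
}
```

With a larger modulo on item-heavy interfaces, more distinct seeds exist per interface, so its items land in more time slots.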
Comment by Filipe Paternot [ 2015 Sep 24 ] |
The second idea seems more suitable. Because of the low priority SNMP gets on most equipment, if you start multiple queries at the same time, most of them are likely to fail due to concurrency. On busy equipment (with tens of thousands of items) we are talking about hundreds of requests per second. The device simply can't reply to them all at the low priority SNMP has (and that is fine, as it should do its primary business: route packets, serve files, be a load balancer...). So limiting the number of concurrent SNMP requests seems to be the right approach for this kind of equipment. But we should do this with care, as it may impact performance on other, smaller hosts. Perhaps we should create a threshold such that if interface_items > X, then use up to 5 pollers; else, keep the default behaviour. |
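The threshold idea in this comment amounts to a tiny decision function. A minimal sketch, where the threshold and both limits are hypothetical placeholders rather than real Zabbix configuration options:

```c
#include <assert.h>

#define LARGE_INTERFACE_ITEMS   10000   /* the "X" from the comment above */
#define MAX_POLLERS_CAPPED      5       /* cap for item-heavy interfaces */
#define MAX_POLLERS_DEFAULT     100     /* stand-in for the default limit */

/* Cap concurrent pollers only for large interfaces, so that small hosts
 * keep the default behaviour and are not slowed down by the cap. */
static int max_pollers_for_interface(int interface_items)
{
    if (interface_items > LARGE_INTERFACE_ITEMS)
        return MAX_POLLERS_CAPPED;

    return MAX_POLLERS_DEFAULT;
}
```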
Comment by Backoffice Team [ 2019 Jun 22 ] |
A bit of current data to corroborate why this is a critical issue for large environments: I've attached two files with NVPS from one proxy with 13 hosts and 169,061 items.
With an estimated ~80 NVPS, it manages to hit 8k NVPS for brief periods a couple of times a day. We assume this happens mostly because item polling is not spread across time, so there are a lot of concurrent snmpgets in a short window and we get a lot of `SNMP agent item XYZ failed: first network error, ...`, giving us a console like this:
[root@server ~]# [PRODUCTION] docker-compose -f file.yaml logs --tail 10000 proxy | grep -c 'failed: first network error, wait for 20 seconds'
3671
[root@server ~]# [PRODUCTION]
This is bad both for Zabbix (spikes, timeout handling and so on) and for the monitored host (far more GET requests than it can handle).
Perhaps this ticket can be addressed soon. I hear v5.0 will be about scalability, so maybe there will be room to monitor large hosts there too. |