[ZBXNEXT-2200] approach to spreading items in time should be improved Created: 2014 Mar 12  Updated: 2019 Jun 22

Status: Open
Project: ZABBIX FEATURE REQUESTS
Component/s: Proxy (P), Server (S)
Affects Version/s: 2.2.2
Fix Version/s: None

Type: Change Request Priority: Minor
Reporter: Aleksandrs Saveljevs Assignee: Unassigned
Resolution: Unresolved Votes: 7
Labels: items, pollers, scheduling
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File zbxnext-2200-3h.png     PNG File zbxnext-2200.png    
Issue Links:
Duplicate
is duplicated by ZBX-8826 zabbix 2.2 snmp monitoring data loss ... Closed

 Description   

This task is a continuation of ZBXNEXT-98 and is meant to deal with new challenges that were brought up by its implementation.

We currently use two strategies for spreading items in time: based on item ID, and based on interface ID: for JMX items (since ZBXNEXT-555), for SNMP items (since ZBXNEXT-98), and for ICMP pings (since ZBX-7649).

This is good and should work in most cases. However, in the case of SNMP, large switches like the Cisco Nexus 9000, which have hundreds or thousands of ports, can produce hundreds of thousands of items on a single host. Querying all of these items at the same time is not ideal.

So scheduling should be improved, and two approaches have been proposed so far.

One is to schedule items based on "itemid - itemid % modulo", where "modulo" is small for interfaces with few items and large for interfaces with a large number of items.
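A minimal sketch of this first approach (assuming the usual delay-based nextcheck scheme; the function names and the way the offset is derived are illustrative, not Zabbix source):

```python
# Sketch: spread items using "itemid - itemid % modulo" as the scheduling
# seed, so items fall into blocks of size `modulo` that share a time slot
# instead of every item getting its own offset (or all firing together).

def group_seed(itemid: int, modulo: int) -> int:
    """All itemids in the same block of `modulo` get the same seed."""
    return itemid - itemid % modulo

def nextcheck(itemid: int, delay: int, now: int, modulo: int) -> int:
    # The offset within the polling interval is derived from the group
    # seed, mirroring a "delay * (now / delay) + offset" style scheme.
    offset = group_seed(itemid, modulo) % delay
    nc = delay * (now // delay) + offset
    if nc <= now:
        nc += delay
    return nc
```

With a large `modulo` for an interface with many items, many items share one slot and the interface is queried in a few bursts; with `modulo = 1` the behaviour degenerates to per-item spreading.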

Another is to specify in the server configuration file how many pollers can process items for a single interface. For instance, at most 5 pollers per interface, so that hundreds of pollers do not assault a single device.



 Comments   
Comment by Aleksandrs Saveljevs [ 2014 Mar 12 ]

Regarding the first idea, wiper considers that "we could take interfaceid and itemid % modulo (or itemid & bitmask), then calculate 32 bit checksum which would be used as the seed for nextcheck calculation. And like it was proposed the modulo value would depend on the number of items on the interface."
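This suggestion could be sketched as follows (assumptions: CRC32 stands in for the unspecified 32-bit checksum, and the modulo thresholds are invented for illustration):

```python
# Sketch of the checksum-seed idea: combine interfaceid with
# itemid % modulo, hash the pair to a 32-bit value, and use that
# value as the seed for the nextcheck calculation.

import struct
import zlib

def checksum_seed(interfaceid: int, itemid: int, modulo: int) -> int:
    # Pack both values into a fixed-width byte string and checksum it.
    packed = struct.pack("<QQ", interfaceid, itemid % modulo)
    return zlib.crc32(packed)  # 32-bit seed

def choose_modulo(items_on_interface: int) -> int:
    # As proposed: small modulo for interfaces with few items, large
    # modulo for interfaces with many items (thresholds are invented).
    if items_on_interface < 100:
        return 1
    if items_on_interface < 10000:
        return 16
    return 256
```

Items on the same interface that fall into the same `itemid % modulo` class get the same seed, so they are scheduled together, while different interfaces still get distinct seeds.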

Comment by Filipe Paternot [ 2015 Sep 24 ]

The second idea seems more suitable. Due to the low priority most equipment gives to SNMP, if you start multiple queries at the same time, most of them are likely to fail due to concurrency.

On busy equipment (with tens of thousands of items) we are talking about hundreds of requests per second. The device simply cannot reply to them all at the low priority SNMP has (and that is fine, as it should focus on its primary business: routing packets, serving files, load balancing...).

So, limiting the number of concurrent SNMP requests seems to be the right approach for this kind of equipment. But we should do this with care, as it may impact performance on other, smaller hosts. Perhaps we should create a threshold: if interface_items > X, then use up to 5 pollers; otherwise, keep the default behaviour.
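The threshold idea could be sketched like this (all names and numbers are invented; Zabbix pollers are separate processes, so a real implementation would use shared-memory counters rather than in-process semaphores):

```python
# Sketch: cap the number of pollers that may query one interface at a
# time, but only for interfaces above an item-count threshold, leaving
# small hosts with the default (unlimited) behaviour.

import threading

LARGE_INTERFACE_ITEMS = 1000   # the "X" threshold (invented value)
MAX_POLLERS_PER_INTERFACE = 5  # cap for large interfaces (invented value)

class InterfaceLimiter:
    def __init__(self):
        self._semaphores = {}
        self._lock = threading.Lock()

    def acquire(self, interfaceid: int, items_on_interface: int) -> bool:
        """Return True if a poller may query this interface now."""
        if items_on_interface <= LARGE_INTERFACE_ITEMS:
            return True  # default behaviour: no limit for small interfaces
        with self._lock:
            sem = self._semaphores.setdefault(
                interfaceid,
                threading.BoundedSemaphore(MAX_POLLERS_PER_INTERFACE))
        # Non-blocking: a poller that cannot get a slot should move on
        # to another item instead of piling onto the busy device.
        return sem.acquire(blocking=False)

    def release(self, interfaceid: int, items_on_interface: int) -> None:
        if items_on_interface <= LARGE_INTERFACE_ITEMS:
            return
        with self._lock:
            sem = self._semaphores.get(interfaceid)
        if sem is not None:
            sem.release()
```

A poller would call `acquire()` before an SNMP query and `release()` afterwards; a sixth concurrent poller on a large interface is turned away rather than queued.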

Comment by Backoffice Team [ 2019 Jun 22 ]

A bit of current data to corroborate why this is a critical issue for large environments: I have attached two files with NVPS from one proxy with 13 hosts and 169061 items.

 

With an estimated ~80 NVPS, it manages to hit 8k NVPS for a brief period of time, a couple of times a day. We assume this happens mostly because item polling is not spread across time, so there are a lot of concurrent snmpgets in a short window and we get a lot of `SNMP agent item XYZ failed: first network error, ...` messages, giving us a console like this:

 

[root@server ~]# [PRODUCTION] docker-compose -f file.yaml logs --tail 10000 proxy | grep -c 'failed: first network error, wait for 20 seconds'

3671

[root@server ~]# [PRODUCTION]

 

This is bad both for Zabbix (spikes, timeout handling and so on) and for the monitored host (far more GETs than it can handle).

 

Perhaps this ticket can be addressed soon. I hear v5.0 will be about scalability, so maybe there will also be room for monitoring large hosts.

Generated at Sat Apr 05 15:47:22 EEST 2025 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.