[ZBXNEXT-2200] approach to spreading items in time should be improved Created: 2014 Mar 12 Updated: 2019 Jun 22 |
|
Status: | Open |
Project: | ZABBIX FEATURE REQUESTS |
Component/s: | Proxy (P), Server (S) |
Affects Version/s: | 2.2.2 |
Fix Version/s: | None |
Type: | Change Request | Priority: | Minor |
Reporter: | Aleksandrs Saveljevs | Assignee: | Unassigned |
Resolution: | Unresolved | Votes: | 7 |
Labels: | items, pollers, scheduling | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified |
Attachments: |
Issue Links: |
|
Description |
This task is a continuation of an earlier issue. We currently use two strategies for spreading items in time: one based on item ID and one based on interface ID (used for JMX items).

This works well in most cases. However, with SNMP and large switches like the Cisco Nexus 9000, which have hundreds or thousands of ports, a single host can end up with hundreds of thousands of items. Querying all of these items at the same time is not ideal, so scheduling should be improved. Two approaches have been considered so far.

One is to schedule items based on "itemid - itemid % modulo", where "modulo" is small for interfaces with few items and large for interfaces with many items.

The other is to specify in the server configuration file how many pollers may process items for a single interface: for instance, at most 5 pollers per interface, so that hundreds of pollers do not assault a single device. |
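The first approach could look roughly like the sketch below. This is not Zabbix source code; the tier thresholds and the `pick_modulo`/`schedule_group` names are hypothetical, chosen only to illustrate "itemid - itemid % modulo" with a modulo that grows with the item count on the interface.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tiering: interfaces with few items get a small modulo,
 * interfaces with many items get a large one. Thresholds are made up. */
static uint64_t pick_modulo(int items_on_interface)
{
    if (items_on_interface < 100)
        return 1;       /* effectively no grouping */
    if (items_on_interface < 10000)
        return 16;
    return 256;         /* huge switches: large scheduling groups */
}

/* Items whose IDs fall into the same block of size "modulo" share the
 * same scheduling base ("itemid - itemid % modulo"), which can then be
 * used as the seed for nextcheck calculation. */
static uint64_t schedule_group(uint64_t itemid, int items_on_interface)
{
    uint64_t modulo = pick_modulo(items_on_interface);

    return itemid - itemid % modulo;
}
```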
Comments |
Comment by Aleksandrs Saveljevs [ 2014 Mar 12 ] |
Regarding the first idea, wiper considers that "we could take interfaceid and itemid % modulo (or itemid & bitmask), then calculate 32 bit checksum which would be used as the seed for nextcheck calculation. And like it was proposed the modulo value would depend on the number of items on the interface." |
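One way to read wiper's suggestion is sketched below. The FNV-1a hash is an arbitrary stand-in for "32 bit checksum" (the comment does not name one), and the nextcheck formula simply mirrors the classic fixed-offset-within-delay-interval scheme; none of this is taken from Zabbix source.

```c
#include <stdint.h>
#include <time.h>

/* 32-bit FNV-1a over interfaceid and the item's slot (itemid % modulo).
 * Any 32-bit checksum would do; FNV-1a is just an illustrative choice. */
static uint32_t checksum32(uint64_t interfaceid, uint64_t item_slot)
{
    uint32_t h = 2166136261u;
    uint8_t buf[16];
    int i;

    for (i = 0; i < 8; i++)
        buf[i] = (uint8_t)(interfaceid >> (8 * i));
    for (i = 0; i < 8; i++)
        buf[8 + i] = (uint8_t)(item_slot >> (8 * i));

    for (i = 0; i < 16; i++)
    {
        h ^= buf[i];
        h *= 16777619u;
    }

    return h;
}

/* The checksum seeds a fixed offset within each delay interval, so items
 * of one interface are spread over "modulo" distinct time slots instead
 * of all firing at the interface's single slot. */
static time_t calculate_nextcheck(time_t now, int delay,
        uint64_t interfaceid, uint64_t itemid, uint64_t modulo)
{
    uint32_t seed = checksum32(interfaceid, itemid % modulo);
    time_t nextcheck = delay * (now / delay) + seed % (uint32_t)delay;

    if (nextcheck <= now)
        nextcheck += delay;

    return nextcheck;
}
```

With a larger modulo on item-heavy interfaces, more distinct seeds exist per interface, so its items land in more time slots.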
Comment by Filipe Paternot [ 2015 Sep 24 ] |
The second idea seems more suitable. Because of the low priority SNMP gets on most equipment, if you start multiple queries at the same time, most of them are likely to fail due to concurrency. On busy equipment (with tens of thousands of items) we are talking about hundreds of requests per second. The device simply can't reply to them all at the low priority SNMP has (and that is fine, as it should do its primary business: route packets, serve files, be a load balancer...). So limiting the number of concurrent SNMP requests seems to be the right approach for this kind of equipment. But we should do this with care, as it may impact performance on other, smaller hosts. Perhaps we should create a threshold such that if interface_items > X, then use up to 5 pollers; else, keep the default behaviour. |
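The threshold idea in this comment amounts to a tiny decision function. A minimal sketch, where the threshold and both limits are hypothetical placeholders rather than real Zabbix configuration options:

```c
#include <assert.h>

#define LARGE_INTERFACE_ITEMS   10000   /* the "X" from the comment above */
#define MAX_POLLERS_CAPPED      5       /* cap for item-heavy interfaces */
#define MAX_POLLERS_DEFAULT     100     /* stand-in for the default limit */

/* Cap concurrent pollers only for large interfaces, so that small hosts
 * keep the default behaviour and are not slowed down by the cap. */
static int max_pollers_for_interface(int interface_items)
{
    if (interface_items > LARGE_INTERFACE_ITEMS)
        return MAX_POLLERS_CAPPED;

    return MAX_POLLERS_DEFAULT;
}
```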
Comment by Backoffice Team [ 2019 Jun 22 ] |
A bit of current data to corroborate why this is a critical issue for large environments: I've attached two files with NVPS from one proxy with 13 hosts and 169,061 items.
With an estimated ~80 NVPS, it manages to hit 8k NVPS for brief periods a couple of times a day. We assume this happens mostly because item polling is not spread across time, so there are a lot of concurrent snmpgets in a short window and we get a lot of `SNMP agent item XYZ failed: first network error, ...`, giving us a console like this:
[root@server ~]# [PRODUCTION] docker-compose -f file.yaml logs --tail 10000 proxy | grep -c 'failed: first network error, wait for 20 seconds'
3671
[root@server ~]# [PRODUCTION]
This is bad both for Zabbix (spikes, timeout handling and so on) and for the monitored host (far more GET requests than it can handle).
Perhaps this ticket can be addressed soon. I hear v5.0 will be about scalability, so maybe there will be room to monitor large hosts there too. |