
Type: Change Request
Resolution: Unresolved
Priority: Minor
Fix Version/s: None
Affects Version/s: 3.4.9
Component/s: Server (S)
Labels:
None
Environment:
CentOS 7

I've begun adding a large number of network devices to a Zabbix installation. They are mostly Cisco and Arista switches and routers, and we care about things like port flaps, error counters, and bandwidth. Discovery templates are used for this, and any given device may have six to eight items per physical port, so a 48-port switch could have as many as 384 port-related items, plus items for hardware such as fans and temperature sensors. Let's say 400 items per device on average.

At that item count, I've found that several Cisco devices cannot return the data within the Zabbix maximum timeout of 30 seconds. I also have some Cisco switch stacks with several thousand OIDs to query, since all the switches in a stack report via one management IP.

Arista gear seems to return the data faster, possibly due to a better SNMPv3 implementation or simply faster CPUs; their 48-port devices don't seem to have an issue getting the data out within Zabbix's timeout window. With higher-port-count chassis, though, you ultimately run into the same problem: the data does not arrive before the timeout.

The end result is that you cannot reliably use SNMPv3 with Zabbix and a reasonable number of OIDs on switches with roughly 24 or more ports (Cisco) or 96 or more ports (Arista). Items constantly bounce between not supported and supported, so your data has gaps where polls were missed. I've had no choice but to downgrade to SNMPv2 to get around this.

My feature request contains a few components:

1) At the host level, add a configuration setting for "Maximum SNMPv3 Responses Per Second". Note that I said responses per second and not queries per second, because the issue is not the queries; it's the responses taking longer than Zabbix is willing to wait. Since Zabbix expresses its timeout in seconds, knowing a rough limit on how many responses per second a given device can output would allow Zabbix to multiply that limit by its timeout value, and then poll item OIDs in batches no larger than (maximum SNMPv3 responses per second * timeout in seconds) to ensure all the responses arrive before the timeout.
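As a rough illustration of the batch-size arithmetic in item 1, here is a minimal Python sketch. The function name and the rounding choice are my own assumptions and not anything Zabbix implements today; the figures of 50 responses per second and a 30 second timeout come from elsewhere in this request.

```python
import math


def max_oids_per_batch(responses_per_second: float, timeout_seconds: float) -> int:
    """Largest batch a device should be able to answer before the timeout.

    Hypothetical helper: both arguments are values an administrator would
    supply (the proposed per-host setting and the configured timeout).
    """
    if responses_per_second <= 0 or timeout_seconds <= 0:
        raise ValueError("rate and timeout must be positive")
    # Round down so the batch never needs longer than the timeout,
    # but always allow at least one OID per batch.
    return max(1, math.floor(responses_per_second * timeout_seconds))


if __name__ == "__main__":
    # 50 responses per second with a 30 second timeout
    print(max_oids_per_batch(50, 30))
```

In practice some headroom below this ceiling would probably be wise, since the device's response rate is only a rough measurement.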

2) To facilitate the above feature, SNMPv3 items would need an additional piece of internally calculated information: the polling bucket the item ends up in. When a maximum SNMPv3 responses per second value is assigned, Zabbix would calculate the maximum OIDs per polling cycle and divide the existing SNMPv3 items into polling buckets. For any given host, buckets cannot be polled concurrently; they must be polled sequentially. I've found that if you hit the same device with two bulk requests at the same time, the overall response time increases linearly, so the issue is likely just slow CPUs or a slow SNMPv3 implementation on the device.

Once the buckets are created and items assigned to them, the pollers can work through them in the same order each time, which should allow the assigned query frequencies to roughly align. If more accuracy were desired, the buckets could even be given offsets based on the minimum polling interval of the items on the respective host. For example, one bucket could execute on the minute and the other always 30 seconds after the minute, so that each bucket's one-minute items are successfully queried.
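The bucket assignment and offsets described in item 2 could look something like the following Python sketch. The `Bucket` type, the `build_buckets` name, and the even spacing of offsets across the polling interval are illustrative assumptions, not existing Zabbix internals.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Bucket:
    index: int             # fixed position in the sequential polling order
    offset_seconds: float  # start time relative to the top of the interval
    oids: List[str]        # items assigned to this bucket


def build_buckets(oids: List[str], max_batch: int, interval_seconds: float) -> List[Bucket]:
    """Split a host's OIDs into fixed-order buckets of at most max_batch items.

    Offsets are spread evenly over the polling interval so that buckets are
    started one after another rather than concurrently.
    """
    if max_batch < 1:
        raise ValueError("max_batch must be at least 1")
    chunks = [oids[i:i + max_batch] for i in range(0, len(oids), max_batch)]
    spacing = interval_seconds / len(chunks) if chunks else 0.0
    return [Bucket(i, i * spacing, chunk) for i, chunk in enumerate(chunks)]


if __name__ == "__main__":
    # 2400 placeholder items, batches of up to 1500, polled once per minute
    items = [f"item-{n}" for n in range(2400)]
    for bucket in build_buckets(items, max_batch=1500, interval_seconds=60):
        print(bucket.index, bucket.offset_seconds, len(bucket.oids))
```

With these inputs the sketch yields two buckets, one starting on the minute and one 30 seconds later. It does not check that the spacing between buckets is at least as long as the timeout; if it is shorter, the device cannot keep up and the warning from item 3 applies.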

3) If a given device has a number of items and polling frequencies that could never be met by the configured maximum responses per second, Zabbix should warn you. For example, let's say we have a big 10-slot chassis switch which, for the sake of easy math, has a 50-port Ethernet blade in each slot, so 500 ports. We query for ten values per port, so 5000 items. We query those items once per minute, so the device would need to sustain roughly 84 responses per second (5000 / 60) to keep up with Zabbix's queries. Let's say this device only outputs data at 50 OIDs per second; if you type 50 in, Zabbix could warn you that you are going to miss data because the device cannot output responses fast enough to keep up with the queries.

The above is not a far-fetched example. I have a network closet in an office building I was trying to monitor with SNMPv3; it's a stack of six Cisco 2960XRs with 48+2 ports each. I'm trying to collect eight OIDs per port (name, admin status, operational status, errors in, errors out, last state change, bytes in, bytes out), so this host in Zabbix has 2400 items. SNMPv3 takes several minutes to output that data, while SNMPv2 makes it in under the 30 second timeout I have set.
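The check behind the warning in item 3 is simple arithmetic, sketched here in Python. The function names and the warning text are hypothetical; the numbers in the usage example are the ones from the 10-slot chassis scenario.

```python
from typing import Optional


def required_rate(item_count: int, interval_seconds: float) -> float:
    """Responses per second the device must sustain to answer every item in time."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    return item_count / interval_seconds


def feasibility_warning(item_count: int, interval_seconds: float,
                        max_responses_per_second: float) -> Optional[str]:
    """Return a warning string if the device cannot keep up, else None."""
    needed = required_rate(item_count, interval_seconds)
    if needed > max_responses_per_second:
        return (f"{item_count} items every {interval_seconds:g}s need "
                f"{needed:.1f} responses/s, but the host is limited to "
                f"{max_responses_per_second:g}/s; polls will be missed")
    return None


if __name__ == "__main__":
    # 5000 items polled once per minute against a device limited to 50 OIDs/s
    print(feasibility_warning(5000, 60, 50))
```

A real implementation would have to sum the required rate over items with different polling intervals; this sketch assumes a single interval for all items on the host.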

4) To facilitate measuring what a given device is capable of, it would be extremely helpful if there were some way in the Zabbix web interface to go to a given host and display the list of OIDs it would normally query in a given polling cycle. The documentation could be updated to instruct people how to use snmpbulkget or snmpbulkwalk to benchmark the response time and come up with an expected number of responses per second.
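A benchmark along the lines of item 4 could be as simple as timing a command and counting its output lines, as in this Python sketch. The helper name is hypothetical, and it assumes the command prints one line per returned OID, which is roughly how snmpbulkwalk output looks but may not hold for values that wrap over several lines. The usage example runs a stand-in command so the sketch works without a device; in real use you would pass your own snmpbulkwalk or snmpbulkget invocation with the host's SNMPv3 credentials.

```python
import subprocess
import sys
import time
from typing import List, Tuple


def measure_responses_per_second(command: List[str]) -> Tuple[int, float]:
    """Run a command, count its non-empty output lines, and divide by wall time.

    Assumption: each non-empty line of stdout corresponds to one OID response.
    """
    start = time.perf_counter()
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    elapsed = time.perf_counter() - start
    count = sum(1 for line in result.stdout.splitlines() if line.strip())
    rate = count / elapsed if elapsed > 0 else float("inf")
    return count, rate


if __name__ == "__main__":
    # Stand-in command that prints three lines; replace it with the real
    # SNMP walk command for the device being benchmarked.
    stand_in = [sys.executable, "-c", "print('a'); print('b'); print('c')"]
    count, rate = measure_responses_per_second(stand_in)
    print(f"{count} responses, {rate:.1f} per second")
```

Running the measurement a few times and taking the lowest rate would give a more conservative value to enter as the per-host maximum.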
