[#ZBX-20744] False high-bandwidth monitoring values by snmp

[ZBX-20744] False high-bandwidth monitoring values by snmp Created: 2022 Mar 14 Updated: 2025 Nov 01 Resolved: 2025 Nov 01
Status:	Closed
Project:	ZABBIX BUGS AND ISSUES
Component/s:	Server (S)
Affects Version/s:	5.0.21
Fix Version/s:	None

Type:

Documentation task

Priority:

Trivial

Reporter:

Mikko Tikkala

Assignee:

Vladislavs Sokurenko

Resolution:

Fixed

Votes:

Labels:

None

Remaining Estimate:

Not Specified

Time Spent:

Original Estimate:

Not Specified

Environment:

Debian 10 VM
2vCPU, 4GB RAM
MariaDB

Attachments:

SNMP_bandwith_spike.png

Screenshot 2023-08-04 at 13.58.23.png

Screenshot 2023-09-29 at 13.17.14.png

Screenshot 2023-09-29 at 14.08.02.png

Screenshot 2023-09-29 at 14.08.12.png

image-2022-03-17-09-21-03-511.png

image-2023-09-29-14-23-04-565.png

Sprint:

Documentation backlog

Story Points:

0.125

Description

Issue has been discussed here:

https://www.zabbix.com/forum/zabbix-troubleshooting-and-problems/439582-zabbix-5-4-8-false-high-bandwidth-monitoring

Sometimes the Change per second calculation produces very large values (terabits per second on a 1 Gbit interface). The faulty values are usually related to outbound traffic (ifHCOutOctets, ifOutUcastPkts). Raw SNMP counter values seem to grow steadily, and snmpwalk also shows reasonable values.
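
As a rough illustration of why such values are impossible rather than merely high, the sketch below compares a per-second octet rate against the interface speed. It is only a plausibility check written for this ticket, not something Zabbix is known to do. The function names and the example figures are hypothetical, and the speed is assumed to be given in bits per second.

```python
def max_octets_per_second(if_speed_bps: int) -> float:
    """Upper bound for an octet counter's growth rate on a link of this speed."""
    return if_speed_bps / 8


def is_plausible_rate(octets_per_second: float, if_speed_bps: int) -> bool:
    """True if the rate fits within the physical capacity of the interface."""
    return 0 <= octets_per_second <= max_octets_per_second(if_speed_bps)


GIGABIT = 1_000_000_000  # 1 Gbit/s interface, in bits per second

print(is_plausible_rate(40_000_000, GIGABIT))       # 320 Mbit/s -> True
print(is_plausible_rate(250_000_000_000, GIGABIT))  # 2 Tbit/s -> False
```

Any change per second result that fails this check cannot reflect real traffic, whatever its cause turns out to be.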

In my case, the faulty values stopped appearing after bulk requests were disabled. Another user reports that disabling bulk requests made the faulty values rarer.

Could some overflow or similar problem corrupt the values when bulk requests are used?

In my case the device is a Cisco IE-5000 switch with ~700 items. The items are polled every 5 minutes. The Zabbix server handles fewer than 100 NVPS.

The problem can be seen with the following Zabbix versions:

4.0.16, 5.0.10, 5.0.19, 5.0.21.

Comments

Comment by Rodrigo P [ 2022 Mar 15 ]

I see the same behavior on Juniper SRX and J series devices with Zabbix 5.2.6. I have not been able to determine the cause yet.

What I can say is that it doesn't happen with other tools that collect the same data from these routers.

Comment by Edgars Melveris [ 2022 Mar 16 ]

Could you set up another item collecting the exact same OID, but without any preprocessing? Just raw values with no change per second.

Comment by Dimitri Bellini [ 2022 Mar 17 ]

I would like to add to this thread: we discovered the same problem at one of our customers.
See the picture (SNMP_bandwith_spike.png): the switch port is a 16 Gbit port, but we see spikes of 80 GByte.
We have had SNMP bulk requests disabled since the beginning because these switches are Brocade.
@DevTeam, is this a real bug?

PS: I forgot to mention that we are using Zabbix 5.0.14.

Thanks so much

Comment by Mikko Tikkala [ 2022 Mar 17 ]

Here is data from the change per second item and the raw item.

I posted the text below to the forum thread yesterday (Zabbix 5.4.8 False High Bandwidth Monitoring - ZABBIX Forums):

"Made a really interesting observation today when looking at the pcap data.

In this case I have two items reading the ifInUcastPkts OID:
one to calculate packets per second and one for raw values.

This set seems to belong to the item with raw values. The counter values increase steadily.

"frame.time": "Mar 16, 2022 12:26:22.496782000 FLE Standard Time",
"1.3.6.1.2.1.2.2.1.11.10277: 4079198949"

"frame.time": "Mar 16, 2022 12:31:22.439778000 FLE Standard Time",
"1.3.6.1.2.1.2.2.1.11.10277: 4079859516" (diff to previous 660 567)

"frame.time": "Mar 16, 2022 12:36:22.514265000 FLE Standard Time",
"1.3.6.1.2.1.2.2.1.11.10277: 4080520255" (diff to previous 660 739)

"frame.time": "Mar 16, 2022 12:41:22.425662000 FLE Standard Time",
"1.3.6.1.2.1.2.2.1.11.10277: 4081180176" (diff to previous 659 921)

This set seems to belong to the pps item.

"frame.time": "Mar 16, 2022 12:26:22.522068000 FLE Standard Time",
"1.3.6.1.2.1.2.2.1.11.10277: 4079198949"

"frame.time": "Mar 16, 2022 12:31:22.497455000 FLE Standard Time",
"1.3.6.1.2.1.2.2.1.11.10277: 4082365523" (diff to previous 3 166 574)

"frame.time": "Mar 16, 2022 12:36:22.556665000 FLE Standard Time",
"1.3.6.1.2.1.2.2.1.11.10277: 4080520255" (diff to previous -1 845 268)

"frame.time": "Mar 16, 2022 12:41:22.455305000 FLE Standard Time",
"1.3.6.1.2.1.2.2.1.11.10277: 4081180176" (diff to previous 659 921)

Negative counter deltas?!?!
It seems that the Cisco IE-5000 switch sends faulty counter values when SNMP bulk requests are used."
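
The differences quoted above can be recomputed with a short Python sketch. The counter values are copied from the two capture sets; the function and variable names are only illustrative.

```python
# ifInUcastPkts samples (1.3.6.1.2.1.2.2.1.11.10277) taken from the capture above.
RAW_ITEM = [4079198949, 4079859516, 4080520255, 4081180176]
PPS_ITEM = [4079198949, 4082365523, 4080520255, 4081180176]


def deltas(values):
    """Difference between each sample and the one before it."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


def decreasing_positions(values):
    """Indexes of samples that are lower than the sample before them."""
    return [i for i in range(1, len(values)) if values[i] < values[i - 1]]


print(deltas(RAW_ITEM))                # [660567, 660739, 659921]
print(deltas(PPS_ITEM))                # [3166574, -1845268, 659921]
print(decreasing_positions(PPS_ITEM))  # [2]
```

Only the 12:31 reply in the pps set differs from the raw set. The samples before and after it are identical in both sets, so the negative difference comes from one reply that was too high, not from the counter running backwards.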

Comment by Edgars Melveris [ 2022 Mar 17 ]

It's a well-known fact that some devices behave "interestingly" if bulk requests are used.
Can you check if the same happens with bulk turned off?

Just a wild guess: the device somehow mixes metrics from different indexes, in which case you really might see negative values or huge spikes.

Comment by Mikko Tikkala [ 2022 Mar 17 ]

At least the Cisco IE-5000 seems to be OK when bulk requests are turned off.

However, another user said that disabling bulk requests didn't solve the problem completely.

Packet capture is the final truth, so I encourage users to capture traffic and look at how the switch is responding.

Comment by Rodrigo P [ 2022 Nov 05 ]

Hi all. Has anyone made any progress on finding the root cause for this problem? Thanks.

Comment by Alex Le Heux [ 2023 Aug 04 ]

This comment is about an SNMP ifHCOutOctets item.

We are experiencing the same issue and I think that in our case it's related to the fact that some devices don't always report the SNMP value when polled. These are all devices that have a tiny CPU. This results in missing values in the item's history.

The next time the device is polled and the Change Per Second preprocessing is run, the following appears to happen:

The Change Per Second preprocessor sees the previous missing value and takes it to be 0. The ifHCOutOctets from the device is of course the total number of bytes sent since boot and is a very large value, causing the result to be ludicrously large.

Every single time this happens for us, the item's history has a missing entry, like this:

Comment by Vladislavs Sokurenko [ 2023 Sep 29 ]

It looks like values that are lower than the previous ones are arriving.

Comment by Alex Le Heux [ 2023 Sep 29 ]

All the "normal" values are deltas, so relative to the previous value.
The issue here is that when one value goes unreported by the device (for whatever reason, in our case usually devices with small CPUs) the next delta is calculated relative to 0. What Zabbix should do is ignore the missing value and use the previously recorded value.
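
To make the two behaviors being contrasted here concrete, the following Python sketch models them side by side. It illustrates the hypothesis in this comment only; it is not taken from the Zabbix source, and the thread does not confirm that Zabbix treats a missing value as 0. The sample values, the 60-second interval, and the function names are hypothetical.

```python
from typing import List, Optional, Sequence

Samples = Sequence[Optional[int]]  # None marks a poll that got no reply


def rates_missing_as_zero(samples: Samples, interval: float) -> List[Optional[float]]:
    """Hypothesised faulty behaviour: a missing sample is remembered as 0."""
    rates: List[Optional[float]] = []
    previous: Optional[int] = None
    for value in samples:
        if value is None or previous is None:
            rates.append(None)
        else:
            rates.append((value - previous) / interval)
        previous = 0 if value is None else value
    return rates


def rates_skipping_missing(samples: Samples, interval: float) -> List[Optional[float]]:
    """Proposed behaviour: ignore the gap and use the last recorded value."""
    rates: List[Optional[float]] = []
    previous: Optional[int] = None
    elapsed = 0.0
    for value in samples:
        elapsed += interval
        if value is None:
            rates.append(None)  # nothing to calculate, keep the old value
            continue
        rates.append(None if previous is None else (value - previous) / elapsed)
        previous, elapsed = value, 0.0
    return rates


samples = [1_000_000, 1_060_000, None, 1_180_000, 1_240_000]
print(rates_missing_as_zero(samples, 60))
print(rates_skipping_missing(samples, 60))
```

In the first function the value after the gap is divided as if the counter had started from 0, which gives a rate roughly twenty times the real one for these numbers. The second function spreads the delta over the two elapsed intervals and stays at 1000 per second.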

Comment by Vladislavs Sokurenko [ 2023 Sep 29 ]

Is it possible that 0 was received?
Does something like this help?

Comment by Alex Le Heux [ 2023 Sep 29 ]

I checked with tcpdump, and this happens when there's no reply for some reason.

But your suggestion looks very promising. I'm going to test that out right away! Thanks!

Comment by Vladislavs Sokurenko [ 2023 Sep 29 ]

History is always updated to the new value. When a newer value arrives that is lower than the previous one, the result of the calculation is discarded (that's why there is a missing value). However, when the next value arrives, the change is calculated against that previous value, so if the previous value was incorrect (for example 0), the calculation is incorrect.
It is currently unknown what this value was. I can make a patch so that it is printed in the log.
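
The behavior described in this comment can be modelled in a few lines of Python. This is a simplified sketch of the description only, not the actual preprocessing code: the function name is hypothetical, and a fixed polling interval is assumed instead of real timestamps.

```python
from typing import List, Optional, Sequence


def change_per_second(samples: Sequence[int], interval: float) -> List[Optional[float]]:
    """Model of the described logic: the stored value is always replaced,
    and a result is produced only when the new value is not lower."""
    results: List[Optional[float]] = []
    previous: Optional[int] = None
    for value in samples:
        if previous is None or value < previous:
            results.append(None)  # first sample, or lower than previous: discarded
        else:
            results.append((value - previous) / interval)
        previous = value  # updated even when the result was discarded
    return results


# A bogus 0 between two good samples (hypothetical values, 60 s interval):
print(change_per_second([1_000_000, 0, 1_060_000], 60))

# The pps item samples from the 2022 Mar 17 capture (5 min interval):
print(change_per_second([4079198949, 4082365523, 4080520255, 4081180176], 300))
```

In the first run the 0 produces no result, and the next sample is then measured against 0, giving a huge rate. In the second run the reply that was too high produces a spike and is followed by a missing value, after which the rate returns to normal.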

Comment by Vladislavs Sokurenko [ 2023 Sep 29 ]

Please also see ZBX-14318; it seems to confirm that devices sometimes cannot handle bulk requests.

Comment by Alex Le Heux [ 2023 Sep 29 ]

I've played around with the Custom-on-fail setting and that doesn't help. I'll try turning off bulk requests, but we have hundreds of devices, so I'm not sure what other issues that'll cause.

I still think that the "change per second" function does the wrong thing by calculating the delta from 0 if there's one missing value though. That's the real issue.

Comment by Vladislavs Sokurenko [ 2023 Sep 29 ]

If possible, I can provide a patch that prints this value. The value is missing because a value was returned that is lower than the previous one. If the issue is frequently reproducible, you could also monitor the same OID without Simple change and check whether some new values are lower than the previous one and then suddenly too big. Let's see if disabling combined requests helps; currently I'm not 100% sure whether the issue is in Zabbix or whether wrong data was simply received.

Comment by Alex Le Heux [ 2023 Sep 29 ]

I've tcpdumped the traffic between the zabbix proxy and the device and in all cases that I checked the issue wasn't a wrong value, the issue was that no reply was sent for that OID.

But I can easily add an item without simple change to monitor and double check.

Comment by Alex Le Heux [ 2023 Sep 29 ]

OID monitored every 60 seconds, bulk requests enabled.

The OID with Change Per Second:
(note the errors when there's a missing value)

The same OID without change per second.

Comment by Vladislavs Sokurenko [ 2023 Sep 29 ]

What version of Zabbix server is it? Would it be possible to increase the log level for the preprocessing worker and attach the log? It's quite unexpected that a missing value affects the delta calculation. Are the raw results collected without any preprocessing?

zabbix_server -R log_level_increase="preprocessing worker"

Comment by Alex Le Heux [ 2023 Sep 29 ]

We're on 6.0.19.

The server monitors over a thousand devices via 16 proxies, so I'll have to see what I can do with regard to logging.

Comment by Alex Le Heux [ 2023 Sep 29 ]

FYI: We'll run the server with extra logging on Monday. No production changes here on Friday afternoon. We like our weekends.

Comment by Vladislavs Sokurenko [ 2023 Sep 29 ]

Preprocessing and collection are done by the proxy, so if only one interface is affected, maybe it's possible to create another proxy with just this interface and its checks. Unfortunately, we cannot reproduce it locally; for further investigation it might be necessary to patch the proxy with additional debug information. It would be easier to investigate the issue if it's isolated.

Comment by Alex Le Heux [ 2023 Sep 29 ]

Ah, that's good to know. But this affects many devices, not just one, although it tends to be the devices with smaller CPUs.

Either way, we'll gather the logs for you on Monday!

Comment by Vladislavs Sokurenko [ 2023 Sep 29 ]

Thanks. If the proxy is unpatched, it's better to increase the log level for the poller on the proxy; then we can look up the time of the problem in the log and check the log records for that time.

Comment by Alex Le Heux [ 2023 Oct 04 ]

Hi,
This took a bit longer than expected. I now have a 500+ MB log file that looks somewhat cryptic. Anything you want me to look for?

Comment by Vladislavs Sokurenko [ 2023 Oct 05 ]

Could you please grep for the times where there were missing entries, and perhaps for messages containing "snmp_sess_synch_response"?

Comment by Alex Le Heux [ 2023 Oct 09 ]

zabbix:/home/zabbix$ grep snmp_sess_synch_response log/20231003-zabbix02-hsnc01_proxy_log-preprocessing-worker.log
zabbix:/home/zabbix$

Nothing. This is the log from the proxy.

Comment by Andrii Fediuk [ 2025 Oct 31 ]

Updated documentation:
- SNMP agent: 7.0, 7.2, 7.4, 8.0 (added information on how to handle false spikes).

Generated at Mon Jun 08 04:02:47 EEST 2026 using Jira 10.3.18#10030018-sha1:5642e4ad348b6c2a83ebdba689d04763a2393cab.
