[ZBX-18876] SNMPv3 unstable use of snmpEngineTime results in usmStatsNotInTimeWindows Created: 2021 Jan 13  Updated: 2023 Jun 27

Status: Reopened
Project: ZABBIX BUGS AND ISSUES
Component/s: Proxy (P)
Affects Version/s: 5.0.6
Fix Version/s: None

Type: Problem report
Priority: Trivial
Reporter: H.L.
Assignee: Zabbix Support Team
Resolution: Unresolved
Votes: 1
Labels: proxy, snmpv3
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

net-snmp.x86_64 1:5.7.2-49.el7 @centos7-base-x86_64
net-snmp-agent-libs.i686 1:5.7.2-49.el7 @centos7-base-x86_64
net-snmp-agent-libs.x86_64 1:5.7.2-49.el7 @centos7-base-x86_64
net-snmp-devel.i686 1:5.7.2-49.el7 @centos7-base-x86_64
net-snmp-libs.i686 1:5.7.2-49.el7 @centos7-base-x86_64
net-snmp-libs.x86_64 1:5.7.2-49.el7 @centos7-base-x86_64
net-snmp-perl.x86_64 1:5.7.2-49.el7 @centos7-base-x86_64
net-snmp-utils.x86_64 1:5.7.2-49.el7 @centos7-base-x86_64
zabbix-agent.x86_64 5.0.7-1.el7 @zabbix5.0
zabbix-proxy-mysql.x86_64 5.0.7-1.el7 @zabbix5.0
zabbix-release.noarch 5.0-1.el7 @zabbix4.0


Attachments: PNG File snmp_usmStatsNotInTimeWindows.png     PNG File snmp_usmStatsNotInTimeWindows_allpackets.png    
Issue Links:
Duplicate

 Description   

Hi,

We are monitoring a lot of SNMPv3 devices with this Zabbix proxy and have issues with flapping SNMP availability. After some troubleshooting I discovered that Zabbix seems to switch the snmpEngineTime to unknown, unpredictable, or wrong values, which makes the availability of the devices unstable.

I monitored all our traffic on this proxy with Wireshark and I am sure there are no duplicate snmpEngineIDs (double-checked).

Steps to reproduce:

  1. Monitor a lot of different snmpv3 devices from a zabbix proxy.
  2. Make sure to use more than 1 Poller. Our StartPollers value is 75.
  3. Make sure to have unique snmp engineIDs in the whole network. I checked this with wireshark.
  4. Wait for flapping snmp connectivity triggers
  5. Create a tcpdump of all snmp traffic on the proxy and analyze snmp communication.

Result:

Flapping SNMP availability of the devices, because Zabbix uses wrong snmpEngineTime values, as you can see in the screenshots. If you look at "snmp_usmStatsNotInTimeWindows_allpackets.png", you will see that there is no communication with other devices at the exact moment Zabbix switches to the wrong snmpEngineTime. The "allpackets" screenshot shows all SNMP traffic on this machine. It seems Zabbix switches to the wrong snmpEngineTime without reason, and communication starts to degrade. The monitored device correctly reports usmStatsNotInTimeWindows, but Zabbix does not seem to act on it. Furthermore, requests with the correct and the wrong snmpEngineTime seem to happen in parallel. I guess this is because Zabbix uses several pollers to query the device, but only one poller uses the correct snmpEngineTime.
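
For reference, the check that makes a device answer with usmStatsNotInTimeWindows can be sketched as follows. This is a minimal Python illustration of the timeliness rule in RFC 3414 for an authoritative engine, not code from Zabbix or net-snmp. The function and constant names are made up for this example.

```python
# Sketch of the RFC 3414 timeliness check on the authoritative side
# (the monitored device). All names here are illustrative assumptions.

TIME_WINDOW_SECONDS = 150        # allowed deviation of snmpEngineTime
MAX_ENGINE_BOOTS = 2147483647    # boots counter latched at its maximum


def is_outside_time_window(local_boots, local_time, msg_boots, msg_time):
    """Return True if the device would report usmStatsNotInTimeWindows."""
    if local_boots == MAX_ENGINE_BOOTS:
        return True
    if msg_boots != local_boots:
        return True
    return abs(msg_time - local_time) > TIME_WINDOW_SECONDS


# A request whose engine time is 100 seconds off is still accepted.
print(is_outside_time_window(0, 1000, 0, 1100))
# A request with an engine time that is far off is rejected.
print(is_outside_time_window(0, 1000, 0, 90000))
```

Under this rule, a poller that sends an snmpEngineTime far from the device's own value is rejected even when the engine ID and boots counter match.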

Expected:
No flapping snmp availability and zabbix proxy is using the correct snmpEngineTime for communication.



 Comments   
Comment by Oleksii Zagorskyi [ 2021 Jan 14 ]

Have you had a chance to read ZBX-8385 ?

It also describes cases where the monitored device does not follow the RFC requirements, which causes the issue.

Comment by H.L. [ 2021 Jan 15 ]

Hi Oleksii,

Thanks for your help and for pointing me to ZBX-8385. After reading it I can say that ZBX-8385 describes two possible reasons, but neither fits the problem I found.

From ZBX-8385:

We can split the "usmStatsNotInTimeWindows" case to two possible reasons:
a) there indeed are monitored snmpV3 devices which have identical msgAuthoritativeEngineID

This is not the case. I doublechecked and there are no duplicate snmpEngineIDs.

b) some of devices have incorrect behavior and don't follow RFC3414. For example I know device which always use engineBoot=1, even after reboot.

This doesn't fit either. The monitored device did not reboot, and in all SNMPv3 packets of this device the snmpEngineBoot value is 0. Furthermore, the snmpEngineTime did not overflow, thus the snmpEngineBoot value of 0 is correct.

But ZBX-8385 describes an interesting fact:

Zabbix server|proxy is a multi-process application. Every poller|unreachable_poller|and_some_other_process_type when starts - it loads libnetsnmp.

Every such process keeps its own libsnmp's in-RAM cache. What we want to consider is "etimelist" (see lcd_time.c)

This explains why it is even possible that zabbix starts to poll the device with correct snmpEngineTime and wrong snmpEngineTime in parallel.
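
A small sketch of that effect: each poller process holds its own engine-time entry and extrapolates it from its local clock, so two pollers can send different snmpEngineTime values for the same device at the same moment. This is a simplified Python model, not the actual net-snmp data structure. The class names and the numbers are assumptions chosen for illustration.

```python
# Simplified model of a per-process engine-time cache.
# This is NOT the net-snmp implementation; names and values are assumptions.
from dataclasses import dataclass


@dataclass
class EngineTimeEntry:
    engine_boots: int
    engine_time: int     # engine time last learned for this engine ID
    learned_at: float    # local clock (seconds) when it was learned

    def estimate(self, now: float) -> int:
        """Extrapolate the engine time from the local clock."""
        return self.engine_time + int(now - self.learned_at)


class PollerCache:
    """One instance per poller process; nothing is shared between pollers."""

    def __init__(self):
        self.entries = {}

    def learn(self, engine_id, boots, engine_time, now):
        self.entries[engine_id] = EngineTimeEntry(boots, engine_time, now)

    def estimate(self, engine_id, now):
        return self.entries[engine_id].estimate(now)


poller_a = PollerCache()
poller_b = PollerCache()
poller_a.learn("engine-1", 0, 1000, now=0.0)    # correct value
poller_b.learn("engine-1", 0, 90000, now=0.0)   # hypothetical wrong value

# Sixty seconds later both pollers keep counting from what they cached.
print(poller_a.estimate("engine-1", now=60.0))
print(poller_b.estimate("engine-1", now=60.0))
```

In this model the two estimates never converge on their own, which matches the picture of correct and wrong requests being sent in parallel.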

At the moment this happened I see the "first network error, wait for 15 seconds" and "temporarily disabling SNMP agent checks on host: host unavailable" errors in the logs.

Based on this behaviour and RFC3414, I would suggest updating the poller code of Zabbix. If any poller instance receives a usmStatsNotInTimeWindows report from a device, that same instance should update its snmpEngineTime and snmpEngineBoot values according to the values in the report and try again. The device should only be treated as unreachable if the retry with the updated snmpEngineTime and snmpEngineBoot values fails again. This would also make Zabbix more stable from an end-user perspective.
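
The suggestion above can be sketched roughly like this. It is a minimal Python illustration with a fake device, not Zabbix or net-snmp code. `FakeDevice`, `poll_with_resync` and the dictionary layout are hypothetical names introduced for this example, and a real implementation would only trust reports that pass authentication.

```python
# Sketch of "resynchronize once, then retry" on usmStatsNotInTimeWindows.
# All names are hypothetical; this is not how Zabbix is implemented.

NOT_IN_TIME_WINDOWS = "usmStatsNotInTimeWindows"


class FakeDevice:
    """Stand-in for an SNMPv3 agent with a fixed engine boots/time."""

    def __init__(self, boots, engine_time):
        self.boots = boots
        self.engine_time = engine_time

    def get(self, boots, engine_time):
        # Reject requests outside the 150 second window and report
        # the device's own values back, as a Report PDU would.
        if boots != self.boots or abs(engine_time - self.engine_time) > 150:
            return {"report": NOT_IN_TIME_WINDOWS,
                    "boots": self.boots, "time": self.engine_time}
        return {"value": "ok"}


def poll_with_resync(device, cache):
    """Poll once; on a time-window report, update the cache and retry once."""
    response = device.get(cache["boots"], cache["time"])
    if response.get("report") == NOT_IN_TIME_WINDOWS:
        cache["boots"] = response["boots"]
        cache["time"] = response["time"]
        response = device.get(cache["boots"], cache["time"])
    reachable = "report" not in response
    return reachable, response


device = FakeDevice(boots=0, engine_time=5000)
stale_cache = {"boots": 0, "time": 90000}   # wrong value held by one poller
reachable, response = poll_with_resync(device, stale_cache)
print(reachable, stale_cache)
```

Only a second failure after the resynchronization would count as a network error in this sketch.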

Comment by Oleksii Zagorskyi [ 2021 Jan 15 ]

This is your friend: https://www.zabbix.com/documentation/5.0/manual/introduction/whatsnew500#manual_snmp_cache_clearing

Nothing to add on the Zabbix side; this issue is to be closed.

Comment by Oleksii Zagorskyi [ 2021 Jan 15 ]

Don't get me wrong about closing the current issue. But taking into account that you found new important details, this one should be closed.

If you still think that something is wrong, create a new issue taking into account all new details.

Comment by H.L. [ 2021 Jan 15 ]

Yes, I absolutely don't understand why you are closing this case. I found an issue in Zabbix, documented and reported it, and suggested how to solve it. You pointed me to manually clearing the SNMP cache to "solve" this issue. This might be a workaround until the code gets fixed. Zabbix is promoted to "Automate monitoring of large, dynamic environments". How does that fit with manually clearing the SNMP cache from time to time because the code won't be fixed? Zabbix behaves unstably and the users have to intervene manually to keep it working until it becomes unstable again. I invested a lot of time in analyzing this issue to make Zabbix a better piece of software, and you just ignore it. Great job, thanks.

Comment by H.L. [ 2021 Jan 15 ]

Taking into account all new details, it is still the same: Zabbix behaves unstably because it doesn't correctly track snmpEngineTime with multiple pollers. Manually clearing the SNMP cache is not a solution, it is only a workaround.

Comment by Oleksii Zagorskyi [ 2021 Jan 15 ]

I've spent quite a lot of time working with SNMP, especially v3, so now I'm pretty confident in saying that there is no bug in Zabbix.
Maybe there is something that is not what you expected or wanted, but that's not a bug, I'm sorry.

The RFC requires not to "track snmpEngineTime" but to reject it, for security reasons.
If a device does not follow the RFC, that's not Zabbix's fault.

Comment by H.L. [ 2021 Jan 15 ]

So, how do you explain that the Zabbix poller process starts to use a completely made-up snmpEngineTime value? In all my traces the monitored devices behave in an RFC-compliant way. They increase their values correctly, as they should. Only Zabbix starts, out of nowhere, to use weird snmpEngineTime values without any known reason. I even filtered my traces to find any other device with an snmpEngineTime value like the one Zabbix made up, but didn't find one. Nothing tells Zabbix to invent snmpEngineTime values.

Comment by Oleksii Zagorskyi [ 2021 Jan 18 ]

That is all done by the library itself, not by Zabbix.
It's possible that a small number of Zabbix server pollers "caught" the higher EngineTime for a duplicated EngineID some time ago and continued to "count" the clock (in the library's cache). Or the device was rebooted.

snmpEngineBoot=0 (as you said) is an almost 100% guarantee (IMO) that the device does not follow the RFC.

Try to restart Zabbix or clear the cache with the option I implemented.
I'm sorry, but I do not see much sense in arguing here.

Comment by Craig Hopkins [ 2022 Sep 02 ]

Related to ZBX-21557 ?

Generated at Wed May 07 06:33:07 EEST 2025 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.