[ZBX-18876] SNMPv3 unstable use of snmpEngineTime results in usmStatsNotInTimeWindows Created: 2021 Jan 13 Updated: 2023 Jun 27 |
|
Status: | Reopened |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Proxy (P) |
Affects Version/s: | 5.0.6 |
Fix Version/s: | None |
Type: | Problem report | Priority: | Trivial |
Reporter: | H.L. | Assignee: | Zabbix Support Team |
Resolution: | Unresolved | Votes: | 1 |
Labels: | proxy, snmpv3 | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified | ||
Environment: |
net-snmp.x86_64 1:5.7.2-49.el7 @centos7-base-x86_64 |
Description |
Hi, we are monitoring a lot of SNMPv3 devices with this Zabbix proxy and have some issues with flapping SNMP availability. After some troubleshooting I discovered that Zabbix seems to switch the snmpEngineTime to unknown, unpredictable, or wrong values, which makes the availability of the devices unstable. I monitored all our traffic on this proxy with Wireshark and I am sure there are no duplicate snmpEngineIDs (double-checked). Steps to reproduce:
Result: Flapping SNMP availability of the devices, because Zabbix uses wrong snmpEngineTime values, as you can see in the screenshots. If you look at "snmp_usmStatsNotInTimeWindows_allpackets.png", you will see that there is no communication with other devices at the exact moment Zabbix switches to the wrong snmpEngineTime. The "allpackets" screenshot shows all SNMP traffic on this machine. It seems Zabbix switches to the wrong snmpEngineTime without reason, and communication starts to degrade. The monitored device correctly reports usmStatsNotInTimeWindows, but Zabbix does not seem to react to it. Furthermore, requests appear to be sent in parallel with correct and wrong snmpEngineTime values. I guess this is because Zabbix uses several pollers to query the device, but only one poller uses the correct snmpEngineTime. Expected: |
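For readers unfamiliar with the report mentioned above, the timeliness check that an authoritative SNMPv3 engine (here, the monitored device) applies to incoming requests can be sketched as follows. This is a minimal illustration of the rule in RFC 3414, section 3.2, step 7a; the function and constant names are made up for this sketch and do not come from Zabbix or net-snmp.

```python
TIME_WINDOW_SECONDS = 150        # window size defined by RFC 3414
MAX_ENGINE_BOOTS = 2147483647    # boots counter is latched at this value


def authoritative_in_time_window(local_boots, local_time, msg_boots, msg_time):
    """Return True if a request passes the device-side timeliness check.

    local_* are the device's own snmpEngineBoots / snmpEngineTime,
    msg_* are the values the manager put into the request.
    (Hypothetical helper, for illustration only.)
    """
    if local_boots == MAX_ENGINE_BOOTS:
        return False
    if msg_boots != local_boots:
        return False
    return abs(msg_time - local_time) <= TIME_WINDOW_SECONDS


# A poller with a slightly drifted clock is accepted, while a poller
# sending an unrelated engine time is answered with a report that
# increments usmStatsNotInTimeWindows.
print(authoritative_in_time_window(0, 5000, 0, 5003))
print(authoritative_in_time_window(0, 5000, 0, 912345))
```

The concrete numbers (5000, 5003, 912345) are invented for the example and are not taken from the attached traces.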
Comments |
Comment by Oleksii Zagorskyi [ 2021 Jan 14 ] |
Have you had a chance to read It also describes cases where the monitored device does not follow the RFC requirements, which causes the issue. |
Comment by H.L. [ 2021 Jan 15 ] |
Hi Oleksii, Thanks for your help and pointing me to

From We can split the "usmStatsNotInTimeWindows" case to two possible reasons:

This is not the case. I double-checked and there are no duplicate snmpEngineIDs.

b) some of devices have incorrect behavior and don't follow RFC3414. For example I know device which always use engineBoot=1, even after reboot.

This doesn't fit either. The monitored device did not reboot, and in all SNMPv3 packets from this device the snmpEngineBoot value is 0. Furthermore, the snmpEngineTime did not overflow, thus the snmpEngineBoot value of 0 is correct.

But Zabbix server|proxy is a multi-process application. Every poller|unreachable_poller|and_some_other_process_type when starts - it loads libnetsnmp. Every such process keeps its own libsnmp's in-RAM cache. What we want to consider is "etimelist" (see lcd_time.c)

This explains why it is even possible for Zabbix to poll the device with a correct snmpEngineTime and a wrong snmpEngineTime in parallel. At the moment this happened, I see the "first network error, wait for 15 seconds" and "temporarily disabling SNMP agent checks on host: host unavailable" errors in the logs.

Based on this behaviour and RFC3414, I would suggest updating the poller code of Zabbix. If any poller instance receives a usmStatsNotInTimeWindows error report from a device, that same instance should update its snmpEngineTime and snmpEngineBoot values according to the values in the report and try again. The device should only be treated as unreachable if the retry with the updated snmpEngineTime and snmpEngineBoot values fails again. This would also make Zabbix more stable from an end-user perspective. |
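The retry behaviour suggested in this comment can be sketched roughly as below. This is only a simulation of the proposal, not Zabbix or net-snmp code: `FakeDevice`, `EngineClock`, and `poll_with_resync` are hypothetical names, the device is a stand-in that only performs the 150-second timeliness check, and authentication of the report is assumed rather than modelled.

```python
from dataclasses import dataclass


@dataclass
class EngineClock:
    boots: int   # snmpEngineBoots
    time: int    # snmpEngineTime


class FakeDevice:
    """Stand-in for an authoritative SNMPv3 engine (hypothetical)."""

    def __init__(self, boots, time):
        self.clock = EngineClock(boots, time)
        self.not_in_time_windows = 0   # plays the role of usmStatsNotInTimeWindows

    def handle(self, msg_boots, msg_time):
        in_window = (msg_boots == self.clock.boots
                     and abs(msg_time - self.clock.time) <= 150)
        if in_window:
            return ("response", None)
        self.not_in_time_windows += 1
        # The report carries the device's own boots/time values.
        return ("report", EngineClock(self.clock.boots, self.clock.time))


def poll_with_resync(device, cache):
    """Poll once; on a NotInTimeWindows report, adopt the reported
    values in this poller's own cache and retry exactly once."""
    kind, reported = device.handle(cache.boots, cache.time)
    if kind == "response":
        return True
    cache.boots, cache.time = reported.boots, reported.time
    kind, _ = device.handle(cache.boots, cache.time)
    return kind == "response"   # only now would the host count as unreachable


device = FakeDevice(boots=0, time=5000)
stale_poller_cache = EngineClock(boots=0, time=912345)
print(poll_with_resync(device, stale_poller_cache), stale_poller_cache)
```

In this sketch each poller owns its own `EngineClock`, mirroring the per-process cache described above, so one poller resynchronising does not affect the others.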
Comment by Oleksii Zagorskyi [ 2021 Jan 15 ] |
This is your friend: https://www.zabbix.com/documentation/5.0/manual/introduction/whatsnew500#manual_snmp_cache_clearing Nothing to add on the Zabbix side; this issue is to be closed. |
Comment by Oleksii Zagorskyi [ 2021 Jan 15 ] |
Don't get me wrong about closing the current issue. But taking into account that you found important new details, this one should be closed. If you still think that something is wrong, create a new issue that takes all the new details into account. |
Comment by H.L. [ 2021 Jan 15 ] |
Yes, I absolutely don't understand why you close this case. I found an issue in zabbix, documented and reported it and suggested how to solve it. You pointed me to manually |
Comment by H.L. [ 2021 Jan 15 ] |
Taking into account all new details, it is still the same: Zabbix behaves unstably because it doesn't correctly track snmpEngineTime across multiple pollers. Manually clearing the SNMP cache is not a solution; it is only a workaround. |
Comment by Oleksii Zagorskyi [ 2021 Jan 15 ] |
I've spent quite a lot of time working with SNMP, especially v3, so I'm now pretty confident in saying that there is no bug in Zabbix. The RFC requires not to "track snmpEngineTime" but to reject it, for security reasons. |
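As background for the RFC argument, the rule under which a non-authoritative engine (the manager side) may update its cached notion of a device's clock can be sketched as follows. This is a simplified reading of RFC 3414, section 3.2, step 7b.1, which applies to authenticated messages; `CachedEngine` and `maybe_update` are hypothetical names, and the sketch does not model how net-snmp advances the cached time between messages.

```python
from dataclasses import dataclass


@dataclass
class CachedEngine:
    boots: int                  # local notion of snmpEngineBoots
    time: int                   # local notion of snmpEngineTime
    latest_received_time: int   # latestReceivedEngineTime


def maybe_update(cache, msg_boots, msg_time):
    """Update the cache only if the message is 'newer' than what is
    already known; return True if an update happened."""
    newer = msg_boots > cache.boots or (
        msg_boots == cache.boots and msg_time > cache.latest_received_time
    )
    if newer:
        cache.boots = msg_boots
        cache.time = msg_time
        cache.latest_received_time = msg_time
    return newer


cache = CachedEngine(boots=0, time=912345, latest_received_time=912345)
print(maybe_update(cache, 0, 5000))   # same boots, older time: not adopted
print(maybe_update(cache, 1, 10))     # higher boots: adopted
```

Under this rule, values that look older than the cached ones are not adopted, which is intended as protection against replayed messages. The numbers are invented for the example.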
Comment by H.L. [ 2021 Jan 15 ] |
So, how do you explain that the Zabbix poller process starts to use a completely made-up snmpEngineTime value? In all my traces the monitored devices behave in an RFC-compliant way. They increase their values correctly, as they should. Only Zabbix starts, out of nowhere, to use weird snmpEngineTime values without any known reason. I even filtered my traces to find any other device with a snmpEngineTime value like the one Zabbix made up, but didn't find one. Nothing tells Zabbix to invent snmpEngineTime values. |
Comment by Oleksii Zagorskyi [ 2021 Jan 18 ] |
That is all done by the library itself, not by Zabbix. snmpEngineBoot=0 (as you said) is an almost 100% guarantee (IMO) that the device does not follow the RFC. Try restarting Zabbix or clearing the cache with the option I implemented. |
Comment by Craig Hopkins [ 2022 Sep 02 ] |
Related to |