[ZBX-8385] snmpV3 report (response) "usmStatsNotInTimeWindows" treated as NETWORK_ERROR, which is bad and may mislead Created: 2014 Jun 23 Updated: 2025 Mar 18 Resolved: 2024 Mar 25 |
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Proxy (P), Server (S) |
Affects Version/s: | 2.2.4rc3, 2.3.0 |
Fix Version/s: | None |
Type: | Problem report | Priority: | Trivial |
Reporter: | Oleksii Zagorskyi | Assignee: | Dmitrijs Goloscapovs |
Resolution: | Workaround proposed | Votes: | 30 |
Labels: | availability, snmpv3 | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified |
Attachments: |
Issue Links: |
Team: |
Sprint: | Support backlog, S24-W10/11 |
Description |
This is a continuation of |
Comments |
Comment by Oleksii Zagorskyi [ 2014 Jun 23 ] |
First of all, some excerpts from RFC3414:
We can split the "usmStatsNotInTimeWindows" case into two possible reasons: a) two different devices in the network have identical engineIDs; b) incorrect agent behavior (engineBoots/engineTime) after a device reboot.

If we consider both cases, we can see that the devices (agents) actually do respond to zabbix (the manager) without delays with the report "SNMP-USER-BASED-SM-MIB::usmStatsNotInTimeWindows.0 (1.3.6.1.6.3.15.1.1.2.0)", but zabbix, after its Timeout, treats it as a network error.
Usually this report is returned with the authNoPriv level (as defined by RFC3414), so these packets can be seen in Wireshark without decryption (confirmed for the net-snmp daemon).

Here is an example of a device (case b)):

No. Time            Source          Destination     Length Protocol Info                                                      SRC port Engine-Boots Engine-Time Priv    Auth    Engine-MAC
61  16:14:22.027265 zabbix51        10.219.156.110  106    SNMP     get-request                                               43532    0            0           Not set Not set
62  16:14:22.028109 10.219.156.110  zabbix51        149    SNMP     report SNMP-USER-BASED-SM-MIB::usmStatsUnknownEngineIDs.0 161      1            1383217     Not set Not set 00:50:56:b4:30:a9
63  16:14:22.028234 zabbix51        10.219.156.110  191    SNMP     encryptedPDU: privKey Unknown                             43532    1            3205700     Set     Set     00:50:56:b4:30:a9
64  16:14:22.028735 10.219.156.110  zabbix51        174    SNMP     report SNMP-USER-BASED-SM-MIB::usmStatsNotInTimeWindows.0 161      1            1383217     Not set Set     00:50:56:b4:30:a9

Corresponding log line (server with Timeout=30):
18584:20140619:161452.029 SNMP agent item "ApplianceGGAgentProcessState" on host "tress" failed: first network error, wait for 15 seconds

Here I want to explain one thing which was not discussed at all (or was just mentioned briefly - in asaveljevs' quote in ).
Zabbix server|proxy is a multi-process application. Every poller, unreachable_poller and some other process types load libnetsnmp when they start. To confirm this, try for example adding two parameters, doDebugging 1 and debugTokens lcd, to /etc/snmp/snmp.conf and restart zabbix server - from the number of blocks of STDERR lines you will see how many times the library is loaded by zabbix processes.
Every such process keeps its own in-RAM libnetsnmp cache. What we want to consider is the "etimelist" (see lcd_time.c):

typedef struct enginetime_struct {
    u_char         *engineID;
    u_int           engineID_len;
    u_int           engineTime;
    u_int           engineBoot;
    /*
     * Time & boots values received from last authenticated
     * message within the previous time window.
     */
    time_t          lastReceivedEngineTime;
    /*
     * Timestamp made when engineTime/engineBoots was last
     * updated. Measured in seconds.
     */
    ...

I checked it many times and in different ways, and I'm sure that every zabbix process keeps its own independent copy of the "etimelist" structure.
What this means: for example, one or several zabbix pollers have successfully monitored an snmp device. Such a process already keeps the device's values of engineID, engineBoot, engineTime.
In general the picture will be the same if we add to zabbix another snmp device with the same engineID. But here two variants are possible:
And ... at this point the unreachable_pollers come into play. As a result, in the zabbix log we can see endlessly flapping events: "SNMP agent item ... failed: first network error", "resuming SNMP agent checks ... connection restored", generated by a poller and an unreachable_poller respectively. Imagine that item polling is distributed randomly between all available pollers|unreachable_pollers - and it becomes even more unpredictable.
This issue is not possible to reproduce using a single snmpget command, because every time we run the command it creates its own similar cache and populates it with the correct engineBoot, engineTime values initially received from the snmp device. This cache is destroyed after the command finishes.
Here are two debug lines from two different snmp checks:

-- Real timeout:
28452:20140619:173453.200 zbx_snmp_get_values() snmp_synch_response() status:1 errstat:-1 mapping_num:1
-- usmStatsNotInTimeWindows report:
27661:20140619:172253.954 zbx_snmp_get_values() snmp_synch_response() status:2 errstat:-1 mapping_num:1

added: see my comment dated 2015 Jan 16 01:24 for such lines in recent zabbix versions.

We see that zabbix definitely sees some difference in "status", which is taken from libsnmp.
What I suggest: zabbix should differentiate such cases and not treat the device polling result as NETWORK_ERROR if there were report packets from it. Well, you could say that if zabbix faces the usmStatsNotInTimeWindows case, then logically no snmpV3 checks will work on the device at all, so marking the snmp host as unreachable is a logical consequence. Still, I suggest marking the failed items as not supported. |
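To make the per-process cache easier to picture, here is a minimal, self-contained sketch (not Zabbix code; the engineID bytes and values are made up for illustration) using the lcd_time.c API that libnetsnmp itself calls internally. It shows that the cached boots/time live only inside the process that learned them:

```c
/* Minimal sketch of net-snmp's per-process engine time cache (lcd_time.c).
 * Not Zabbix code; the engineID bytes and values are invented for illustration. */
#include <net-snmp/net-snmp-config.h>
#include <net-snmp/net-snmp-includes.h>
#include <net-snmp/library/lcd_time.h>
#include <stdio.h>

int main(void)
{
    u_char engine_id[] = { 0x80, 0x00, 0x1f, 0x88, 0x80, 0x93, 0x4e, 0xde };
    u_int  boots = 0, engine_time = 0;

    /* What a poller learns from an authenticated response gets cached here ... */
    set_enginetime(engine_id, sizeof(engine_id), 22, 8152, 1);

    /* ... and is later looked up only from this process's private etimelist;
     * another poller process has its own copy and may still hold stale values. */
    get_enginetime(engine_id, sizeof(engine_id), &boots, &engine_time, 1);
    printf("cached boots=%u time=%u\n", boots, engine_time);

    return 0;
}
```

Running this in two separate processes shows that each one starts with an empty cache of its own, which is exactly why different zabbix pollers can hold different boots/time values for the same engineID.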
Comment by Tobias van Hoogen [ 2014 Jun 24 ] |
Oleksiy, this almost seems like you were sitting in a bath and had a EUREKA moment today! It fully has my vote - it's a big issue. Thank you for diving into it. |
Comment by Oleksii Zagorskyi [ 2014 Jun 24 ] |
Ohh, I forgot to describe the easiest way to emulate the usmStatsNotInTimeWindows error. Of course you can configure 2 snmp devices in the network with identical engineIDs (case a)). But it is easier to generate case b):
At this point (depending on how long snmpd was stopped and whether the item was polled with a timeout during this period or not) one of the pollers will check the Linux host.
The /var/lib/snmp/snmpd.conf file is not a general config file, it's a "persistent data" file where the snmpd daemon stores run-time values which should be preserved between restarts. |
Comment by Oleksii Zagorskyi [ 2014 Jun 24 ] |
Just FYI:
They could be considered as a way to radically resolve the current issue (for example, we could monitor even duplicated engineIDs), but after 2 days of thinking about this I tend to think it would not be very good. |
Comment by Andris Zeila [ 2014 Jun 26 ] |
I don't think we can rely on the status code returned by snmp_synch_response(), because 2 is the timeout code (STAT_TIMEOUT) and 1 is the error code (STAT_ERROR). If the device is down during connection, snmp_synch_response() will return STAT_ERROR (because snmpv3_engineID_probe() fails with an "unable to determine remote engine ID" error). But in the case of a real timeout snmp_synch_response() should return STAT_TIMEOUT (haven't tested it though), and there is no easy way to tell whether it was a timeout or a usmStatsNotInTimeWindows error. At least not at the higher level - we might get something by using lower-level functionality (basically reimplementing snmp_synch_response()) or by providing a custom security model which would store the error code.
I took a look at the net-snmp code - usmStatsNotInTimeWindows simply causes snmp_parse() to fail with SNMPERR_USM_NOTINTIMEWINDOW, which is not stored/returned anywhere. So maybe dropping the engine from the cache is the right way. Though on a quick test I had no luck with free_enginetime(). free_etimelist() was working as expected, but that might be an overkill.
wiper I found out that I was using the wrong engineID with free_enginetime(), so that explains why I could not reset the time stats with it.
zalex_ua fixed two typos in comment above: :->", snmp_sync_response->snmp_synch_response |
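As a rough illustration of the "drop the engine from the cache" idea (not actual Zabbix or net-snmp patch code; the helper name and failure condition are invented, and it assumes the remote engineID is already known, e.g. remembered from an earlier successful response):

```c
/* Hypothetical helper: forget the cached boots/time for one engine after a
 * suspicious timeout, so the next request re-learns them from the device.
 * Assumes the caller already knows the remote engineID. */
#include <net-snmp/net-snmp-config.h>
#include <net-snmp/net-snmp-includes.h>
#include <net-snmp/library/lcd_time.h>

static void forget_engine_on_timeout(int status, u_char *engine_id, size_t engine_id_len)
{
    if (STAT_TIMEOUT == status)
    {
        /* Drops only this engine's etimelist entry; free_etimelist() would wipe
         * the cached times for every monitored device, which is likely overkill. */
        free_enginetime(engine_id, engine_id_len);
    }
}
```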
Comment by Oleksii Zagorskyi [ 2014 Jun 27 ] |
One interesting point which is not very noticeable is the "swapped" values of status from zabbix's and the library's points of view. I already mentioned that:

*** Zabbix: ***
-- Real timeout:
28452:20140619:173453.200 zbx_snmp_get_values() snmp_synch_response() status:1 errstat:-1 mapping_num:1
-- usmStatsNotInTimeWindows report:
27661:20140619:172253.954 zbx_snmp_get_values() snmp_synch_response() status:2 errstat:-1 mapping_num:1
added: see my comment dated 2015 Jan 16 01:24 for such lines in recent zabbix versions.

wiper has mentioned above:
*** library ***
snmp_client.h:

struct synch_state {
    int waiting;
    int status;
    /*
     * status codes
     */
#define STAT_SUCCESS 0
#define STAT_ERROR 1
#define STAT_TIMEOUT 2

By "swapped" values I mean these resulting matches:
#define STAT_ERROR 1 = zabbix real timeout
#define STAT_TIMEOUT 2 = zabbix usmStatsNotInTimeWindows report
Just need to be careful not to get confused.

I'm not so strong in libnetsnmp internals, so I want to explain.
/*
* PDU types in SNMPv2u, SNMPv2*, and SNMPv3
*/
#define SNMP_MSG_REPORT (ASN_CONTEXT | ASN_CONSTRUCTOR | 0x8) /* a8=168 */
and in my screenshot above we can see that the "data" starts with 0xa8, so I suppose that is how it is detected. So, I think zabbix should detect that this actually was a "report" response from the snmp agent, parse it and print it to the zabbix server log (as I suggested, with engineID, engineBoot, engineTime and the OID from the report).

Other details: in the net-snmp source code I see this:

snmpv3_make_report(netsnmp_pdu *pdu, int error)
{
    long ltmp;
    static oid      unknownSecurityLevel[] = { 1, 3, 6, 1, 6, 3, 15, 1, 1, 1, 0 };
    static oid      notInTimeWindow[] = { 1, 3, 6, 1, 6, 3, 15, 1, 1, 2, 0 };
    static oid      unknownUserName[] = { 1, 3, 6, 1, 6, 3, 15, 1, 1, 3, 0 };
    static oid      unknownEngineID[] = { 1, 3, 6, 1, 6, 3, 15, 1, 1, 4, 0 };
    static oid      wrongDigest[] = { 1, 3, 6, 1, 6, 3, 15, 1, 1, 5, 0 };
    static oid      decryptionError[] = { 1, 3, 6, 1, 6, 3, 15, 1, 1, 6, 0 };
    oid            *err_var;
    int             err_var_len;
    int             stat_ind;
    struct snmp_secmod_def *sptr;
    ...

I guess these are the OIDs which net-snmp (as an agent) can use in reports.

About the "dropping the engine from cache" approach - currently I would not want us to use it for all cases.
Yeah, it would be a pretty nice and cool solution for case b) (incorrect agent behavior after reboot), because all pollers which store "outdated/wrong" snmp cache entries would after some period use only correct ones, and that's cool.
But for case a) (duplicate engineIDs) it will produce almost the same picture, depending on when and under what conditions we call free_enginetime() on a failed session.
The free_etimelist() approach should not be considered because it would break the ZBXNEXT-2352 idea |
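As an illustration of the suggestion above, here is a hypothetical sketch (not a proposed patch; with snmp_synch_response() the report PDU never reaches this level, as wiper explains in the next comment) of how a received PDU could be recognized as a usmStatsNotInTimeWindows report so that its details could be logged:

```c
/* Hypothetical check (not Zabbix source): recognize a usmStatsNotInTimeWindows
 * report PDU, assuming a lower-level receive path makes the report available. */
#include <net-snmp/net-snmp-config.h>
#include <net-snmp/net-snmp-includes.h>

static const oid usm_not_in_time_windows[] = { 1, 3, 6, 1, 6, 3, 15, 1, 1, 2, 0 };

static int is_not_in_time_windows_report(const netsnmp_pdu *pdu)
{
    const netsnmp_variable_list *var;

    if (NULL == pdu || SNMP_MSG_REPORT != pdu->command)     /* 0xa8 on the wire */
        return 0;

    for (var = pdu->variables; NULL != var; var = var->next_variable)
    {
        if (0 == snmp_oid_compare(var->name, var->name_length,
                usm_not_in_time_windows, sizeof(usm_not_in_time_windows) / sizeof(oid)))
            return 1;   /* here zabbix could also log engineBoots/engineTime from the report */
    }

    return 0;
}
```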
Comment by Andris Zeila [ 2014 Jun 27 ] |
As I said, we would have to use a lower-level API. The snmp_synch_response() function we are using considers the usmStatsNotInTimeWindows response a parsing error and keeps waiting for the right packet. Then it times out and returns a NULL pdu - so there is nothing for us to examine.
I think dropping the engine from the cache would be fine in the usmStatsNotInTimeWindows situation. Otherwise the buggy devices would be inaccessible until server/proxy restart anyway. But if we decided to simply drop the engine from the cache in the case of STAT_TIMEOUT, then we would need the engine caching you were talking about in ZBXNEXT-2352 (otherwise we would not know which engine ID to drop). |
Comment by Oleksii Zagorskyi [ 2014 Jun 27 ] |
Monitoring of buggy devices (case b)) is NOT a goal of the current issue report.
hmm, I cannot imagine why we need ZBXNEXT-2352 to drop a particular engineID when we already got a usmStatsNotInTimeWindows report/response.
just in case - here is an example with debugging by the "lcd" token:

# snmpget -v 3 -a MD5 -A publicV3 -l authPriv -u publicV3 -X publicV3 localhost .1.3.6.1.2.1.1.1.0 -Dlcd
registered debug token lcd, 1
registered debug token lcd, 1
-- next line is related to the manager (running the snmpget command, ignore it)
lcd_set_enginetime: engineID 80 00 1F 88 80 CB 3E 29 01 05 12 A3 53 00 00 00 00 : boots=1, time=0
lcd_set_enginetime: engineID 80 00 1F 88 80 93 4E DE 66 FB 80 95 53 00 00 00 00 : boots=0, time=0
lcd_set_enginetime: engineID 80 00 1F 88 80 93 4E DE 66 FB 80 95 53 00 00 00 00 : boots=22, time=8152
lcd_get_enginetime: engineID 80 00 1F 88 80 93 4E DE 66 FB 80 95 53 00 00 00 00 : boots=22, time=8152
lcd_set_enginetime: engineID 80 00 1F 88 80 93 4E DE 66 FB 80 95 53 00 00 00 00 : boots=0, time=0
lcd_get_enginetime_ex: engineID 80 00 1F 88 80 93 4E DE 66 FB 80 95 53 00 00 00 00 : boots=0, time=0
lcd_set_enginetime: engineID 80 00 1F 88 80 93 4E DE 66 FB 80 95 53 00 00 00 00 : boots=22, time=8152
SNMPv2-MIB::sysDescr.0 = STRING: Linux it0 3.14-1-amd64 #1 SMP Debian 3.14.4-1 (2014-05-13) x86_64 |
Comment by Andris Zeila [ 2014 Jun 27 ] |
What I meant is that if we decide to use the current request sending/receiving implementation (basically the snmp_synch_response function), then we are unable to catch the usmStatsNotInTimeWindows response. All we get is a timeout error with a NULL response. So if we wanted to drop the engine from the cache in this situation, we would have to remember the engine from the last successful response - hence ZBXNEXT-2352 and engineID caching. |
Comment by Oleksii Zagorskyi [ 2014 Jul 05 ] |
I had a chance to check one production environment and want to share some details, though not all of them are directly related to the current issue.
I took an snmp dump for 10 minutes and there were 5 usmStatsNotInTimeWindows responses.
In Wireshark I filtered by "(snmp.name == 1.3.6.1.2.1.1.3.0) && (snmp.data == 2)" (responses with the sysUpTime OID) and compared EngineTime and sysUpTime, which logically should be the same.
Another Fortinet device: EngineTime 30280445, sysUpTime 33948713 - a difference of 12%!
Yet another Fortinet device has EngineBoot = 0. Not sure it is increased after reboot, so it could be this issue's case b).
Conclusion - Fortinet devices internally store/calculate EngineTime with very poor precision; EngineTime is always less than sysUpTime!
If compared with Cisco ...
Checked another environment:
Returning to the Fortinet behavior - it cannot be the same as for the Dlink above, because: |
Comment by Oleksii Zagorskyi [ 2014 Jul 05 ] |
After 2 days (at the server's 12:34:57.8) I've checked the Fortinet device which had EngineTime=30280445, sysUpTime=33948713 (a difference of 12%!).
If we calculate the differences:
EngineTime diff - 168833
sysUpTime diff - 170955
server's time diff - 170959
I'd expect the same 12% difference, but it's only 1.2%, heh ... |
Comment by Kay Avila [ 2015 Jan 07 ] |
I've seen something recently that results in the same behavior, but may have a different cause. I've noticed Cisco ASA devices queried via SNMPv3 function well for a while, and then start intermittently failing to be detected as online by Zabbix, even without changes to the configuration (including no new devices being added). One ASA device started flapping in the logs about 12:16 am on the 4th -

32460:20150104:000730.498 executing housekeeper
32460:20150104:000747.895 housekeeper [deleted 1872 hist/trends, 0 items, 0 events, 0 sessions, 0 alarms, 0 audit items in 17.358250 sec, idle 1 hour(s)]
32440:20150104:001611.197 SNMP agent item [removed] on host "asa" failed: first network error, wait for 15 seconds
32443:20150104:001626.847 resuming SNMP agent checks on host "asa": connection restored
32439:20150104:001711.209 SNMP agent item [removed] on host "asa" failed: first network error, wait for 15 seconds
32443:20150104:001726.993 resuming SNMP agent checks on host "asa": connection restored
32438:20150104:001811.355 SNMP agent item [removed] on host "asa" failed: first network error, wait for 15 seconds
32443:20150104:001826.148 resuming SNMP agent checks on host "asa": connection restored

When I started troubleshooting with packet captures on the 6th, I see get-requests and get-next-requests succeeding with get-responses. These packets from the Zabbix server have the correct snmpEngineBoots and snmpEngineTime. However, occasionally (for a different MIB and presumably a different polling process), the snmpEngineTime is 4294952 seconds ahead of the correct value, and the device will reply with a report of 1.3.6.1.6.3.15.1.1.2.0 (as mentioned above). Looking at the currently reported snmpEngineTime and using date --date='$x seconds ago', I get a value right around 00:15 on 1/4, which is when snmpEngineTime presumably either rolled over or the SNMP engine was reinitialized for some reason. When this happened, it also presumably incremented snmpEngineBoots from 3 to 4, but I can't confirm that. I do see the correct value of 4 for snmpEngineBoots in both the correct and incorrect requests, though.
The Zabbix server then marks the items as down rather than not supported, as discussed above, and so I get very little (if any) data back from the device until I go in and restart the Zabbix service, solving the issue temporarily. I've seen this behavior on quite a few ASA devices, but haven't taken the time to dig into it.
Hope this helps add another piece to the puzzle. I'd love to see this issue get addressed. |
Comment by Oleksii Zagorskyi [ 2015 Jan 07 ] |
Kay, thanks for your comment. Not every user wants to dig into such details, so I want to say thank you.
Note that after a device has rolled over snmpEngineTime and has increased snmpEngineBoots, when the zabbix server tries to get an snmp value from the device, the zabbix server (a particular process) will update its own snmpEngineTime and snmpEngineBoots with the new, correct values during the same snmp session. And the value will be successfully received without any errors in the zabbix server log. So not everything in your assumptions is correct.
You mentioned get-next-requests, so there could be LLD or dynamic-index snmp items. But that's not so important. Remember that you have to mention the zabbix server version in such tests.
If the issue is critical for you and you cannot figure out what's wrong and how to resolve it, you may consider http://www.zabbix.com/services.php |
Comment by Kay Avila [ 2015 Jan 08 ] |
Thanks for the response, especially so quickly. Your screenshot makes a lot of sense. In my case, it's like some of the processes rolled over correctly, but others did not, and those (that?) one(s) cause the server to be marked down. Let me show you what I'm seeing: These just loop repeatedly between the queries with the correct engine-time and the ones with the invalid time. And this started at the same time as it rolled over on the Cisco device. Yes, the get-next-requests are expected, since I'm using a lot of LLD (which I find invaluable, as a side note). |
Comment by Oleksii Zagorskyi [ 2015 Jan 08 ] |
Try to snmpget that ASA device with the "-Dlcd" option (you can search for this option on this page); in this way make sure that the ASA's snmpEngineTime and snmpEngineBoots change correctly for all "edge" cases. Don't mix them up with the manager's (snmpget's) values. |
Comment by Kay Avila [ 2015 Jan 14 ] |
Sorry for the slow response on this one. This is what I see for snmpget. The engineIDs all match, but some lines show boots/time and some do not:

lcd_set_enginetime: engineID <snipped> : boots=0, time=0
lcd_set_enginetime: engineID <snipped> : boots=0, time=0
lcd_get_enginetime: engineID <snipped> : boots=0, time=0
lcd_set_enginetime: engineID <snipped> : boots=0, time=0
lcd_get_enginetime_ex: engineID <snipped> : boots=0, time=0
lcd_set_enginetime: engineID <snipped> : boots=9, time=106039
lcd_get_enginetime: engineID <snipped> : boots=9, time=106039
lcd_set_enginetime: engineID <snipped> : boots=0, time=0
lcd_get_enginetime_ex: engineID <snipped> : boots=9, time=106039
lcd_set_enginetime: engineID <snipped> : boots=9, time=106039
iso.3.6.1.2.1.1.1.0 = STRING: "Cisco Adaptive Security Appliance Version <snipped>"

Both a hard and a soft reset result in the boot number incrementing by one and the time starting over at zero. All the engineIDs matched (except the manager one that I removed from the output, of course). I have noticed that removing the configuration for the snmp server on an ASA will result in setting the number of boots and the time back to zero. That didn't occur before the failure in this case, however. |
Comment by Oleksii Zagorskyi [ 2015 Jan 16 ] |
Kay, I cannot help you anymore, sorry. |
Comment by Oleksii Zagorskyi [ 2015 Jan 16 ] |
After
-- a successful get:
10461:20150116:004517.606 zbx_snmp_get_values() snmp_synch_response() status:0 s_snmp_errno:0 errstat:0 mapping_num:1
-- real timeout:
11132:20150116:012123.904 zbx_snmp_get_values() snmp_synch_response() status:1 s_snmp_errno:-24 errstat:-1 mapping_num:1
-- usmStatsNotInTimeWindows report:
10461:20150116:005523.184 zbx_snmp_get_values() snmp_synch_response() status:2 s_snmp_errno:-24 errstat:-1 mapping_num:1

I hoped that the newly added "s_snmp_errno:-24" would provide something interesting.
#define SNMPERR_TIMEOUT (-24)
Not sure, maybe it could still be useful? ... |
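For context, a minimal sketch (an assumption of how such values are typically obtained with net-snmp's synchronous API, not an excerpt from the Zabbix sources) of where the three logged fields come from, and why s_snmp_errno alone cannot separate the two failure cases:

```c
/* Sketch only: how status, s_snmp_errno and errstat are typically obtained
 * around snmp_synch_response(); "ss" is a session opened with snmp_open(). */
#include <net-snmp/net-snmp-config.h>
#include <net-snmp/net-snmp-includes.h>
#include <stdio.h>

static void poll_once(netsnmp_session *ss, netsnmp_pdu *pdu)
{
    netsnmp_pdu *response = NULL;
    int   status = snmp_synch_response(ss, pdu, &response); /* STAT_SUCCESS/STAT_ERROR/STAT_TIMEOUT */
    long  errstat = (NULL != response ? response->errstat : -1);

    /* Both a real timeout and a usmStatsNotInTimeWindows report end up with a NULL
     * response and s_snmp_errno == SNMPERR_TIMEOUT (-24), so only "status" differs. */
    fprintf(stderr, "status:%d s_snmp_errno:%d errstat:%ld\n",
            status, ss->s_snmp_errno, errstat);

    if (NULL != response)
        snmp_free_pdu(response);
}
```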
Comment by Kay Avila [ 2015 Jan 20 ] |
Hi Oleksiy, I'm not looking for tech support. This seems related to the behavior in the bug here and I'm trying to help provide an example. Do you still think it's a device issue rather than a Zabbix one? |
Comment by Oleksii Zagorskyi [ 2015 Jan 20 ] |
Kay, what is "wrong" in zabbix and how it could be better - I described that initially in this issue. It's not possible to answer your last question about which is at fault: zabbix or the device. I don't think we need to continue this discussion here. |
Comment by Vadim Nesterov [ 2015 Sep 21 ] |
Kay, have you solved your problem? Because I have the same with zabbix 2.4: devices become unavailable and I don't know how to fix this. |
Comment by Vadim Nesterov [ 2015 Oct 04 ] |
@Oleksiy Zagorskyi, we use zabbix 2.4.6 with net-snmp-libs-5.7.2 X and we can't poll data from ARISTA, JUNIPER, EDGECORE over snmp v3. And this problem is not solved yet. |
Comment by Oleksii Zagorskyi [ 2015 Oct 04 ] |
Vadim, your question sound more like a support request, see http://zabbix.org/wiki/Getting_help |
Comment by Vadim Nesterov [ 2015 Oct 04 ] |
Oleksiy, how can my question sound like a support request if I have problems with 3 types of network devices? Have you tested Zabbix with net-snmp-libs-5.7? So if I say that I have a problem with snmp v3 aes128 checks and hosts become unavailable, I think it is a bug. |
Comment by Oleksii Zagorskyi [ 2015 Oct 05 ] |
net-snmp v5.7 has been mentioned as working with zabbix 3 years ago in
If you are sure zabbix has some bug, please create a new bug report, following these rules: http://zabbix.org/wiki/Docs/bug_reporting_guidelines. |
Comment by Juliana Oliveira Martins [ 2016 Jun 08 ] |
Good morning, I noticed the same problem with CheckPoint security equipment. |
Comment by Oleksii Zagorskyi [ 2017 Feb 08 ] |
I have a case where Palo Alto Networks PA-7000 series firewall always had snmpEngineBoots=1 after device reboot. |
Comment by richlv [ 2017 Mar 28 ] |
could |
Comment by Oleksii Zagorskyi [ 2017 Mar 28 ] |
Rich, will take a look and reply there. |
Comment by Oleksii Zagorskyi [ 2017 Apr 18 ] |
In the |
Comment by Janne Korkkula [ 2017 May 02 ] |
As Re. With sufficient criteria met (should the Zabbix hostid also be part of the mix?), disabling the host/hosts and providing a meaningful error message in the (level 3 or less) server/proxy logs - preferably with a list of IPs with conflicting engineIDs - and also an appropriate hint in the UI tooltip for the disabled host/hosts in the frontend, would be most helpful and might save a lot of time. |
Comment by Oleksii Zagorskyi [ 2018 Apr 20 ] |
|
Comment by Oleksii Zagorskyi [ 2019 Sep 25 ] |
Got confirmation that "Cisco UCS server" always returns boots=1. |
Comment by Oleksii Zagorskyi [ 2019 Oct 05 ] |
Just for the record, as it's a dirty hack. I don't recommend doing that. |
Comment by Andrei Voinovich [ 2020 Apr 27 ] |
The deepest investigation of this issue I have ever found! And still no solution; that's sad. Experiencing the same problems with tons of Cisco Nexus devices. |
Comment by Andrei Voinovich [ 2020 Apr 29 ] |
We found a workaround and implemented it in a few minutes:
After these steps there are no more graph gaps or "network" failures in the logs. |
Comment by Oleksii Zagorskyi [ 2020 Apr 29 ] |
That sounds more like a fun |
Comment by Andrei Voinovich [ 2020 Apr 30 ] |
I'd rather say it sounds sad but true; we have a mix of problems:
But we need proactive monitoring 24x7x365, so any solution is better than nothing; moreover, proxies offload the main server, which is a nice bonus. Rolling back to SNMPv2 is not a solution due to security concerns. |
Comment by Janne Korkkula [ 2020 Apr 30 ] |
Fixing the engineIDs did it for us, no problems since. Good luck with getting a support contract for 2480 proxies with that workaround... Zabbix WILL appreciate it. |
Comment by Andrei Voinovich [ 2020 Jun 02 ] |
Still digging into the problem; it looks like devices respond with msgAuthoritativeEngineBoots and msgAuthoritativeEngineTime zeroed due to EngineID discovery: https://tools.ietf.org/html/rfc5343, 3.2. EngineID Discovery: "Discovery of the snmpEngineID is done by sending a Read Class ..."
So it means at least that Zabbix should not discover the snmpEngineID very often. However, I did not find how the device itself should respond in this step in the msgAuthoritativeEngineBoots and msgAuthoritativeEngineTime fields. After looking at packet dumps more thoroughly, it looks like each time the engineID is negotiated, the device responds with zeroed time fields (not sure whether it should or should not do that). At the same time, looking at a manual snmpbulkwalk dump shows that only the response to the engineID probe has zeroed fields, while the subsequent sub-queries are responded to with normal time fields. |
Comment by Andrei Voinovich [ 2020 Jun 02 ] |
Janne Korkkula, how did you fix the engineIDs? If you mean duplicate IDs for different devices - we do not have any. I also thought about setting engineIDs for each device in Zabbix, but did not find a way to do it. |
Comment by Oleksii Zagorskyi [ 2020 Jun 02 ] |
Andrei, the Zabbix server does not do a walk, so do not compare with it. Please don't mix 2 cases here - correct EngineID learning and the usmStatsNotInTimeWindows report response - they are different. |
Comment by Andrei Voinovich [ 2020 Jun 03 ] |
Oleksii, ok, let's look at a packet dump: This was captured on the Zabbix server. Bulk requests are turned on for the host, but Zabbix does not seem to use them; also it does not seem that Zabbix has cached the engineID, otherwise it would reuse it, but it discovers the engineID on every request. Again, I am not saying that the device sets correct time fields; I am saying that Zabbix should cache the engineID, while it does not. |
Comment by Oleksii Zagorskyi [ 2020 Jun 03 ] |
Don't think that things are so simple. Remember that zabbix is a multi-process application. As for caching - I created ZBXNEXT-2352 quite a long time ago. |
Comment by Andrei Voinovich [ 2020 Jun 03 ] |
I know that Zabbix is multi-process and that is why we deployed additional proxies and configured snmp pollers to start only 1 on each proxy, which solved the problem (in fact masked it). I do believe in your troubleshooting skills, but I am not here to discuss them - I want to get to the truth: who is at fault and what to do? In parallel I am talking with the network vendor's support. |
Comment by thomas [ 2022 Jun 01 ] |
Hello Zabbix team. For your information, the issue is seen on the Cisco Nexus 7000 platform after an upgrade. The problem is resolved after rebooting the Zabbix server (all-in-one installation). Any chance to see a different net-snmp library used in a future Zabbix release? Best regards, |
Comment by Oleksii Zagorskyi [ 2022 Jun 02 ] |
Not sure there is any other library which provides snmp support. |
Comment by thomas [ 2022 Jun 02 ] |
Thanks a lot, Oleksii! I didn't find |
Comment by Dmitrijs Goloscapovs [ 2024 Mar 25 ] |
In 7.0+, unreachable pollers are not used for the new snmp checks (walk[... / get[...); this could help with this issue; see ZBXNEXT-8460. |
Comment by Vladislavs Sokurenko [ 2024 Mar 25 ] |
Closing this, as unreachable pollers are no longer used with walk and get, as per the above comment in