[ZBX-9123] Some SNMP hosts don't become available after being unavailable till zabbix server restart Created: 2014 Dec 08 Updated: 2019 Dec 04 Resolved: 2017 Apr 12 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Server (S) |
Affects Version/s: | 2.2.7 |
Fix Version/s: | None |
Type: | Incident report | Priority: | Major |
Reporter: | Ivan Prokudin | Assignee: | Unassigned |
Resolution: | Duplicate | Votes: | 2 |
Labels: | snmpv3, unavailable | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified | ||
Environment: |
Debian 7 wheezy amd64, snmp v3 thru internet |
Issue Links: |
|
Description |
I monitor network devices through SNMPv3 over the internet. I use authPriv with AES/DES encryption and SHA authentication. When power is lost or the internet link to my network goes down, the SNMP devices become unavailable in Zabbix. After the connection is restored, some of the devices sometimes don't become available again until the Zabbix server is restarted. I can't simulate this on purpose because it doesn't happen consistently. When it happens I get: and for some devices it's the last message until the Zabbix server restart. After restarting the Zabbix server I get: |
Comments |
Comment by Aleksandrs Saveljevs [ 2014 Dec 08 ] |
When a device becomes unavailable, it is then monitored by unreachable pollers. What is the value of StartPollersUnreachable in your configuration file? |
Comment by Ivan Prokudin [ 2014 Dec 08 ] |
It's 5. I know about this setting so I've tried to change from default of It doesn't seem to fix the problem. |
Comment by Aleksandrs Saveljevs [ 2014 Dec 09 ] |
Do the items get polled after host becomes unavailable? That is, does tcpdump show any SNMP traffic attempt between Zabbix and these devices? When the network is down, all devices affected by this network outage become unavailable. When the network is restored, some devices become available again (with all items being polled on these devices) and some devices do not become available (no items get update information). You only observe this problem with SNMPv3 hosts and no other hosts (e.g., SNMPv2). Is this understanding correct? On the affected devices, are there only SNMPv3 items? If you add ICMP ping items to these hosts, for instance, are they affected as well? |
Comment by Ivan Prokudin [ 2014 Dec 09 ] |
> Do the items get polled after host becomes unavailable? That is, does tcpdump show any SNMP traffic attempt between Zabbix and these I'll answer as soon as some devices become not available again and bug will be reproduced. > You only observe this problem with SNMPv3 hosts and no other hosts (e.g., SNMPv2). Is this understanding correct? Talking about snmp - I just have no not SNMPv3 of all snmp devices cause I monitor them thru internet and I need encryption and authorization. > On the affected devices, are there only SNMPv3 items? If you add ICMP ping items to these hosts, for instance, are they affected as well? I have ICMP ping item on one of devices and it's working well even then snmpv3 tells that devices is unavailable. By the way if I try snmpwalk\snmpget when zabbix says device is unavailable it succesfully retrieve information. |
Comment by Ivan Prokudin [ 2014 Dec 10 ] |
I'm not sure if it's related to this bug, but I just had such situation: |
Comment by Aleksandrs Saveljevs [ 2014 Dec 11 ] |
A similar issue with items not being polled was recently reported as Debugging the original and the last problem that you described, it would be a bit easier with DebugLevel=4 logs. Note that since Zabbix 2.4.0 log level can be changed at runtime - see http://blog.zabbix.com/zabbix-2-4-features-part-6-runtime-loglevel-changing/3653/ . In this case, it would make sense to increase logging level for pollers, unreachable pollers, and history syncers (because they evaluate triggers). If you wish to send me anything, feel free to do so at <my-first-name>.<my-last-name>@zabbix.com. |
Comment by Jonathan Rioux [ 2016 May 31 ] |
I have the exact same issue right now with 3.0.2. I use SNMPv3 + ICMP on a firewall. Whenever the firewall becomes unavailable (after the UnreachablePeriod), it never becomes available again until I restart the Zabbix server. The firewall suffers from a non-stop reboot issue that lasts for an hour or so. Still, once the firewall is back online for good, I manually checked that snmpwalk and ICMP work fine against the firewall, but Zabbix won't see the firewall as available again until the Zabbix server is restarted. Even though Zabbix detects that the firewall is online using ICMP, it won't get SNMP data. In short, Zabbix won't tell me the firewall has been rebooted (using sysUpTime) until I restart the Zabbix server. Here is the log: 955:20160529:063806.194 SNMP agent item "sysUpTime" on host "firewall" failed: first network error, wait for 15 seconds 959:20160529:063852.804 SNMP agent item "ifOutErrors[ethernet1/5]" on host "firewall" failed: another network error, wait for 15 seconds 960:20160529:063937.494 temporarily disabling SNMP agent checks on host "firewall": host unavailable |
Comment by Aleksandrs Saveljevs [ 2016 Jun 01 ] |
If you enable DebugLevel=4 or tcpdump, can you see Zabbix trying to do something about these SNMP items after firewall has rebooted? In other words, is the problem that Zabbix does not even attempt to check these items or it somehow fails when trying to? |
Comment by Jonathan Rioux [ 2016 Jun 01 ] |
Last night there was a 10-minute power outage (affecting only a firewall, not the Zabbix server) and Zabbix marked the firewall as unavailable. Since then, SNMP on that firewall has not worked, and it still doesn't work this morning (ping works though), so I decided to troubleshoot it. Note that this time it's a different firewall, not the same one as in my first post, so I guess it's an issue with any sort of SNMP device. I ran "log_level_increase" (debug level 4) at runtime and got these logs: logfile with debuglevel=4 32684:20160601:112942.245 In get_values_snmp() host:'192.168.1.1' addr:'192.168.1.1' num:1 32684:20160601:112942.305 SNMPv3 [[email protected]:161] 32684:20160601:113012.346 getting SNMP values failed: Timeout while connecting to "192.168.1.1:161". Then, from the same server (the Zabbix server), I ran snmpwalk to confirm that SNMP is working on the firewall: snmpwalk result root@xxxx:/root# snmpwalk -v3 -a SHA -A xxxx -x DES -X xxxx -u zabbix -l authPriv 192.168.1.1 SNMPv2-MIB::sysUpTime.0 SNMPv2-MIB::sysUpTime.0 = Timeticks: (5653400) 15:42:14.00 So SNMP works fine; the firewall is not the problem. Then I ran tcpdump while the Zabbix server tried to query the firewall over SNMP; here is the output: tcpdump output
root@xxxxx:/root# tcpdump dst 192.168.1.1
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
11:37:23.266891 IP xxxxx > 192.168.1.1: ICMP echo request, id 10641, seq 1, length 64
11:37:24.266898 IP xxxxx > 192.168.1.1: ICMP echo request, id 10641, seq 7, length 64
11:37:25.267884 IP xxxxx > 192.168.1.1: ICMP echo request, id 10641, seq 13, length 64
11:37:27.391021 IP xxxxx.54714 > 192.168.1.1.snmp: F=r U= E= C= GetRequest(14)
11:37:27.404469 IP xxxxx.54714 > 192.168.1.1.snmp: F=apr U=zabbix [!scoped PDU]27_49_55_1f_57_d0_b7_36_95_6c_e2_1d_a8_05_0e_74_ef_41_fb_30_44_6c_bc_b2_e3_db_c4_a5_9f_80_91_32_a3_91_7e_ec_47_44_6d_71_2c_1e_c9_00_5e_1e_34_a2_8d_0c_b3_6c_24_9d_14_e7_cc_66_1d_ad_ee_b3_6d_37
11:37:42.395988 IP xxxxx.34127 > 192.168.1.1.snmp: F=r U= E= C= GetRequest(14)
11:37:42.409490 IP xxxxx.34127 > 192.168.1.1.snmp: F=apr U=zabbix [!scoped PDU]5a_d0_78_7f_16_7b_63_f1_1b_7f_e7_c6_b0_49_37_60_2f_51_02_bd_9a_be_89_be_65_b2_e9_04_9f_f8_de_f5_fa_d1_d0_21_2a_3e_3c_86_f9_7d_2d_ff_42_41_fd_b1_ed_32_01_3a_08_d5_d0_43_11_d3_5c_b8_d0_11_cd_10
11:37:42.414761 IP xxxxx.54714 > 192.168.1.1.snmp: F=apr U=zabbix [!scoped PDU]7d_b8_10_1d_9e_97_ab_91_85_90_17_db_6d_a1_03_c2_12_88_23_bb_2d_b9_31_1e_d8_23_41_78_41_2a_9b_34_0f_06_ef_77_8c_49_7e_f7_7b_50_a8_2a_aa_a3_e1_d8_c3_c9_13_df_6f_78_9e_44_61_a0_51_ca_3e_30_ed_a3
11:37:57.409912 IP xxxxx.49797 > 192.168.1.1.snmp: F=r U= E= C= GetRequest(14)
11:37:57.416074 IP xxxxx.34127 > 192.168.1.1.snmp: F=apr U=zabbix [!scoped PDU]c0_7c_d8_6e_b3_e6_f4_00_81_0d_f7_d7_15_d2_db_7a_c0_ed_71_f9_80_48_14_1c_84_6d_98_eb_e5_ff_98_75_08_9c_a2_15_3d_b7_b6_44_1b_ce_76_5d_3b_14_73_9c_db_d5_ce_2c_1e_28_49_9c_6b_2d_2b_60_e2_2a_ae_d0
11:37:57.423371 IP xxxxx.49797 > 192.168.1.1.snmp: F=apr U=zabbix [!scoped PDU]5e_1a_a5_ee_72_e4_42_56_74_54_7a_c2_27_ee_ad_9e_5d_11_e0_d9_a9_7c_72_a2_8c_c7_f5_6d_a8_75_c8_ba_77_b8_d0_8e_f7_de_51_0a_58_92_c7_2f_b0_95_b9_f2_be_c6_18_5b_6f_f0_c6_41_47_68_9a_d0_d8_15_df_6a
11:38:12.437055 IP xxxxx.49797 > 192.168.1.1.snmp: F=apr U=zabbix [!scoped PDU]51_af_14_85_99_df_e2_37_70_df_fc_cb_4d_b1_f0_6f_80_83_f6_33_1d_a9_c6_da_90_3a_60_b4_11_e7_10_8e_03_ba_d7_68_88_99_2d_83_94_8d_de_43_c7_bf_70_ad_69_b4_3d_0c_21_78_6b_27_7c_8b_88_47_86_05_e8_00
11:38:23.320582 IP xxxxx > 192.168.1.1: ICMP echo request, id 10905, seq 1, length 64
11:38:24.320584 IP xxxxx > 192.168.1.1: ICMP echo request, id 10905, seq 8, length 64
11:38:25.321515 IP xxxxx > 192.168.1.1: ICMP echo request, id 10905, seq 15, length 64
Then as soon as I restart the zabbix server daemon, the snmp is working again and I get the email from zabbix saying the firewall has been restarted... Screenshot from zabbix hosts page showing the firewall snmp times out: What are the next steps? |
Comment by Aleksandrs Saveljevs [ 2016 Jun 20 ] |
I have tried reproducing the problem with an SNMPv3 device by blocking and unblocking packets with iptables (with an optional device restart between blocking and unblocking), but the issue did not manifest. The tcpdump you provided shows packets traveling in one direction only. Did they travel in the other direction as well? Could you please also post the tcpdump in binary form (both during the problem and after the problem), so that we can inspect the packets and their differences more closely in Wireshark? |
Comment by Janne Korkkula [ 2016 Sep 05 ] |
We're also experiencing this exact same issue (temp.disable == gone for good) with 2.2.14 and SNMPv3 core switches, ours are monitored via a routed private network. I increased the timeouts to UnreachableDelay=30 and UnreachablePeriod=180, remains to be seen if it helps any, but of course it won't fix the issue itself. |
Comment by richlv [ 2016 Sep 05 ] |
jannek, those parameters are not timeouts. you should probably change them back. |
Comment by Janne Korkkula [ 2016 Sep 06 ] |
True they're not timeouts per se, but increasing UnreachablePeriod should give more time for the clients/network to recover, so at least in theory we should hit this very severe issue less frequently. As for UnreachableDelay of 30, 180/30=6, twice the count of connection attempts vs. the default 45/15, but I returned that value back to 15 since it doesn't matter much. We can live with a 180 vs. 45 s delay in having (all) hosts considered unavailable until this is fixed, but can't have core network components randomly lose all monitoring. |
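The arithmetic in the comment above can be sketched as follows. This is only an illustration of the ratio described there (UnreachablePeriod divided by UnreachableDelay); the helper name `retry_attempts` is hypothetical, and the actual scheduling inside Zabbix may differ.

```python
def retry_attempts(unreachable_period: int, unreachable_delay: int) -> int:
    """Rough number of re-checks that fit into UnreachablePeriod.

    Hypothetical helper: it only reproduces the period/delay ratio from the
    comment above, not the real Zabbix poller logic.
    """
    if unreachable_delay <= 0:
        raise ValueError("unreachable_delay must be positive")
    return unreachable_period // unreachable_delay


defaults = retry_attempts(45, 15)    # the defaults quoted in the comment
tweaked = retry_attempts(180, 30)    # the values first tried in this thread
reverted = retry_attempts(180, 15)   # after UnreachableDelay went back to 15
print(defaults, tweaked, reverted)
```

With UnreachableDelay back at 15, the longer UnreachablePeriod allows more re-checks before the host is declared unavailable, which is the effect the comment is after.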
Comment by richlv [ 2016 Sep 06 ] |
got it. note that there's no indication of what could be wrong yet - you'd probably have to debug this yourself, otherwise it is unlikely to be fixed (if there's anything to be fixed at all) |
Comment by Janne Korkkula [ 2016 Sep 07 ] |
Just to keep you up to date, we migrated SNMP monitoring (both v2 and v3) to one dedicated proxy. Reduces load, enables local settings for UnreachableDelay and UnreachablePeriod and makes debugging a bit more convenient. Of course there's nothing to debug right now, that'd be too easy. |
Comment by Ivan Prokudin [ 2017 Jan 26 ] |
Hello! I've mostly resolved the issue by setting a different engine-id on each of my routers (shame on me for missing such a simple thing). But very rarely I still get the issue. I've updated my Zabbix to 3.2.3, and I use MikroTik routers with the latest RouterOS. Right now I have one router stuck in the "host unavailable" state, and I'm not restarting the Zabbix server so that I can investigate. I've just started tcpdump to collect traffic between it and Zabbix. Sorry, but I can't share the dump publicly; please tell me how and to whom I can send it privately. Also please tell me what information is needed. Some current information: Zabbix got no information from the host after these log records: |
Comment by Janne Korkkula [ 2017 Jan 26 ] |
Excellent that you've got the dump running, maybe it'll give some clues. With the dedicated SNMP proxy with tweaked UnreachableDelay and UnreachablePeriod settings everything has worked quite well, but after a Serious Fubar Condition last week the issue surfaced as intermittent SNMPv3 monitoring failures without any apparent reason. Restarting the proxy fixed everything. UnreachablePollers occasionally hit 90-100% busy with these settings: DataSenderFrequency=1 (I just raised StartPollersUnreachable -> 50 (and StartPingers -> 5), let's see what happens.) |
Comment by Ivan Prokudin [ 2017 Feb 03 ] |
Ping for zabbix developers. I have pretty dump of snmp packets for you 2,5MB large. Please pay a little attention here and tell me whom can I send it privately to resolve this issue. |
Comment by Ivan Prokudin [ 2017 Feb 10 ] |
Hello! It happens consistently after device reboots. It may be a MikroTik issue (I only have their devices with SNMP) or not, but I definitely need help from the Zabbix developers. May I have your attention please? |
Comment by Vladimir Dovgopol [ 2017 Mar 27 ] |
Hi team, Some logs: I hit this issue several times a week, if you need additional debug information I can provide it to you. |
Comment by richlv [ 2017 Mar 27 ] |
VladimirDovgopol, when the problem happens, what is printed in the logfile at debuglevel 4 for the unreachable poller process ? |
Comment by Jonathan Rioux [ 2017 Mar 27 ] |
@richlv, I already posted the log content at debuglevel 4 concerning this issue: comment-183260 logfile with debuglevel=4 32684:20160601:112942.245 In get_values_snmp() host:'192.168.1.1' addr:'192.168.1.1' num:1 32684:20160601:112942.305 SNMPv3 [[email protected]:161] 32684:20160601:113012.346 getting SNMP values failed: Timeout while connecting to "192.168.1.1:161". |
Comment by richlv [ 2017 Mar 27 ] |
jorioux, that seems to show a timeout on the device, which is unlikely to be a problem with zabbix |
Comment by Jonathan Rioux [ 2017 Mar 27 ] |
@richlv, with all due respect, the problem is definitely with Zabbix. While Zabbix is not able to poll the router with SNMP, when I run snmpwalk from the Zabbix server, it successfully retrieves the information! How can you explain that? |
Comment by Joerg Schwarzwaelder [ 2017 Mar 27 ] |
I have exactly the same issue. |
Comment by Ivan Prokudin [ 2017 Mar 27 ] |
I would like to ask all users that have the same issue to post which hardware they are monitoring by SNMP when the issue happens. As for me I have only mikrotik hardware. And also please vote for the bug cause zabbix team will not fix it for free if it's not interesting for many users. |
Comment by Joerg Schwarzwaelder [ 2017 Mar 27 ] |
we have this issue at least with Citrix Netscalers. |
Comment by Jonathan Rioux [ 2017 Mar 27 ] |
for me the issue is with a Juniper SSG140. |
Comment by Ivan Prokudin [ 2017 Mar 27 ] |
Do you use a Zabbix proxy or the server for sending SNMP requests? What OS does it run on? Does it happen only with SNMPv3 or also with earlier versions? To me, Zabbix seems to have some issues with SNMPv3. |
Comment by Joerg Schwarzwaelder [ 2017 Mar 27 ] |
straight from the Zabbix server running CentOS. We are using SNMPv3. |
Comment by Ivan Prokudin [ 2017 Mar 27 ] |
Please tell me exactly which CentOS version you use. |
Comment by Joerg Schwarzwaelder [ 2017 Mar 27 ] |
CentOS 6.8 |
Comment by Vladimir Dovgopol [ 2017 Mar 27 ] |
Firewalls: WatchGuard XTM equipment, different models (22, 33, 510, 525); we see this issue on all models. Protocol: SNMPv3. |
Comment by Ivan Prokudin [ 2017 Mar 27 ] |
So it seems that it doesn't depend on the hardware being monitored, the Linux distribution, or the net-snmp version, but happens only with SNMPv3. Do you all have a different engine-id set up on each of your devices? It's mandatory for SNMPv3. |
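The engine-id question above can be checked mechanically once the IDs have been collected. Below is a minimal sketch, assuming the engine IDs were already gathered by hand (for example from snmpget debug output); the host names and the helper `find_duplicate_engine_ids` are hypothetical, and the hex strings are only illustrative values in the format seen in this thread.

```python
from collections import defaultdict


def find_duplicate_engine_ids(engine_ids: dict[str, str]) -> dict[str, list[str]]:
    """Map each engine ID used by more than one host to the sorted host names."""
    hosts_by_id: dict[str, list[str]] = defaultdict(list)
    for host, engine_id in engine_ids.items():
        # Normalize spacing and case so "3a 8c" and "3A8C" compare equal.
        normalized = "".join(engine_id.split()).upper()
        hosts_by_id[normalized].append(host)
    return {eid: sorted(hosts) for eid, hosts in hosts_by_id.items() if len(hosts) > 1}


# Hypothetical inventory: two routers accidentally share one engine ID.
inventory = {
    "router-a": "80 00 3A 8C 04 72 30 32 74 69 63 6B",
    "router-b": "80 00 3a 8c 04 72 30 32 74 69 63 6b",
    "switch-c": "80 00 63 A2 80 2C 23 3A 41 D6 21 00 00 00 01",
}
duplicates = find_duplicate_engine_ids(inventory)
print(duplicates)
```

An empty result means every host in the inventory reports its own engine ID.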
Comment by Vladimir Dovgopol [ 2017 Mar 28 ] |
Yes, I have different engine-id on my devices. covs-sys-zabb1:/tmp# snmpwalk -v3 ..... 1.3.6.1.6.3.10.2.1.1.0 |
Comment by Joerg Schwarzwaelder [ 2017 Mar 28 ] |
same here: |
Comment by Janne Korkkula [ 2017 Mar 28 ] |
|
Comment by richlv [ 2017 Mar 28 ] |
could this be the same as |
Comment by Ivan Prokudin [ 2017 Mar 28 ] |
richlv, can't answer your question cause have no skills to understand |
Comment by Oleksii Zagorskyi [ 2017 Mar 28 ] |
Hi all, SNMPv3 troubleshooting lover here. If someone wants to send sensitive info, like a tcpdump, to the Zabbix team, you can send it to support at zabbix dot com and I'll check it myself. For devices which have the mysterious issue, please execute NOW the test described in the following comment and write down your result. Ivan, why don't we see a line with the "snmp_synch_response" text in the posted debug log? |
Comment by Ivan Prokudin [ 2017 Apr 01 ] |
Oleksiy, hello! Sorry, missed up that I need to execute snmpget command before the problem. So I've executed it while the problem exists and after it had gone away (zabbix-server restarted). # snmpget -v 3 -a SHA -A apass -l authPriv -u private -x AES -X epass hostname .1.3.6.1.2.1.1.1.0 -Dlcd registered debug token lcd, 1 lcd_set_enginetime: engineID 80 00 1F 88 80 6E ED 4D 4E ED A6 DF 58 00 00 00 00 : boots=1, time=0 lcd_set_enginetime: engineID 80 00 3A 8C 04 72 30 32 74 69 63 6B : boots=0, time=0 lcd_set_enginetime: engineID 80 00 3A 8C 04 72 30 32 74 69 63 6B : boots=0, time=0 lcd_get_enginetime: engineID 80 00 3A 8C 04 72 30 32 74 69 63 6B : boots=0, time=0 lcd_set_enginetime: engineID 80 00 3A 8C 04 72 30 32 74 69 63 6B : boots=0, time=0 lcd_get_enginetime_ex: engineID 80 00 3A 8C 04 72 30 32 74 69 63 6B : boots=0, time=0 lcd_set_enginetime: engineID 80 00 3A 8C 04 72 30 32 74 69 63 6B : boots=0, time=3133 SNMPv2-MIB::sysDescr.0 = STRING: RouterOS RB750Gr3 When problem had gone away: snmpget -v 3 -a SHA -A apass -l authPriv -u private -x AES -X epass host .1.3.6.1.2.1.1.1.0 -Dlcd registered debug token lcd, 1 lcd_set_enginetime: engineID 80 00 1F 88 80 6B 5A F7 5F 59 A8 DF 58 00 00 00 00 : boots=1, time=0 lcd_set_enginetime: engineID 80 00 3A 8C 04 72 30 32 74 69 63 6B : boots=0, time=0 lcd_set_enginetime: engineID 80 00 3A 8C 04 72 30 32 74 69 63 6B : boots=0, time=0 lcd_get_enginetime: engineID 80 00 3A 8C 04 72 30 32 74 69 63 6B : boots=0, time=0 lcd_set_enginetime: engineID 80 00 3A 8C 04 72 30 32 74 69 63 6B : boots=0, time=0 lcd_get_enginetime_ex: engineID 80 00 3A 8C 04 72 30 32 74 69 63 6B : boots=0, time=0 lcd_set_enginetime: engineID 80 00 3A 8C 04 72 30 32 74 69 63 6B : boots=0, time=3497 SNMPv2-MIB::sysDescr.0 = STRING: RouterOS RB750Gr3 I also will send an email to you on support at zabbix with two dumps - while problem exists and while zabbix server is restarting. 
I haven't posted any debug logs because it's impossible for me to produce them given the high load on Zabbix. And I can't reproduce the issue on any test system. |
Comment by Oleksii Zagorskyi [ 2017 Apr 01 ] |
I was right - it's the case b) This issue may be closed as duplicate. |
Comment by Ivan Prokudin [ 2017 Apr 01 ] |
Oleksiy, got it. It seems clear. It is strange, though, that so much different hardware (we have at least 3 people with different hardware in this topic) fails this way. But the RFC makes the issue clear. So I've just written to MikroTik with links to the bug and to |
Comment by Janne Korkkula [ 2017 Apr 03 ] |
Please reconsider carefully before closing this issue and/or Problem Child, no issues at the moment: registered debug token lcd, 1 lcd_set_enginetime: engineID 80 00 1F 88 80 27 94 4D 10 D9 08 E2 58 00 00 00 00 : boots=1, time=0 lcd_set_enginetime: engineID 80 00 63 A2 80 2C 23 3A 47 9C BA 00 00 00 01 : boots=0, time=0 lcd_set_enginetime: engineID 80 00 63 A2 80 2C 23 3A 47 9C BA 00 00 00 01 : boots=1, time=7278801 lcd_get_enginetime: engineID 80 00 63 A2 80 2C 23 3A 47 9C BA 00 00 00 01 : boots=1, time=7278801 lcd_set_enginetime: engineID 80 00 63 A2 80 2C 23 3A 47 9C BA 00 00 00 01 : boots=0, time=0 lcd_get_enginetime_ex: engineID 80 00 63 A2 80 2C 23 3A 47 9C BA 00 00 00 01 : boots=0, time=0 lcd_set_enginetime: engineID 80 00 63 A2 80 2C 23 3A 47 9C BA 00 00 00 01 : boots=1, time=7278801 SNMPv2-MIB::sysDescr.0 = STRING: HPE Comware Platform Software, Software Version 7.1.045, Release 2422P03 HPE 5900AF-48XG-4QSFP+ Switch Copyright (c) 2010-2016 Hewlett Packard Enterprise Development LP The other Usual Suspect, no issues now: registered debug token lcd, 1 lcd_set_enginetime: engineID 80 00 1F 88 80 9C 62 9E 4F E5 08 E2 58 00 00 00 00 : boots=1, time=0 lcd_set_enginetime: engineID 80 00 63 A2 80 2C 23 3A 41 D6 21 00 00 00 01 : boots=0, time=0 lcd_set_enginetime: engineID 80 00 63 A2 80 2C 23 3A 41 D6 21 00 00 00 01 : boots=5, time=25815091 lcd_get_enginetime: engineID 80 00 63 A2 80 2C 23 3A 41 D6 21 00 00 00 01 : boots=5, time=25815091 lcd_set_enginetime: engineID 80 00 63 A2 80 2C 23 3A 41 D6 21 00 00 00 01 : boots=0, time=0 lcd_get_enginetime_ex: engineID 80 00 63 A2 80 2C 23 3A 41 D6 21 00 00 00 01 : boots=0, time=0 lcd_set_enginetime: engineID 80 00 63 A2 80 2C 23 3A 41 D6 21 00 00 00 01 : boots=5, time=25815091 SNMPv2-MIB::sysDescr.0 = STRING: HPE Comware Platform Software, Software Version 7.1.045, Release 2422P01 HPE 5900AF-48XG-4QSFP+ Switch |
Comment by Oleksii Zagorskyi [ 2017 Apr 03 ] |
Hi Janne! We have had many cases which proved that those devices behave incorrectly regarding SNMPv3. Your examples: the 1st has boots=1 (in the lines where time=7278801, which is a realistic uptime in seconds, about 84 days). Will the counter grow after a device reboot? The 2nd example looks healthy because boots=5, so we may suppose it was lower in the past and that previous values are preserved in some flash storage or similar across device reboots. I am in no hurry to close this as a duplicate, but I don't yet see any proof that it's not a duplicate. |
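For reference, the uptime figure quoted above can be reproduced with a small conversion. This is just a sketch of the arithmetic; the helper name `engine_time_to_days` is made up for illustration.

```python
from datetime import timedelta


def engine_time_to_days(engine_time_seconds: int) -> int:
    """Whole days represented by an snmpEngineTime value given in seconds."""
    return timedelta(seconds=engine_time_seconds).days


# time=7278801 is the value reported for the first switch in this thread.
first_switch_days = engine_time_to_days(7278801)
print(first_switch_days)
```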
Comment by Ivan Prokudin [ 2017 Apr 10 ] |
Oleksiy, I've gotten respond from mikrotik. They told me so:
Their words also seem logical. The problem is that you and MikroTik understand "SNMP Engine" differently. You think that "SNMP Engine == whole device", and they understand it as "SNMP daemon (service) on the device". To me they seem more logical than you. Why should any device count its regular reboots? By the way, I've just realized that the issue can also happen after, for example, the link to the SNMP devices goes down and comes back up after some time. How does this situation depend on the "boots" value? |
Comment by Oleksii Zagorskyi [ 2017 Apr 10 ] |
Ivan, they are correct in the 1st part, but that's not related to our case, because if the agent's "snmpEngine" has been changed to a unique value, it will be new to the manager, so any snmpEngineBoots+snmpEngineTime will be accepted by the manager and stored in the library's "enginetime_struct" structure, to be reused later. But their statement "if everything is fine, value stays at 0" sounds incorrect in this context. Any application that runs as a daemon and uses the shared library would be affected the same way. I agree that technically an "SNMP Engine" is a sort of daemon, not a device. But to simplify the discussion we just call it a device (agent role). I'd consider myself too brave to argue with the MikroTik guys, but not in this case, because I've spent a lot of time on the topic and I'm pretty sure of my understanding of the RFC and, correspondingly, of the behavior of libnetsnmp (not Zabbix), which is correct according to, again, the RFC! And last: a link going down/up should not cause the discussed issue if the monitored SNMP device behaves according to the RFC. |
Comment by Ivan Prokudin [ 2017 Apr 10 ] |
Oleksiy, would you be so kind to communicate to mikrotik directly? It think it makes progress slower to send your answers to them and their answers here? I will continue being a such type of transmitter but tell me if you can communicate directly to prove your position. |
Comment by Oleksii Zagorskyi [ 2017 Apr 10 ] |
Note - I did not test any their device(s), so I don't say that their device(s) behave incorrectly. Hmm, honestly speaking I don't see a reason I need to communicate them and prove something. |
Comment by Ivan Prokudin [ 2017 Apr 11 ] |
Mikrotik guys answered me in two messages:
Both vendors don't wanna connect each other, so only users suffer. OK, it's not very hard to forward messages here and to mikrotik. Can you shortly summarize what should I answer them? Especially on the last question? And the second answer from Mikro
/system reboot is the command that reboots mikrotik router. But as I understand you, Oleksiy, boots should be incremented on every boot? For example if router was rebooted because of power loss? Am I right? |
Comment by richlv [ 2017 Apr 11 ] |
iprok, please note that zalex_ua has also referenced the industry-standard netsnmp implementation. mikrotik snmp implementation is not compatible with any vendor that would be using libnetsnmp. |
Comment by Ivan Prokudin [ 2017 Apr 11 ] |
richlv, I'm not trying to tell you that the Zabbix developers are wrong. But I don't have enough skills to debug the problem, resolve it, or explain the solution to MikroTik. They seem to be ready to have a discussion with me (to be honest, quicker than the Zabbix developers: less than a couple of years |
Comment by richlv [ 2017 Apr 12 ] |
iprok, oh, not saying that it is bad to push forward with this - and really glad to hear mikrotik is responding to this. |
Comment by Oleksii Zagorskyi [ 2017 Apr 12 ] |
Ivan, let me clarify that our assumption is based on your tests. I did not state myself that Mikrotik devices are doing something wrong as for SNMPv3 proto communication. Well, at this point, midnight here, I disturbed my cousin, who has as RB2011UiAS-2HnD-IN at home ... Enabling SNMP on the Mikrotik router and after ~20 minutes (spent to configure routing etc) we performed our first test: # snmpget -v 3 -a SHA -A apass123 -l authPriv -u private -x AES -X hostname .1.3.6.1.2.1.1.1.0 -Dlcd No log handling enabled - turning on stderr logging registered debug token lcd, 1 lcd_set_enginetime: engineID 80 00 1F 88 80 8A 8F 5F 06 21 46 ED 58 : boots=1, time=0 lcd_set_enginetime: engineID 80 00 3A 8C 04 31 32 33 31 32 33 31 32 33 31 3233 31 32 33 : boots=0, time=0 lcd_set_enginetime: engineID 80 00 3A 8C 04 31 32 33 31 32 33 31 32 33 31 3233 31 32 33 : boots=0, time=0 lcd_get_enginetime: engineID 80 00 3A 8C 04 31 32 33 31 32 33 31 32 33 31 3233 31 32 33 : boots=0, time=0 lcd_set_enginetime: engineID 80 00 3A 8C 04 31 32 33 31 32 33 31 32 33 31 3233 31 32 33 : boots=0, time=0 lcd_get_enginetime_ex: engineID 80 00 3A 8C 04 31 32 33 31 32 33 31 32 33 31 3233 31 32 33 : boots=0, time=0 lcd_set_enginetime: engineID 80 00 3A 8C 04 31 32 33 31 32 33 31 32 33 31 3233 31 32 33 : boots=0, time=1134 SNMPv2-MIB::sysDescr.0 = STRING: RouterOS RB2011UAS-2HnD then we rebooted it (from winbox tool): # snmpget -v 3 -a SHA -A apass123 -l authPriv -u private -x AES -X epass123 hostname .1.3.6.1.2.1.1.1.0 -Dlcd No log handling enabled - turning on stderr logging registered debug token lcd, 1 lcd_set_enginetime: engineID 80 00 1F 88 80 CD 71 B4 27 26 47 ED 58 : boots=1, time=0 lcd_set_enginetime: engineID 80 00 3A 8C 04 31 32 33 31 32 33 31 32 33 31 3233 31 32 33 : boots=0, time=0 lcd_set_enginetime: engineID 80 00 3A 8C 04 31 32 33 31 32 33 31 32 33 31 3233 31 32 33 : boots=0, time=0 lcd_get_enginetime: engineID 80 00 3A 8C 04 31 32 33 31 32 33 31 32 33 31 3233 31 32 33 : boots=0, time=0 
lcd_set_enginetime: engineID 80 00 3A 8C 04 31 32 33 31 32 33 31 32 33 31 3233 31 32 33 : boots=0, time=0 lcd_get_enginetime_ex: engineID 80 00 3A 8C 04 31 32 33 31 32 33 31 32 33 31 3233 31 32 33 : boots=0, time=0 lcd_set_enginetime: engineID 80 00 3A 8C 04 31 32 33 31 32 33 31 32 33 31 3233 31 32 33 : boots=0, time=107 SNMPv2-MIB::sysDescr.0 = STRING: RouterOS RB2011UAS-2HnD you can see that after reboot, "boots" stays =0, while "time" corresponds with device's (EngineID's, to be precise technically) uptime. In the
So, snmpEngineBoots must be increased after each EngineID restart, but to simplify things we just say after each device reboot. How snmpd, provided by "net-snmp", ensures that engineBoots will be increased next time for sure? - it increases the engineBoots on startup, updates the run-time conf file and uses this increased counter. # ps aux | grep snmpd; grep engineBoots /var/lib/snmp/snmpd.conf; service snmpd start; grep engineBoots /var/lib/snmp/snmpd.conf; ps aux | grep snmpd engineBoots 372 engineBoots 373 snmp 14407 0.0 0.1 69408 16532 ? S 01:13 0:00 /usr/sbin/snmpd -LS5-0d -Lf /dev/null -u snmp -g snmp -I -smux mteTrigger mteTriggerConf -p /run/snmpd.pid # snmpget -v 3 -a MD5 -A publicV3 -l authPriv -u publicV3 -X publicV3 localhost .1.3.6.1.2.1.1.1.0 -Dlcd registered debug token lcd, 1 lcd_set_enginetime: engineID 80 00 1F 88 80 95 B5 C2 6E 56 94 D8 54 00 00 0000 : boots=400, time=0 lcd_set_enginetime: engineID 80 00 1F 88 80 93 4E DE 66 FB 80 95 53 00 00 0000 : boots=0, time=0 lcd_set_enginetime: engineID 80 00 1F 88 80 93 4E DE 66 FB 80 95 53 00 00 0000 : boots=373, time=5 lcd_get_enginetime: engineID 80 00 1F 88 80 93 4E DE 66 FB 80 95 53 00 00 0000 : boots=373, time=5 lcd_set_enginetime: engineID 80 00 1F 88 80 93 4E DE 66 FB 80 95 53 00 00 0000 : boots=0, time=0 lcd_get_enginetime_ex: engineID 80 00 1F 88 80 93 4E DE 66 FB 80 95 53 00 00 0000 : boots=0, time=0 lcd_set_enginetime: engineID 80 00 1F 88 80 93 4E DE 66 FB 80 95 53 00 00 0000 : boots=373, time=5 SNMPv2-MIB::sysDescr.0 = STRING: Linux it0 3.16.0-4-amd64 #1 SMP Debian 3.16.39-1+deb8u2 (2017-03-07) x86_64 Answering to your question - snmpEngineBoots must be increased between each EngineID restart, for any reboot reason - soft reset or power lost does not matter. This is probably very last time when I spent so much time for such discussions |
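Why a counter that stays at 0 matters can be sketched with the timeliness rule from RFC 3414 as applied on the manager side. The code below is a simplified model under stated assumptions, not the libnetsnmp implementation: the names `CachedEngine` and `check_and_update` are hypothetical, the cached time is passed in as the manager's current notion of the remote engine time, and the sample values 3133 and 107 are borrowed from the debug outputs in this thread purely for illustration.

```python
from dataclasses import dataclass

TIME_WINDOW = 150        # seconds allowed by RFC 3414
MAX_BOOTS = 2147483647   # snmpEngineBoots latched at its maximum value


@dataclass
class CachedEngine:
    """Manager-side notion of one remote (authoritative) SNMP engine."""
    boots: int
    time: int             # manager's current notion of the remote engine time
    latest_received: int  # highest engine time seen in an authentic message


def check_and_update(cache: CachedEngine, msg_boots: int, msg_time: int) -> bool:
    """Return True if a message is timely; refresh the cache when it is newer."""
    # Step 1: adopt the message values if they are newer than the cached ones.
    if msg_boots > cache.boots or (
        msg_boots == cache.boots and msg_time > cache.latest_received
    ):
        cache.boots = msg_boots
        cache.time = msg_time
        cache.latest_received = msg_time
    # Step 2: reject messages that fall outside the time window.
    if cache.boots == MAX_BOOTS:
        return False
    if msg_boots < cache.boots:
        return False
    if msg_boots == cache.boots and msg_time < cache.time - TIME_WINDOW:
        return False
    return True


# Device rebooted, boots stayed at 0 and its engine time restarted from a low value.
stale = check_and_update(CachedEngine(0, 3133, 3133), msg_boots=0, msg_time=107)
# Same reboot, but the device incremented boots as the RFC requires.
bumped = check_and_update(CachedEngine(0, 3133, 3133), msg_boots=1, msg_time=107)
# Manager restarted, so it starts with no remembered values for this engine.
fresh = check_and_update(CachedEngine(0, 0, 0), msg_boots=0, msg_time=107)
print(stale, bumped, fresh)
```

In this model only the first case is rejected, which matches the symptom discussed here: responses are dropped until the manager process is restarted and forgets the old engine time.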
Comment by Oleksii Zagorskyi [ 2017 Apr 12 ] |
I'm closing this one as duplicate of Short discussions could be continued. |
Comment by Ivan Prokudin [ 2017 Apr 12 ] |
Oleksiy, thank you very much. I've emailed the MikroTik guys a link to your answer (and also copied it into my mail). You've done a really great job for us (Zabbix users with MikroTik devices). BTW, I've said more than once that I'm ready to give access to everything needed to debug the issue, so that nobody has to waste time configuring a MikroTik device (they also have Cloud Hosted Router, a free virtual machine image for any hypervisor, so there is no need to look for MikroTik hardware). But for now it seems there are no more questions for you. Thank you very much again. You've raised my opinion of the Zabbix support team to a very high level. |
Comment by Janne Korkkula [ 2017 Apr 18 ] |
As I protested earlier, closing this issue off because some home router is proven buggy is/was a bad call... One of our usual suspects, the one with 5 boots and now 313 days of uptime, a HPE 10G switch in one of our datacenters, is currently being considered unavailable by our Zabbix via its dedicated SNMP proxy. All SNMP requests work fine. Here's the relevant bit of the proxy log: 64039:20170415:011340.499 resuming SNMP agent checks on host "srv-irf": connection restored 63983:20170415:011520.852 SNMP agent item "ifAdminStatus[Ten-GigabitEthernet4/0/44]" on host "srv-irf" failed: first network error, wait for 20 seconds 64015:20170415:011610.153 SNMP agent item "ifAdminStatus[Ten-GigabitEthernet4/0/26]" on host "srv-irf" failed: another network error, wait for 20 seconds 64034:20170415:011625.133 SNMP agent item "ifOperStatus[Ten-GigabitEthernet4/0/43]" on host "srv-irf" failed: another network error, wait for 20 seconds 64006:20170415:011640.190 SNMP agent item "ifAdminStatus[Ten-GigabitEthernet2/0/38]" on host "srv-irf" failed: another network error, wait for 20 seconds 64009:20170415:011730.140 temporarily disabling SNMP agent checks on host "srv-irf": host unavailable Note how it says "temporarily"? It stays disabled until the proxy is restarted. And here's the snmpget result, issue still active, ie. no-one has restarted the proxy yet. We have about two hours until it has to be done. 
registered debug token lcd, 1 lcd_set_enginetime: engineID 80 00 1F 88 80 F9 6E 11 3C 41 EA F5 58 00 00 00 00 : boots=1, time=0 lcd_set_enginetime: engineID 80 00 63 A2 80 2C 23 3A 41 D6 21 00 00 00 01 : boots=0, time=0 lcd_set_enginetime: engineID 80 00 63 A2 80 2C 23 3A 41 D6 21 00 00 00 01 : boots=5, time=27117965 lcd_get_enginetime: engineID 80 00 63 A2 80 2C 23 3A 41 D6 21 00 00 00 01 : boots=5, time=27117965 lcd_set_enginetime: engineID 80 00 63 A2 80 2C 23 3A 41 D6 21 00 00 00 01 : boots=0, time=0 lcd_get_enginetime_ex: engineID 80 00 63 A2 80 2C 23 3A 41 D6 21 00 00 00 01 : boots=0, time=0 lcd_set_enginetime: engineID 80 00 63 A2 80 2C 23 3A 41 D6 21 00 00 00 01 : boots=5, time=27117965 SNMPv2-MIB::sysDescr.0 = STRING: HPE Comware Platform Software, Software Version 7.1.045, Release 2422P01 HPE 5900AF-48XG-4QSFP+ Switch Copyright (c) 2010-2015 Hewlett Packard Enterprise Development LP |
Comment by Oleksii Zagorskyi [ 2017 Apr 18 ] |
Janne, let's not mix your case with reporter's one and consider your case from scratch in a new ZBX. |
Comment by Ivan Prokudin [ 2017 Apr 18 ] |
Janne, to prove that it's not fault of vendor of your hardware first of all you should show output of Regardless all this, I (as topic starter) fully confirm that this ticket can be closed as duplicate of |
Comment by Janne Korkkula [ 2017 Apr 18 ] |
Ivan, our problem children (them two HPE 5900's) are not rebooted frequently, those counts of 1 and 5 are very likely to be true. It takes half an hour just to complete a reboot cycle of the larger of the two. The counter problem is not behind our variant of the same symptom, it must be something else.
|
Comment by Théo Castelo N. de Araújo [ 2018 Jan 31 ] |
Hi everyone, I have the same problem: after a host became unavailable, SNMP v2 checks on a router stayed unavailable. Testing from the server itself with snmpwalk, the data is returned normally, but in Zabbix the checks give a timeout error. zabbix server 3.4.1 https://i.imgur.com/8O4NjGD.png Any news about this issue? |
Comment by Ali HBB [ 2019 Dec 04 ] |
The problem is definitely with Zabbix, because we have SolarWinds with the same access and IP range to our Cisco switch, and it shows everything is fine, but Zabbix still considers our Cisco switch as SNMP timed out
|