[ZBX-9123] Some SNMP hosts don't become available after being unavailable till zabbix server restart Created: 2014 Dec 08  Updated: 2019 Dec 04  Resolved: 2017 Apr 12

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Server (S)
Affects Version/s: 2.2.7
Fix Version/s: None

Type: Incident report Priority: Major
Reporter: Ivan Prokudin Assignee: Unassigned
Resolution: Duplicate Votes: 2
Labels: snmpv3, unavailable
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Debian 7 wheezy amd64, snmp v3 thru internet


Issue Links:
Duplicate
duplicates ZBX-8385 snmpV3 report (response) "usmStatsNot... Closed
is duplicated by ZBX-11806 Some SNMP hosts still don't become av... Closed

 Description   

I monitor network devices via SNMPv3 over the internet. I use authPriv with AES/DES encryption and SHA authentication.

When power is lost or the internet link to my network goes down, the SNMP devices become unavailable in Zabbix. After the connection is restored, some of the devices sometimes don't become available again until the Zabbix server is restarted. I can't reproduce this on purpose because it doesn't happen consistently.

When it happens I get:
temporarily disabling SNMP agent checks on host "...": host unavailable

and for some devices it's the last message until the Zabbix server is restarted. After restarting the Zabbix server I get:
enabling SNMP agent checks on host "": host became available
very quickly.
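
The symptom described above (a "temporarily disabling" message that is never followed by an "enabling" message for the same host) can be looked for in a server log with a small script. This is only a minimal sketch: the function name stuck_hosts is hypothetical, and it assumes the two message texts quoted in this report appear unchanged in the log.

```python
import re

# Message texts as quoted in this report; anything else in the log is ignored.
DISABLED = re.compile(r'temporarily disabling SNMP agent checks on host "([^"]*)"')
ENABLED = re.compile(r'enabling SNMP agent checks on host "([^"]*)"')


def stuck_hosts(lines):
    """Return hosts whose last SNMP availability message was a 'disabling' one."""
    state = {}
    for line in lines:
        match = DISABLED.search(line)
        if match:
            state[match.group(1)] = "unavailable"
            continue
        match = ENABLED.search(line)
        if match:
            state[match.group(1)] = "available"
    return sorted(host for host, status in state.items() if status == "unavailable")


if __name__ == "__main__":
    sample = [
        'temporarily disabling SNMP agent checks on host "router": host unavailable',
        'temporarily disabling SNMP agent checks on host "firewall": host unavailable',
        'enabling SNMP agent checks on host "firewall": host became available',
    ]
    print(stuck_hosts(sample))
```

A host reported by this script is only a candidate: it may simply still be unreachable, so it needs to be cross-checked with a manual snmpget.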



 Comments   
Comment by Aleksandrs Saveljevs [ 2014 Dec 08 ]

When a device becomes unavailable, it is then monitored by unreachable pollers. What is the value of StartPollersUnreachable in your configuration file?

Comment by Ivan Prokudin [ 2014 Dec 08 ]

It's 5. I know about this setting, so I've tried changing it from the default of
StartPollersUnreachable=1
to
StartPollersUnreachable=5

It doesn't seem to fix the problem.

Comment by Aleksandrs Saveljevs [ 2014 Dec 09 ]

Do the items get polled after host becomes unavailable? That is, does tcpdump show any SNMP traffic attempt between Zabbix and these devices?

When the network is down, all devices affected by this network outage become unavailable. When the network is restored, some devices become available again (with all items being polled on these devices) and some devices do not become available (no items get updated information). You only observe this problem with SNMPv3 hosts and no other hosts (e.g., SNMPv2). Is this understanding correct? On the affected devices, are there only SNMPv3 items? If you add ICMP ping items to these hosts, for instance, are they affected as well?

Comment by Ivan Prokudin [ 2014 Dec 09 ]

> Do the items get polled after host becomes unavailable? That is, does tcpdump show any SNMP traffic attempt between Zabbix and these devices?

I'll answer as soon as some devices become unavailable again and the bug is reproduced.

> You only observe this problem with SNMPv3 hosts and no other hosts (e.g., SNMPv2). Is this understanding correct?

As for SNMP, none of my SNMP devices use anything other than SNMPv3, because I monitor them over the internet and need encryption and authentication.

> On the affected devices, are there only SNMPv3 items? If you add ICMP ping items to these hosts, for instance, are they affected as well?

I have an ICMP ping item on one of the devices and it keeps working well even when SNMPv3 reports that the device is unavailable. By the way, if I try snmpwalk/snmpget while Zabbix says the device is unavailable, it successfully retrieves information.

Comment by Ivan Prokudin [ 2014 Dec 10 ]

I'm not sure if it's related to this bug, but I just had the following situation:
I have a trigger sysUpTime.nodata(300)}=1 on the sysUpTime.0 item of my devices. For one of them it had been firing for several hours although the device was shown as available in Zabbix. I captured traffic for this device with tcpdump for several minutes and then restarted the Zabbix server, and the problem was resolved. Do you need this dump? How can I send it to you privately, since it contains my SNMP passwords?

Comment by Aleksandrs Saveljevs [ 2014 Dec 11 ]

A similar issue with items not being polled was recently reported as ZBX-9016 and has been fixed in 2.4.3rc1. However, in that case items stopped being polled on hosts that were disabled and then reenabled. If you are not disabling your hosts or your items, then this issue is probably different, but you might wish to try 2.4.3rc1, which was released last week.

Debugging both the original problem and the latest one you described would be a bit easier with DebugLevel=4 logs. Note that since Zabbix 2.4.0 the log level can be changed at runtime - see http://blog.zabbix.com/zabbix-2-4-features-part-6-runtime-loglevel-changing/3653/ . In this case, it would make sense to increase the logging level for pollers, unreachable pollers, and history syncers (because they evaluate triggers). If you wish to send me anything, feel free to do so at <my-first-name>.<my-last-name>@zabbix.com.

Comment by Jonathan Rioux [ 2016 May 31 ]

I have the exact same issue right now with 3.0.2

I use SNMPv3 + ICMP on a firewall. Whenever the firewall becomes unavailable (after the UnreachablePeriod), it won't ever become available again until I restart the Zabbix server. The firewall suffers from a non-stop reboot issue that lasts for an hour or so. Once the firewall is back online for good, I have manually checked that snmpwalk and ICMP to the firewall work fine, but Zabbix won't see the firewall as available again until the Zabbix server is restarted. Zabbix detects that the firewall is online using ICMP, yet it won't get SNMP data. So in short, Zabbix won't tell me the firewall has been rebooted (using sysUptime) until I restart the Zabbix server.

Here is the log:

955:20160529:063806.194 SNMP agent item "sysUpTime" on host "firewall" failed: first network error, wait for 15 seconds
959:20160529:063852.804 SNMP agent item "ifOutErrors[ethernet1/5]" on host "firewall" failed: another network error, wait for 15 seconds
960:20160529:063937.494 temporarily disabling SNMP agent checks on host "firewall": host unavailable
Comment by Aleksandrs Saveljevs [ 2016 Jun 01 ]

If you enable DebugLevel=4 or tcpdump, can you see Zabbix trying to do something about these SNMP items after firewall has rebooted? In other words, is the problem that Zabbix does not even attempt to check these items or it somehow fails when trying to?

Comment by Jonathan Rioux [ 2016 Jun 01 ]

Last night, there was a 10-minute power outage (affecting only a firewall, not the Zabbix server) and Zabbix marked the firewall as unavailable. Since then, SNMP on that firewall hasn't worked, and it still doesn't work this morning (ping works though), so I decided to troubleshoot it. Note that this time it's a different firewall, not the same one as in my first post, so I guess it's an issue with any sort of SNMP device.

I did a "log_level_increase" (debuglevel 4) during runtime and I got those logs:

logfile with debuglevel=4
 32684:20160601:112942.245 In get_values_snmp() host:'192.168.1.1' addr:'192.168.1.1' num:1
 32684:20160601:112942.305 SNMPv3 [zabbix@192.168.1.1:161]
 32684:20160601:113012.346 getting SNMP values failed: Timeout while connecting to "192.168.1.1:161".

Then, from the same server (zabbix server), I did a snmpwalk to confirm the snmp is working on the firewall:

snmpwalk result
root@xxxx:/root# snmpwalk -v3 -a SHA -A xxxx -x DES -X xxxx -u zabbix -l authPriv 192.168.1.1 SNMPv2-MIB::sysUpTime.0
SNMPv2-MIB::sysUpTime.0 = Timeticks: (5653400) 15:42:14.00

So the snmp works fine, the firewall is not the problem. Then I did a tcpdump while the zabbix server tried to snmp to the firewall, here is the output:

tcpdump output
root@xxxxx:/root# tcpdump dst 192.168.1.1
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
11:37:23.266891 IP xxxxx > 192.168.1.1: ICMP echo request, id 10641, seq 1, length 64
11:37:24.266898 IP xxxxx  > 192.168.1.1: ICMP echo request, id 10641, seq 7, length 64
11:37:25.267884 IP xxxxx  > 192.168.1.1: ICMP echo request, id 10641, seq 13, length 64
11:37:27.391021 IP xxxxx.54714 > 192.168.1.1.snmp:  F=r U= E=  C= GetRequest(14)
11:37:27.404469 IP xxxxx.54714 > 192.168.1.1.snmp:  F=apr U=zabbix [!scoped PDU]27_49_55_1f_57_d0_b7_36_95_6c_e2_1d_a8_05_0e_74_ef_41_fb_30_44_6c_bc_b2_e3_db_c4_a5_9f_80_91_32_a3_91_7e_ec_47_44_6d_71_2c_1e_c9_00_5e_1e_34_a2_8d_0c_b3_6c_24_9d_14_e7_cc_66_1d_ad_ee_b3_6d_37
11:37:42.395988 IP xxxxx.34127 > 192.168.1.1.snmp:  F=r U= E=  C= GetRequest(14)
11:37:42.409490 IP xxxxx.34127 > 192.168.1.1.snmp:  F=apr U=zabbix [!scoped PDU]5a_d0_78_7f_16_7b_63_f1_1b_7f_e7_c6_b0_49_37_60_2f_51_02_bd_9a_be_89_be_65_b2_e9_04_9f_f8_de_f5_fa_d1_d0_21_2a_3e_3c_86_f9_7d_2d_ff_42_41_fd_b1_ed_32_01_3a_08_d5_d0_43_11_d3_5c_b8_d0_11_cd_10
11:37:42.414761 IP xxxxx.54714 > 192.168.1.1.snmp:  F=apr U=zabbix [!scoped PDU]7d_b8_10_1d_9e_97_ab_91_85_90_17_db_6d_a1_03_c2_12_88_23_bb_2d_b9_31_1e_d8_23_41_78_41_2a_9b_34_0f_06_ef_77_8c_49_7e_f7_7b_50_a8_2a_aa_a3_e1_d8_c3_c9_13_df_6f_78_9e_44_61_a0_51_ca_3e_30_ed_a3
11:37:57.409912 IP xxxxx.49797 > 192.168.1.1.snmp:  F=r U= E=  C= GetRequest(14)
11:37:57.416074 IP xxxxx.34127 > 192.168.1.1.snmp:  F=apr U=zabbix [!scoped PDU]c0_7c_d8_6e_b3_e6_f4_00_81_0d_f7_d7_15_d2_db_7a_c0_ed_71_f9_80_48_14_1c_84_6d_98_eb_e5_ff_98_75_08_9c_a2_15_3d_b7_b6_44_1b_ce_76_5d_3b_14_73_9c_db_d5_ce_2c_1e_28_49_9c_6b_2d_2b_60_e2_2a_ae_d0
11:37:57.423371 IP xxxxx.49797 > 192.168.1.1.snmp:  F=apr U=zabbix [!scoped PDU]5e_1a_a5_ee_72_e4_42_56_74_54_7a_c2_27_ee_ad_9e_5d_11_e0_d9_a9_7c_72_a2_8c_c7_f5_6d_a8_75_c8_ba_77_b8_d0_8e_f7_de_51_0a_58_92_c7_2f_b0_95_b9_f2_be_c6_18_5b_6f_f0_c6_41_47_68_9a_d0_d8_15_df_6a
11:38:12.437055 IP xxxxx.49797 > 192.168.1.1.snmp:  F=apr U=zabbix [!scoped PDU]51_af_14_85_99_df_e2_37_70_df_fc_cb_4d_b1_f0_6f_80_83_f6_33_1d_a9_c6_da_90_3a_60_b4_11_e7_10_8e_03_ba_d7_68_88_99_2d_83_94_8d_de_43_c7_bf_70_ad_69_b4_3d_0c_21_78_6b_27_7c_8b_88_47_86_05_e8_00
11:38:23.320582 IP xxxxx > 192.168.1.1: ICMP echo request, id 10905, seq 1, length 64
11:38:24.320584 IP xxxxx > 192.168.1.1: ICMP echo request, id 10905, seq 8, length 64
11:38:25.321515 IP xxxxx > 192.168.1.1: ICMP echo request, id 10905, seq 15, length 64

Then as soon as I restart the zabbix server daemon, the snmp is working again and I get the email from zabbix saying the firewall has been restarted...

Screenshot from zabbix hosts page showing the firewall snmp times out:
http://i.imgur.com/N2XpzgI.png

What are the next steps?

Comment by Aleksandrs Saveljevs [ 2016 Jun 20 ]

I have tried reproducing the problem with an SNMPv3 device by blocking and unblocking packets with iptables (with an optional device restart between blocking and unblocking), but the issue did not manifest.

The tcpdump you provided shows packets traveling in one direction only. Did they travel in the other direction as well? Could you please also post the tcpdump in binary form (both during the problem and after the problem), so that we can inspect the packets and their differences more closely in Wireshark?

Comment by Janne Korkkula [ 2016 Sep 05 ]

We're also experiencing this exact same issue (temp.disable == gone for good) with 2.2.14 and SNMPv3 core switches, ours are monitored via a routed private network. I increased the timeouts to UnreachableDelay=30 and UnreachablePeriod=180, remains to be seen if it helps any, but of course it won't fix the issue itself.

Comment by richlv [ 2016 Sep 05 ]

jannek, those parameters are not timeouts. you should probably change them back.

Comment by Janne Korkkula [ 2016 Sep 06 ]

True they're not timeouts per se, but increasing UnreachablePeriod should give more time for the clients/network to recover, so at least in theory we should hit this very severe issue less frequently. As for UnreachableDelay of 30, 180/30=6, twice the count of connection attempts vs. the default 45/15, but I returned that value back to 15 since it doesn't matter much.
I did also increase StartPollersUnreachable from the previous value of 5 to 12, the process busy counter now averages at 23% and seems to peak at 85%, both still considerably higher than before, so I'll probably need to increase it even more (maybe to 20).

We can live with a 180 vs. 45 s delay in having (all) hosts considered unavailable until this is fixed, but can't have core network components randomly lose all monitoring.
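
The attempt-count arithmetic in this comment (180/30=6 versus 45/15) can be written out as a short calculation. This is only a sketch of the arithmetic used above, not of Zabbix's actual scheduling, which also depends on Timeout and on when items are due; the function name is hypothetical.

```python
def attempts_before_unavailable(unreachable_period, unreachable_delay):
    """Rough count of retries that fit into UnreachablePeriod, one every UnreachableDelay seconds."""
    if unreachable_delay <= 0:
        raise ValueError("unreachable_delay must be positive")
    return unreachable_period // unreachable_delay


if __name__ == "__main__":
    # Defaults mentioned in the comment: UnreachablePeriod=45, UnreachableDelay=15
    print(attempts_before_unavailable(45, 15))
    # Values tried in the comment: UnreachablePeriod=180, UnreachableDelay=30
    print(attempts_before_unavailable(180, 30))
```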

Comment by richlv [ 2016 Sep 06 ]

got it. note that there's no indication of what could be wrong yet - you'd probably have to debug this yourself, otherwise it is unlikely to be fixed (if there's anything to be fixed at all)

Comment by Janne Korkkula [ 2016 Sep 07 ]

Just to keep you up to date, we migrated SNMP monitoring (both v2 and v3) to one dedicated proxy. Reduces load, enables local settings for UnreachableDelay and UnreachablePeriod and makes debugging a bit more convenient. Of course there's nothing to debug right now, that'd be too easy.

Comment by Ivan Prokudin [ 2017 Jan 26 ]

Hello! I've mostly resolved the issue by setting a different engine-id on each of my routers (shame on me for missing such a simple thing).

But very rarely I'm still getting the issue. I've updated my zabbix to 3.2.3, I use mikrotik routers with the latest routeros.

Right now I have one router stuck in the "host unavailable" state, and I'm not restarting the Zabbix server so that I can investigate. I've just started tcpdump to collect traffic between it and Zabbix. Sorry, but I can't share the dump publicly; please tell me how and to whom I can send it privately. Also please tell me what information is needed.

Some actual information:
Zabbix: 3.2.3 from official repo
OS: Debian Jessie.
SNMP: v3 with aes and sha1 over internet.
I don't use proxy for this monitoring and my zabbix config is:
LogType=system
PidFile=/var/run/zabbix/zabbix_server.pid
DBHost=db
DBName=zabbix
DBUser=zabbix
DBPassword=password
StartPollers=20
StartPollersUnreachable=5
StartPingers=5
StartHTTPPollers=5
CacheSize=128M
ValueCacheSize=512M
Timeout=10
AlertScriptsPath=/etc/zabbix/alertscripts
ExternalScripts=/usr/lib/zabbix/externalscripts
FpingLocation=/usr/bin/fping
Fping6Location=/usr/bin/fping6

Zabbix got no information from the host after these log records:
Jan 26 08:10:30 zabbix zabbix_server[17894]: SNMP agent item "ifInErrors[0-control-vpn]" on host "router" failed: first network error, wait for 15 seconds
Jan 26 08:11:05 zabbix zabbix_server[17904]: SNMP agent item "ifInErrors[0-control-vpn]" on host "router" failed: another network error, wait for 15 seconds
Jan 26 08:11:15 zabbix zabbix_server[17901]: temporarily disabling SNMP agent checks on host "router": host unavailable

Comment by Janne Korkkula [ 2017 Jan 26 ]

Excellent that you've got the dump running, maybe it'll give some clues.

With the dedicated SNMP proxy with tweaked UnreachableDelay and UnreachablePeriod settings everything has worked quite well, but after a Serious Fubar Condition last week the issue surfaced as intermittent SNMPv3 monitoring failures without any apparent reason. Restarting the proxy fixed everything.

UnreachablePollers occasionally hit 90-100% busy with these settings:

DataSenderFrequency=1
StartPollers=100
StartPollersUnreachable=25
StartPingers=3
Timeout=15
UnreachablePeriod=120
UnreachableDelay=20

(I just raised StartPollersUnreachable -> 50 (and StartPingers -> 5), let's see what happens.)

Comment by Ivan Prokudin [ 2017 Feb 03 ]

Ping for Zabbix developers. I have a good dump of SNMP packets for you, 2.5 MB in size. Please pay a little attention here and tell me to whom I can send it privately to resolve this issue.

Comment by Ivan Prokudin [ 2017 Feb 10 ]

Hello! It happens consistently after device reboots. It may or may not be a MikroTik issue (they are the only SNMP devices I have), but I surely need help from the Zabbix developers. May I have your attention please?

Comment by Vladimir Dovgopol [ 2017 Mar 27 ]

Hi team,
I have the same issue: after temporarily losing connection to my remote routers, Zabbix marks these hosts as unavailable. Zabbix monitors those hosts through SNMPv3. After reachability to the remote hosts is restored, SNMP still remains unavailable. The problem is fixed by restarting the Zabbix server process.
Zabbix server version: 3.2.3.
Zabbix conf:
LogFile=/tmp/zabbix_server.log
DBName=zabbix
DBUser=zabbix
DBPassword=fgWt12KyQ
StartPollers=100
StartPollersUnreachable=10
StartPingers=15
Timeout=20
ExternalScripts=${datadir}/zabbix/externalscripts
FpingLocation=/usr/bin/fping
LogSlowQueries=3000

Some logs:
17551:20170327:030134.324 SNMP agent item "CPU_LOAD" on host "Kremenchug WatchGuard" failed: another network error, wait for 15 seconds
17628:20170327:030159.788 temporarily disabling SNMP agent checks on host "Kremenchug WatchGuard": host unavailable
17669:20170327:090828.082 escalation cancelled: host "Kremenchug WatchGuard" disabled.
17610:20170327:090934.338 SNMP agent item "Eth6_out" on host "Kremenchug WatchGuard" failed: first network error, wait for 15 seconds
17626:20170327:091029.673 temporarily disabling SNMP agent checks on host "Kremenchug WatchGuard": host unavailable

I hit this issue several times a week, if you need additional debug information I can provide it to you.

Comment by richlv [ 2017 Mar 27 ]

VladimirDovgopol, when the problem happens, what is printed in the logfile at debuglevel 4 for the unreachable poller process ?
if nothing, what do you see when you attach to it with strace ?

Comment by Jonathan Rioux [ 2017 Mar 27 ]

@richlv, I already posted the log content at debuglevel 4 concerning this issue: comment-183260

logfile with debuglevel=4
32684:20160601:112942.245 In get_values_snmp() host:'192.168.1.1' addr:'192.168.1.1' num:1
32684:20160601:112942.305 SNMPv3 [zabbix@192.168.1.1:161]
32684:20160601:113012.346 getting SNMP values failed: Timeout while connecting to "192.168.1.1:161".
Comment by richlv [ 2017 Mar 27 ]

jorioux, that seems to show a timeout on the device, which is unlikely to be a problem with zabbix

Comment by Jonathan Rioux [ 2017 Mar 27 ]

@richlv, with all due respect, the problem is definitely with Zabbix. While Zabbix is not able to poll the router with SNMP, when I do an snmpwalk from the Zabbix server, it successfully polls the information! How can you explain that?

Comment by Joerg Schwarzwaelder [ 2017 Mar 27 ]

I have exactly the same issue.
I was hoping that an upgrade from 2.4.7 to 3.2.4 fixes this issue, but it still persists.

Comment by Ivan Prokudin [ 2017 Mar 27 ]

I would like to ask all users that have the same issue to post which hardware they are monitoring by SNMP when the issue happens. As for me I have only mikrotik hardware. And also please vote for the bug cause zabbix team will not fix it for free if it's not interesting for many users.

Comment by Joerg Schwarzwaelder [ 2017 Mar 27 ]

we have this issue at least with Citrix Netscalers.

Comment by Jonathan Rioux [ 2017 Mar 27 ]

for me the issue is with a Juniper SSG140.

Comment by Ivan Prokudin [ 2017 Mar 27 ]

Do you use a Zabbix proxy or the server for sending SNMP requests? What OS does it run on? Does it happen only with SNMPv3 or also with earlier versions?

As for me zabbix seems to have some issues with SNMPv3.

Comment by Joerg Schwarzwaelder [ 2017 Mar 27 ]

straight from the Zabbix server running CentOS. We are using SNMPv3.

Comment by Ivan Prokudin [ 2017 Mar 27 ]

Which exact CentOS version do you use?

Comment by Joerg Schwarzwaelder [ 2017 Mar 27 ]

CentOS 6.8

Comment by Vladimir Dovgopol [ 2017 Mar 27 ]

The firewalls are WatchGuard XTM equipment, different models (22, 33, 510, 525); we see this issue on all models. Protocol: SNMPv3.
When the trouble happens again, I'm going to grep some useful logs.
Debian 8

Comment by Ivan Prokudin [ 2017 Mar 27 ]

So it seems that it doesn't depend on the hardware being monitored, nor on the Linux distribution, nor on the net-snmp version. But it happens only with SNMPv3.

Do you all have different engine-id set up on your devices? It's mandatory for SNMPv3.

Comment by Vladimir Dovgopol [ 2017 Mar 28 ]

Yes, I have different engine-id on my devices.

covs-sys-zabb1:/tmp# snmpwalk -v3 ..... 1.3.6.1.6.3.10.2.1.1.0
SNMP-FRAMEWORK-MIB::snmpEngineID.0 = Hex-STRING: 80 00 0C 19 04 38 30 42 45 30 34 36 36 31 30 41
32 34
covs-sys-zabb1:/tmp# snmpwalk -v3 ..... 1.3.6.1.6.3.10.2.1.1.0
SNMP-FRAMEWORK-MIB::snmpEngineID.0 = Hex-STRING: 80 00 0C 19 04 37 30 41 32 30 33 42 31 30 45 33
44 45
covs-sys-zabb1:/tmp# snmpwalk -v3 ..... 1.3.6.1.6.3.10.2.1.1.0
SNMP-FRAMEWORK-MIB::snmpEngineID.0 = Hex-STRING: 80 00 0C 19 04 38 30 42 45 30 34 36 38 43 34 46
41 30
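
A quick way to confirm that engine IDs collected like this really are unique is to normalize the Hex-STRING values and look for repeats. The following is a minimal sketch; the host labels device-1 to device-3 are hypothetical, and the hex values are the three shown above with the wrapped continuation lines joined.

```python
def normalize_engine_id(hex_string):
    """Turn a net-snmp style Hex-STRING ('80 00 0C ...', possibly wrapped) into bytes."""
    return bytes.fromhex("".join(hex_string.split()))


def find_duplicate_engine_ids(engine_ids_by_host):
    """Return {engine_id_hex: [hosts]} for every engine ID used by more than one host."""
    hosts_by_id = {}
    for host, raw in engine_ids_by_host.items():
        hosts_by_id.setdefault(normalize_engine_id(raw), []).append(host)
    return {eid.hex(): hosts for eid, hosts in hosts_by_id.items() if len(hosts) > 1}


if __name__ == "__main__":
    # Hypothetical host labels; engine IDs copied from the snmpwalk output above.
    collected = {
        "device-1": "80 00 0C 19 04 38 30 42 45 30 34 36 36 31 30 41\n32 34",
        "device-2": "80 00 0C 19 04 37 30 41 32 30 33 42 31 30 45 33\n44 45",
        "device-3": "80 00 0C 19 04 38 30 42 45 30 34 36 38 43 34 46\n41 30",
    }
    print(find_duplicate_engine_ids(collected) or "all engine IDs are unique")
```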

Comment by Joerg Schwarzwaelder [ 2017 Mar 28 ]

same here:
[jukaksd@dcfra-zbx-server ~]$ snmpwalk -v3 -u user -l authPriv -a MD5 -A X -x DES -X X IP1 1.3.6.1.6.3.10.2.1.1.0
SNMP-FRAMEWORK-MIB::snmpEngineID.0 = Hex-STRING: 80 00 1F 88 80 36 58 83 24 01 FE C1 55 00 00 00 00
[jukaksd@dcfra-zbx-server ~]$ snmpwalk -v3 -u user -l authPriv -a MD5 -A X -x DES -X X IP2 1.3.6.1.6.3.10.2.1.1.0
SNMP-FRAMEWORK-MIB::snmpEngineID.0 = Hex-STRING: 80 00 1F 88 80 F7 3F 25 67 FB 05 C2 55 00 00 00 00
[jukaksd@dcfra-zbx-server ~]$ snmpwalk -v3 -u user -l authPriv -a MD5 -A X -x DES -X X IP3 1.3.6.1.6.3.10.2.1.1.0
SNMP-FRAMEWORK-MIB::snmpEngineID.0 = Hex-STRING: 80 00 1F 88 80 D9 6A 81 31 CC 69 26 56 00 00 00 00
[jukaksd@dcfra-zbx-server ~]$ snmpwalk -v3 -u user -l authPriv -a MD5 -A X -x DES -X X IP4 1.3.6.1.6.3.10.2.1.1.0
SNMP-FRAMEWORK-MIB::snmpEngineID.0 = Hex-STRING: 80 00 1F 88 80 37 5C 68 33 26 77 26 56 00 00 00 00

Comment by Janne Korkkula [ 2017 Mar 28 ]
  • Zabbix main server 3.2.4 on RHEL 7.3, db and gui on two more RHEL 7.3 hosts, three proxies (RHEL 6.9, mysql) handle host connections
  • Dedicated SNMP proxy
    • has increased timeouts, which help: Timeout=15, UnreachablePeriod=120, UnreachableDelay=20
    • lots of pollers, currently StartPollers=100, StartPollersUnreachable=50, StartPingers=5
    • number of busy pollers may increase considerably when problems arise (not always)
    • avg 50 new values/s, max 110
  • Problem definitely only affects SNMPv3 devices
    • may be caused by any network "hiccup", such as rebooting some small leaf switch somewhere
    • many times now we see bouncing instead of a complete dropping out of monitoring
    • Only a dozen switches/routers are connected with SNMPv3 (most use the SNMPv2 agent, no problems with them)
      • Devices have unique engine-id's
      • Devices with issues are usually large and/or busy
        • HPE 5900AF-48XG-4QSFP+ Switch, Comware / 7.1.045 Release 2422P03, 58 active interfaces
        • HPE 5900AF-48XG-4QSFP+ Switch, Comware / 7.1.045 Release 2422P01, 266 active interfaces
  • SNMPwalk/get works fine, while Zabbix considers the devices unavailable
  • Restarting the proxy always clears all problems
    • "permanently" unavailable devices become available
    • all available/unavailable bouncing stops immediately
    • number of busy pollers drops to normal if it was high
Comment by richlv [ 2017 Mar 28 ]

could this be the same as ZBX-8385 ?

Comment by Ivan Prokudin [ 2017 Mar 28 ]

richlv, I can't answer your question because I don't have the skills to understand ZBX-8385. If somebody can turn it into clear "How to test?" steps, I can test in my environment.
Or, as I've said many times before, I can send a dump of my network traffic to any of the developers. I can't share the dumps publicly because I use SNMPv3 over the internet and, even though SNMPv3 is encrypted, the dumps contain the public IPs of my network devices.

Comment by Oleksii Zagorskyi [ 2017 Mar 28 ]

Hi all, SNMPv3 troubleshooting lover here
Richlv did the correct thing by pointing me here from ZBX-8385.
After reading the current discussion (I didn't know about its existence previously) I feel that all cases will basically be duplicates of ZBX-8385.

If someone wants to send sensitive info, like a tcpdump, to the Zabbix team, you can send it to support at zabbix dot com, and I'll check it myself.
Note - only a raw dump is acceptable (a dump written to a file with -w); text output to the shell is not useful at all.

For devices which have the mysterious issue, please run NOW the test described in the following comment and write down your result.
Then repeat the test after the issue occurs for the device. Please post both test results here.
https://support.zabbix.com/browse/ZBX-8385#comment-114648

Ivan, why don't we see a line with the "snmp_synch_response" text in the posted debug log?
You can see examples in this comment https://support.zabbix.com/browse/ZBX-8385#comment-133477

Comment by Ivan Prokudin [ 2017 Apr 01 ]

Oleksiy, hello!

Sorry, I missed that I needed to run the snmpget command before the problem occurred. So I've run it while the problem existed and after it had gone away (zabbix-server restarted).
When problem exists:

# snmpget -v 3 -a SHA -A apass -l authPriv -u private -x AES -X epass hostname .1.3.6.1.2.1.1.1.0 -Dlcd
registered debug token lcd, 1
lcd_set_enginetime: engineID 80 00 1F 88 80 6E ED 4D 4E ED A6 DF 58 00 00 00 
00 : boots=1, time=0
lcd_set_enginetime: engineID 80 00 3A 8C 04 72 30 32 74 69 63 6B : boots=0, time=0
lcd_set_enginetime: engineID 80 00 3A 8C 04 72 30 32 74 69 63 6B : boots=0, time=0
lcd_get_enginetime: engineID 80 00 3A 8C 04 72 30 32 74 69 63 6B : boots=0, time=0
lcd_set_enginetime: engineID 80 00 3A 8C 04 72 30 32 74 69 63 6B : boots=0, time=0
lcd_get_enginetime_ex: engineID 80 00 3A 8C 04 72 30 32 74 69 63 6B : boots=0, time=0
lcd_set_enginetime: engineID 80 00 3A 8C 04 72 30 32 74 69 63 6B : boots=0, time=3133
SNMPv2-MIB::sysDescr.0 = STRING: RouterOS RB750Gr3

When problem had gone away:

snmpget -v 3 -a SHA -A apass -l authPriv -u private -x AES -X epass host .1.3.6.1.2.1.1.1.0 -Dlcd
registered debug token lcd, 1
lcd_set_enginetime: engineID 80 00 1F 88 80 6B 5A F7 5F 59 A8 DF 58 00 00 00 
00 : boots=1, time=0
lcd_set_enginetime: engineID 80 00 3A 8C 04 72 30 32 74 69 63 6B : boots=0, time=0
lcd_set_enginetime: engineID 80 00 3A 8C 04 72 30 32 74 69 63 6B : boots=0, time=0
lcd_get_enginetime: engineID 80 00 3A 8C 04 72 30 32 74 69 63 6B : boots=0, time=0
lcd_set_enginetime: engineID 80 00 3A 8C 04 72 30 32 74 69 63 6B : boots=0, time=0
lcd_get_enginetime_ex: engineID 80 00 3A 8C 04 72 30 32 74 69 63 6B : boots=0, time=0
lcd_set_enginetime: engineID 80 00 3A 8C 04 72 30 32 74 69 63 6B : boots=0, time=3497
SNMPv2-MIB::sysDescr.0 = STRING: RouterOS RB750Gr3

I also will send an email to you on support at zabbix with two dumps - while problem exists and while zabbix server is restarting.

And I haven't posted any debug logs because it's impossible for me to create them due to the high load on Zabbix. And I can't reproduce the issue on any test systems.

Comment by Oleksii Zagorskyi [ 2017 Apr 01 ]

I was right - it's case b).
In both outputs provided above, and in both tcpdumps (in the 1st unencrypted reply packet), we see boots=0 provided by the SNMPv3 agent (device).
That should not happen with correctly working devices.
The "boots" counter must be increased each time you reboot the device; otherwise a running Zabbix server/proxy will CORRECTLY refuse to communicate further.
The time=3497 means that the device has ~1 hour of uptime after reboot.

This issue may be closed as duplicate.
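
The rejection described above can be illustrated with the timeliness rule from RFC 3414 (section 3.2, step 7b) as applied by a non-authoritative engine, i.e. the manager. This is a minimal sketch of the rule only, not the libnetsnmp code; the function name and the sample values (a cached uptime of one day, a post-reboot time of 60 seconds) are assumptions made for illustration.

```python
TIME_WINDOW = 150        # seconds, the window defined by RFC 3414
MAX_BOOTS = 2147483647   # snmpEngineBoots value that means "latched"


def outside_time_window(cached_boots, cached_time, msg_boots, msg_time):
    """True if the manager must treat the message as not timely.

    cached_* is the manager's notion of the agent's boots/time,
    msg_* is what the agent reports in the incoming message.
    """
    return (
        cached_boots == MAX_BOOTS
        or msg_boots < cached_boots
        or (msg_boots == cached_boots and msg_time < cached_time - TIME_WINDOW)
    )


if __name__ == "__main__":
    # Agent rebooted but kept boots=0 and restarted time from 0: rejected.
    print(outside_time_window(0, 86400, 0, 60))
    # Agent rebooted and incremented boots as the RFC expects: accepted.
    print(outside_time_window(0, 86400, 1, 60))
```

Restarting the manager process discards its cached boots/time values, which would explain why a Zabbix server restart makes the device available again.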

Comment by Ivan Prokudin [ 2017 Apr 01 ]

Oleksiy, got it. That seems clear. But it is strange that so many different kinds of hardware (we have at least 3 people with different hardware in this topic) fail this way. Still, the RFC makes the issue clear. So I've just written to MikroTik with links to this bug and to ZBX-8385. I'll wait for their answer...

Comment by Janne Korkkula [ 2017 Apr 03 ]

Please reconsider carefully before closing this issue and/or ZBX-8385. Certainly there's an issue with monitoring SNMPv3 devices with Zabbix, it hasn't magically disappeared and (also) affects enterprise class network equipment. I'm not convinced snmpEngineBoots has anything to do with this issue, our two most problematic HPE's have been up rather long (84 and 299 d) and we've had several connectivity outages during this. Other monitoring (traffic is logged elsewhere, not in Zabbix) works fine at the same time.

Problem Child, no issues at the moment:

registered debug token lcd, 1
lcd_set_enginetime: engineID 80 00 1F 88 80 27 94 4D 10 D9 08 E2 58 00 00 00 
00 : boots=1, time=0
lcd_set_enginetime: engineID 80 00 63 A2 80 2C 23 3A 47 9C BA 00 00 00 01 : boots=0, time=0
lcd_set_enginetime: engineID 80 00 63 A2 80 2C 23 3A 47 9C BA 00 00 00 01 : boots=1, time=7278801
lcd_get_enginetime: engineID 80 00 63 A2 80 2C 23 3A 47 9C BA 00 00 00 01 : boots=1, time=7278801
lcd_set_enginetime: engineID 80 00 63 A2 80 2C 23 3A 47 9C BA 00 00 00 01 : boots=0, time=0
lcd_get_enginetime_ex: engineID 80 00 63 A2 80 2C 23 3A 47 9C BA 00 00 00 01 : boots=0, time=0
lcd_set_enginetime: engineID 80 00 63 A2 80 2C 23 3A 47 9C BA 00 00 00 01 : boots=1, time=7278801
SNMPv2-MIB::sysDescr.0 = STRING: HPE Comware Platform Software, Software Version 7.1.045, Release 2422P03
HPE 5900AF-48XG-4QSFP+ Switch
Copyright (c) 2010-2016 Hewlett Packard Enterprise Development LP

The other Usual Suspect, no issues now:

registered debug token lcd, 1
lcd_set_enginetime: engineID 80 00 1F 88 80 9C 62 9E 4F E5 08 E2 58 00 00 00 
00 : boots=1, time=0
lcd_set_enginetime: engineID 80 00 63 A2 80 2C 23 3A 41 D6 21 00 00 00 01 : boots=0, time=0
lcd_set_enginetime: engineID 80 00 63 A2 80 2C 23 3A 41 D6 21 00 00 00 01 : boots=5, time=25815091
lcd_get_enginetime: engineID 80 00 63 A2 80 2C 23 3A 41 D6 21 00 00 00 01 : boots=5, time=25815091
lcd_set_enginetime: engineID 80 00 63 A2 80 2C 23 3A 41 D6 21 00 00 00 01 : boots=0, time=0
lcd_get_enginetime_ex: engineID 80 00 63 A2 80 2C 23 3A 41 D6 21 00 00 00 01 : boots=0, time=0
lcd_set_enginetime: engineID 80 00 63 A2 80 2C 23 3A 41 D6 21 00 00 00 01 : boots=5, time=25815091
SNMPv2-MIB::sysDescr.0 = STRING: HPE Comware Platform Software, Software Version 7.1.045, Release 2422P01
HPE 5900AF-48XG-4QSFP+ Switch
Comment by Oleksii Zagorskyi [ 2017 Apr 03 ]

Hi Janne !

We have had many cases which proved that those devices behave incorrectly regarding SNMPv3.
In some cases manufacturers fixed the wrong behavior in their firmware and provided it to users.

Your examples: the 1st has boots=1 (the lines with time=7278801, which is a realistic uptime in seconds - 84 days). Will the counter grow after the device reboots?
Values like 0 or 1 for the "boots" counter should be treated with care, because I don't think that the device has been rebooted only 1 (or 2 at most) times in its life (after being powered on for the first time after firmware burning or so).
You can schedule a device reboot and check it again afterwards. If boots is still =1 - that's case b)

2nd example - looks healthy because boots=5, so we may suppose it was lower in the past and that previous values are preserved on some flash/etc. across device reboots.
Just make sure that it increases to 6 after a reboot, and that the "time" counter starts from 0.

I'm in no hurry to close this as a duplicate, but I don't yet see any proof that it's not a duplicate.
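
The "time" values printed by -Dlcd are in seconds, so they can be compared against a device's known uptime with a simple conversion. A minimal sketch using the three values quoted in this thread (the helper name is hypothetical):

```python
from datetime import timedelta


def engine_time_to_uptime(seconds):
    """Convert an snmpEngineTime value (seconds) into a timedelta."""
    return timedelta(seconds=seconds)


if __name__ == "__main__":
    for label, seconds in [
        ("MikroTik RB750Gr3", 3497),
        ("HPE 5900AF, Release 2422P03", 7278801),
        ("HPE 5900AF, Release 2422P01", 25815091),
    ]:
        print(f"{label}: time={seconds} -> {engine_time_to_uptime(seconds)}")
```

The two HPE values come out at roughly 84 and 299 days, matching the uptimes reported for those switches, and the MikroTik value at just under an hour.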

Comment by Ivan Prokudin [ 2017 Apr 10 ]

Oleksiy, I've got a response from MikroTik. They told me this:

Our interpretation of RFC is the following,

  • snmpEngineBoots, which is a count of the number of times the SNMP
    engine has re-booted/re-initialized since snmpEngineID was last
    configured; and,
  • once router boots it sets value to 0;
  • if SNMP app (engine) crashes or reinitialises during the operation, count is incremented;
  • if everything is fine, value stays at 0.

Their words also seem logical. The problem is that you and MikroTik understand "SNMP Engine" differently. You think that "SNMP Engine == whole device" and they understand it as "SNMP daemon (service) on the device". And to me they seem more logical than you. Why should any device count its regular reboots?

By the way, I've just realized that the issue can also happen when, for example, the link to the SNMP devices goes down and comes back up after some time. How does this situation depend on the "boots" value?

Comment by Oleksii Zagorskyi [ 2017 Apr 10 ]

Ivan, they are correct in the 1st part, but that's not related to our case: if the agent's snmpEngineID has been changed to a unique value, it will be new for the manager, so any snmpEngineBoots+snmpEngineTime will be accepted by the manager and stored in the library's "enginetime_struct" structure, to be reused next time.
Again, that's not a problem for zabbix (manager).

But their statement "if everything is fine, value stays at 0" sounds incorrect in this context.
If the agent (device) is rebooted (for any possible reason like crash/hard/soft reset etc), snmpEngineBoots must be increased on the next startup.
Otherwise, according to the RFC, the manager must reject the agent's suggested snmpEngineBoots+snmpEngineTime combination, as it should be considered an attack.
This logic is coded in libnetsnmp, not in zabbix code.

Any application that runs as a daemon and uses the shared library would be affected in the same way.

I agree that technically the "SNMP Engine" is a sort of daemon, not a device. But to simplify the discussion we just call it a device (agent role).
As for reboots - we mean the daemon restart. I shared all these details here https://support.zabbix.com/browse/ZBX-8385#comment-114210 but probably not many people reached that comment

I'd consider myself too brave to argue with the Mikrotik guys, but not in this case, because I've spent a lot of time on the topic and I'm pretty sure of my understanding of the RFC and, correspondingly, of the behavior of libnetsnmp (not zabbix), which is correct according to, again, the RFC!

And last - the link going down/up should not cause the discussed issue if the monitored snmp device behaves according to the RFC.
I'm ready to look at your tcpdump captures, preferably gathered both before and after the link issue.
Just remember that the link issue may originate at different points; for example, the buggy snmp device has lost power -> rebooted, which of course will be our case b).

Comment by Ivan Prokudin [ 2017 Apr 10 ]

Oleksiy, would you be so kind as to communicate with Mikrotik directly? I think it slows progress to send your answers to them and their answers here. I will continue acting as this kind of relay, but tell me if you can communicate directly to prove your position.

Comment by Oleksii Zagorskyi [ 2017 Apr 10 ]

Note - I did not test any of their devices, so I am not saying that their devices behave incorrectly.

Hmm, honestly speaking, I don't see a reason why I need to contact them and prove something.
But they are welcome to continue the discussion here, if required

Comment by Ivan Prokudin [ 2017 Apr 11 ]

Mikrotik guys answered me in two messages:
The first one:

Zabbix has not contacted us about this case.

We still do not see point, where MikroTik SNMP implementation violates RFC.

What kind of information do you track by SNMP engine boots?

Neither vendor wants to contact the other, so only users suffer. OK, it's not very hard to forward messages here and to Mikrotik. Can you briefly summarize what I should answer them? Especially to the last question?

And the second answer from Mikrotik:

We will modify to increment SNMP engine boots after /system reboot.

/system reboot is the command that reboots a Mikrotik router. But as I understand you, Oleksiy, boots should be incremented on every boot? For example, if the router was rebooted because of a power loss? Am I right?

Comment by richlv [ 2017 Apr 11 ]

iprok, please note that zalex_ua has also referenced the industry-standard netsnmp implementation. mikrotik snmp implementation is not compatible with any vendor that would be using libnetsnmp.

Comment by Ivan Prokudin [ 2017 Apr 11 ]

richlv, I am not trying to tell you that the zabbix developers are wrong. But I don't have enough skills to debug the problem, resolve it, or tell Mikrotik what to do. They seem ready to have a discussion with me (to be honest, quicker than the zabbix developers - less than a couple of years ), so the only problem is telling them what to do and proving it. And I kindly ask Oleksiy to help me with that.

Comment by richlv [ 2017 Apr 12 ]

iprok, oh, not saying that it is bad to push forward with this - and really glad to hear mikrotik is responding to this.
but it is also important to note that zabbix team might not have a mikrotik device that misbehaves, and i know zalex_ua in general is too eager to dig into problems - we have to save his energy a bit

Comment by Oleksii Zagorskyi [ 2017 Apr 12 ]

Ivan, let me clarify that our assumption is based on your tests. I did not claim myself that Mikrotik devices are doing something wrong in SNMPv3 protocol communication.

Well, at this point, midnight here, I disturbed my cousin, who has an RB2011UiAS-2HnD-IN at home ...
He said that the firmware version should be recent.

After enabling SNMP on the Mikrotik router, and ~20 minutes later (spent configuring routing etc), we performed our first test:

# snmpget -v 3 -a SHA -A apass123 -l authPriv -u private -x AES -X hostname .1.3.6.1.2.1.1.1.0 -Dlcd
No log handling enabled - turning on stderr logging
registered debug token lcd, 1
lcd_set_enginetime: engineID 80 00 1F 88 80 8A 8F 5F 06 21 46 ED 58 : boots=1, time=0
lcd_set_enginetime: engineID 80 00 3A 8C 04 31 32 33 31 32 33 31 32 33 31 32 33 31 32 33 : boots=0, time=0
lcd_set_enginetime: engineID 80 00 3A 8C 04 31 32 33 31 32 33 31 32 33 31 32 33 31 32 33 : boots=0, time=0
lcd_get_enginetime: engineID 80 00 3A 8C 04 31 32 33 31 32 33 31 32 33 31 32 33 31 32 33 : boots=0, time=0
lcd_set_enginetime: engineID 80 00 3A 8C 04 31 32 33 31 32 33 31 32 33 31 32 33 31 32 33 : boots=0, time=0
lcd_get_enginetime_ex: engineID 80 00 3A 8C 04 31 32 33 31 32 33 31 32 33 31 32 33 31 32 33 : boots=0, time=0
lcd_set_enginetime: engineID 80 00 3A 8C 04 31 32 33 31 32 33 31 32 33 31 32 33 31 32 33 : boots=0, time=1134
SNMPv2-MIB::sysDescr.0 = STRING: RouterOS RB2011UAS-2HnD

then we rebooted it (from winbox tool):

# snmpget -v 3 -a SHA -A apass123 -l authPriv -u private -x AES -X epass123 hostname .1.3.6.1.2.1.1.1.0 -Dlcd
No log handling enabled - turning on stderr logging
registered debug token lcd, 1
lcd_set_enginetime: engineID 80 00 1F 88 80 CD 71 B4 27 26 47 ED 58 : boots=1, time=0
lcd_set_enginetime: engineID 80 00 3A 8C 04 31 32 33 31 32 33 31 32 33 31 32 33 31 32 33 : boots=0, time=0
lcd_set_enginetime: engineID 80 00 3A 8C 04 31 32 33 31 32 33 31 32 33 31 32 33 31 32 33 : boots=0, time=0
lcd_get_enginetime: engineID 80 00 3A 8C 04 31 32 33 31 32 33 31 32 33 31 32 33 31 32 33 : boots=0, time=0
lcd_set_enginetime: engineID 80 00 3A 8C 04 31 32 33 31 32 33 31 32 33 31 32 33 31 32 33 : boots=0, time=0
lcd_get_enginetime_ex: engineID 80 00 3A 8C 04 31 32 33 31 32 33 31 32 33 31 32 33 31 32 33 : boots=0, time=0
lcd_set_enginetime: engineID 80 00 3A 8C 04 31 32 33 31 32 33 31 32 33 31 32 33 31 32 33 : boots=0, time=107
SNMPv2-MIB::sysDescr.0 = STRING: RouterOS RB2011UAS-2HnD

You can see that after the reboot "boots" stays =0, while "time" corresponds to the device's (the EngineID's, to be technically precise) uptime.
At this point I can say for sure - Mikrotik behaves incorrectly!
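
The comparison of the two runs above can be automated. The following is only a sketch: it assumes the `-Dlcd` debug lines keep the `lcd_set_enginetime: engineID ... : boots=N, time=M` shape shown in this ticket, and the helper names `last_engine_state` and `boots_increased` are made up for illustration (they are not part of net-snmp or zabbix).

```python
import re

# Matches debug lines such as:
#   lcd_set_enginetime: engineID 80 00 3A 8C ... : boots=0, time=1134
LCD_LINE = re.compile(
    r"^lcd_set_enginetime: engineID (?P<id>[0-9A-Fa-f ]+?)\s*: "
    r"boots=(?P<boots>\d+), time=(?P<time>\d+)\s*$"
)


def last_engine_state(debug_output, engine_id_prefix):
    """Return (boots, time) from the last lcd_set_enginetime line whose
    engineID starts with engine_id_prefix, or None if there is no such line."""
    wanted = engine_id_prefix.replace(" ", "").upper()
    state = None
    for line in debug_output.splitlines():
        m = LCD_LINE.match(line.strip())
        if not m:
            continue
        # Drop spaces so that differently spaced hex dumps compare equal.
        engine_id = m.group("id").replace(" ", "").upper()
        if engine_id.startswith(wanted):
            state = (int(m.group("boots")), int(m.group("time")))
    return state


def boots_increased(before, after):
    """True if the boots counter grew between two (boots, time) states."""
    return after[0] > before[0]


if __name__ == "__main__":
    # Last lcd_set_enginetime line of each run quoted in this comment.
    run1 = ("lcd_set_enginetime: engineID 80 00 3A 8C 04 31 32 33 31 32 33 "
            "31 32 33 31 32 33 31 32 33 : boots=0, time=1134")
    run2 = ("lcd_set_enginetime: engineID 80 00 3A 8C 04 31 32 33 31 32 33 "
            "31 32 33 31 32 33 31 32 33 : boots=0, time=107")
    before = last_engine_state(run1, "80 00 3A 8C")
    after = last_engine_state(run2, "80 00 3A 8C")
    print(before, after, boots_increased(before, after))
    # prints: (0, 1134) (0, 107) False
```

The engineID prefix is needed because the first debug line in each run belongs to the local engine of the snmpget process, not to the router.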

In ZBX-8385 I already quoted the related parts of RFC3414 (did anyone try to read that, btw? )
Let me copy some important parts here, to pass on to the Mikrotik guys:

An authoritative SNMP engine is required to maintain the values of
its snmpEngineID and snmpEngineBoots in non-volatile storage.
...
Rather, each time an SNMP engine
re-boots, it retrieves, increments, and then stores snmpEngineBoots
in non-volatile storage, and resets snmpEngineTime to zero.
...
If snmpEngineTime ever
reaches its maximum value (2147483647), then snmpEngineBoots is
incremented as if the SNMP engine has re-booted and snmpEngineTime is
reset to zero and starts incrementing again.
...
2) if any of the following conditions is true, then the
message is considered to be outside of the Time Window:

  • the local notion of the value of snmpEngineBoots is
    2147483647;
  • the value of the msgAuthoritativeEngineBoots field is
    less than the local notion of the value of
    snmpEngineBoots; or,
  • the value of the msgAuthoritativeEngineBoots field is
    equal to the local notion of the value of snmpEngineBoots
    and the value of the msgAuthoritativeEngineTime field is
    more than 150 seconds less than the local notion of the
    value of snmpEngineTime.

...
Note that this procedure allows for the value of
msgAuthoritativeEngineBoots in the message to be greater
than the local notion of the value of snmpEngineBoots to
allow for received messages to be accepted as authentic
when received from an authoritative SNMP engine that has
re-booted since the receiving SNMP engine last
(re-)synchronized.

So, snmpEngineBoots must be increased after each EngineID restart, but to simplify things we just say after each device reboot.
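
To make the quoted Time Window conditions concrete, here is a minimal sketch of the check as a manager would apply it. It follows only the three conditions quoted above and is not the libnetsnmp code; the function and parameter names are illustrative.

```python
MAX_ENGINE_BOOTS = 2147483647  # maximum value named in the RFC text above
TIME_WINDOW_SECONDS = 150      # allowed lag of msgAuthoritativeEngineTime


def outside_time_window(local_boots, local_time, msg_boots, msg_time):
    """Return True if a message from the agent must be treated as outside
    the Time Window.

    local_boots/local_time: the manager's local notion (cached values) of
    the agent's snmpEngineBoots and snmpEngineTime.
    msg_boots/msg_time: msgAuthoritativeEngineBoots and
    msgAuthoritativeEngineTime taken from the received message.
    """
    if local_boots == MAX_ENGINE_BOOTS:
        return True
    if msg_boots < local_boots:
        return True
    if msg_boots == local_boots and msg_time < local_time - TIME_WINDOW_SECONDS:
        return True
    # A greater msg_boots is accepted: the agent has re-booted since the
    # manager last synchronized.
    return False


if __name__ == "__main__":
    # Manager cached boots=0, time=1134; the agent reboots without
    # incrementing boots and answers with boots=0, time=107.
    print(outside_time_window(0, 1134, 0, 107))   # True: message rejected
    # The same reboot with boots incremented to 1.
    print(outside_time_window(0, 1134, 1, 107))   # False: message accepted
```

With the values from the Mikrotik test, a long-running manager that cached boots=0, time=1134 would reject the post-reboot reply until its cache is cleared, which matches the "works again after restart" symptom.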

How does snmpd, provided by "net-snmp", ensure that engineBoots will definitely be increased next time? It increases engineBoots on startup, updates the run-time conf file and uses this increased counter.
Example on my Linux box:

# ps aux | grep snmpd; grep engineBoots /var/lib/snmp/snmpd.conf; service snmpd start; grep engineBoots /var/lib/snmp/snmpd.conf; ps aux | grep snmpd
engineBoots 372
engineBoots 373
snmp     14407  0.0  0.1  69408 16532 ?        S    01:13   0:00 /usr/sbin/snmpd -LS5-0d -Lf /dev/null -u snmp -g snmp -I -smux mteTrigger mteTriggerConf -p /run/snmpd.pid
# snmpget -v 3 -a MD5 -A publicV3 -l authPriv -u publicV3 -X publicV3 localhost .1.3.6.1.2.1.1.1.0 -Dlcd
registered debug token lcd, 1
lcd_set_enginetime: engineID 80 00 1F 88 80 95 B5 C2 6E 56 94 D8 54 00 00 00 00 : boots=400, time=0
lcd_set_enginetime: engineID 80 00 1F 88 80 93 4E DE 66 FB 80 95 53 00 00 00 00 : boots=0, time=0
lcd_set_enginetime: engineID 80 00 1F 88 80 93 4E DE 66 FB 80 95 53 00 00 00 00 : boots=373, time=5
lcd_get_enginetime: engineID 80 00 1F 88 80 93 4E DE 66 FB 80 95 53 00 00 00 00 : boots=373, time=5
lcd_set_enginetime: engineID 80 00 1F 88 80 93 4E DE 66 FB 80 95 53 00 00 00 00 : boots=0, time=0
lcd_get_enginetime_ex: engineID 80 00 1F 88 80 93 4E DE 66 FB 80 95 53 00 00 00 00 : boots=0, time=0
lcd_set_enginetime: engineID 80 00 1F 88 80 93 4E DE 66 FB 80 95 53 00 00 00 00 : boots=373, time=5
SNMPv2-MIB::sysDescr.0 = STRING: Linux it0 3.16.0-4-amd64 #1 SMP Debian 3.16.39-1+deb8u2 (2017-03-07) x86_64

Answering your question - snmpEngineBoots must be increased after each EngineID restart, for any reboot reason - soft reset or power loss, it does not matter.
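
The "retrieve, increment, store" rule from the RFC can be sketched as follows. This is an illustration only: the one-number file format and the function name `next_engine_boots` are assumptions, not how net-snmp stores engineBoots in its snmpd.conf, and the wrap-around at 2147483647 is left out.

```python
import os


def next_engine_boots(path):
    """Read the stored boots counter, increment it, write it back to
    non-volatile storage and return the new value.

    Meant to be called once on every engine start, before any SNMPv3
    message is answered; snmpEngineTime then restarts from zero.
    """
    try:
        with open(path) as f:
            boots = int(f.read().strip() or 0)
    except FileNotFoundError:
        boots = 0  # first start ever: nothing has been stored yet

    boots += 1

    # Write to a temporary file and rename it, so that a power loss during
    # the write does not leave a truncated counter behind.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(str(boots))
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)
    return boots
```

Because the counter is written before the engine starts answering, it does not matter whether the previous run ended with a clean shutdown, a crash or a power loss: the next start always reports a higher value.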

This is probably the very last time I spend so much time on such discussions

Comment by Oleksii Zagorskyi [ 2017 Apr 12 ]

I'm closing this one as a duplicate of ZBX-8385

Short discussions may continue.

Comment by Ivan Prokudin [ 2017 Apr 12 ]

Oleksiy, thank you very much. I've emailed the Mikrotik guys a link to your answer (and also copied it into my mail). You've done a really great job for us (zabbix users with Mikrotiks). BTW, I've said more than once that I'm ready to give access to everything needed to debug the issue, so that nobody has to waste time configuring a Mikrotik device (BTW they have Cloud Hosted Router - a free virtual machine image for any hypervisor, so there is no need to look for a hardware Mikrotik device). But for now it seems there are no more questions for you. Thank you very much again. You've raised my opinion of the zabbix support team to a very high level.

Comment by Janne Korkkula [ 2017 Apr 18 ]

As I protested earlier, closing this issue off because some home router is proven buggy is/was a bad call...

One of our usual suspects, the one with 5 boots and now 313 days of uptime, a HPE 10G switch in one of our datacenters, is currently being considered unavailable by our Zabbix via its dedicated SNMP proxy. All SNMP requests work fine.

Here's the relevant bit of the proxy log:

 64039:20170415:011340.499 resuming SNMP agent checks on host "srv-irf": connection restored
 63983:20170415:011520.852 SNMP agent item "ifAdminStatus[Ten-GigabitEthernet4/0/44]" on host "srv-irf" failed: first network error, wait for 20 seconds
 64015:20170415:011610.153 SNMP agent item "ifAdminStatus[Ten-GigabitEthernet4/0/26]" on host "srv-irf" failed: another network error, wait for 20 seconds
 64034:20170415:011625.133 SNMP agent item "ifOperStatus[Ten-GigabitEthernet4/0/43]" on host "srv-irf" failed: another network error, wait for 20 seconds
 64006:20170415:011640.190 SNMP agent item "ifAdminStatus[Ten-GigabitEthernet2/0/38]" on host "srv-irf" failed: another network error, wait for 20 seconds
 64009:20170415:011730.140 temporarily disabling SNMP agent checks on host "srv-irf": host unavailable

Note how it says "temporarily"? It stays disabled until the proxy is restarted.
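
A quick way to list hosts in this state from a proxy or server log is sketched below. It assumes the log lines look like the ones quoted in this ticket; the helper name `hosts_left_disabled` is illustrative. It only reports hosts whose most recent availability message is "temporarily disabling", so whether a host is really stuck still depends on how long ago that was.

```python
import re

# Matches availability messages as they appear in this ticket, e.g.
#   64009:20170415:011730.140 temporarily disabling SNMP agent checks on host "srv-irf": host unavailable
EVENT = re.compile(
    r'(?P<ts>\d{8}:\d{6}\.\d{3}) '
    r'(?P<action>temporarily disabling|enabling|resuming) '
    r'SNMP agent checks on host "(?P<host>[^"]*)"'
)


def hosts_left_disabled(log_lines):
    """Return {host: timestamp} for hosts whose last availability message
    in log_lines is 'temporarily disabling SNMP agent checks'."""
    last_event = {}
    for line in log_lines:
        m = EVENT.search(line)
        if m:
            last_event[m.group("host")] = (m.group("action"), m.group("ts"))
    return {
        host: ts
        for host, (action, ts) in last_event.items()
        if action == "temporarily disabling"
    }


if __name__ == "__main__":
    sample = [
        ' 64039:20170415:011340.499 resuming SNMP agent checks on host "srv-irf": connection restored',
        ' 64009:20170415:011730.140 temporarily disabling SNMP agent checks on host "srv-irf": host unavailable',
    ]
    print(hosts_left_disabled(sample))
    # prints: {'srv-irf': '20170415:011730.140'}
```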

And here's the snmpget result, issue still active, ie. no-one has restarted the proxy yet. We have about two hours until it has to be done.

registered debug token lcd, 1
lcd_set_enginetime: engineID 80 00 1F 88 80 F9 6E 11 3C 41 EA F5 58 00 00 00 00 : boots=1, time=0
lcd_set_enginetime: engineID 80 00 63 A2 80 2C 23 3A 41 D6 21 00 00 00 01 : boots=0, time=0
lcd_set_enginetime: engineID 80 00 63 A2 80 2C 23 3A 41 D6 21 00 00 00 01 : boots=5, time=27117965
lcd_get_enginetime: engineID 80 00 63 A2 80 2C 23 3A 41 D6 21 00 00 00 01 : boots=5, time=27117965
lcd_set_enginetime: engineID 80 00 63 A2 80 2C 23 3A 41 D6 21 00 00 00 01 : boots=0, time=0
lcd_get_enginetime_ex: engineID 80 00 63 A2 80 2C 23 3A 41 D6 21 00 00 00 01 : boots=0, time=0
lcd_set_enginetime: engineID 80 00 63 A2 80 2C 23 3A 41 D6 21 00 00 00 01 : boots=5, time=27117965
SNMPv2-MIB::sysDescr.0 = STRING: HPE Comware Platform Software, Software Version 7.1.045, Release 2422P01
HPE 5900AF-48XG-4QSFP+ Switch
Copyright (c) 2010-2015 Hewlett Packard Enterprise Development LP

Comment by Oleksii Zagorskyi [ 2017 Apr 18 ]

Janne, let's not mix your case with the reporter's one; let's consider your case from scratch in a new ZBX.
If you are sure that something is wrong with zabbix, please file a ZBX with as much detail as you can: more logs, host XML export, proxy config etc.
Then let me know the ZBX number.
ADDED: and try disabling the bulk requests option for the snmp host interface.

Comment by Ivan Prokudin [ 2017 Apr 18 ]

Janne, to prove that it's not the fault of your hardware vendor, first of all you should show the output of
snmpget -v 3 -a SHA -A apass -l authPriv -u private -x AES -X epass host .1.3.6.1.2.1.1.1.0 -Dlcd
sent to your device after several reboots - one request after each reboot (if you can't reboot your production switch, try the same hardware model) to be sure that it increases the boots number as Oleksiy and the RFC describe. If it behaves according to the RFC, then it may be zabbix's fault. If not, then it's your vendor's fault.

Regardless of all this, I (as the topic starter) fully confirm that this ticket can be closed as a duplicate of ZBX-8385. I can't rename its summary to highlight that it's a Mikrotik issue, not a zabbix one.

Comment by Janne Korkkula [ 2017 Apr 18 ]

Ivan, our problem children (them two HPE 5900's) are not rebooted frequently, those counts of 1 and 5 are very likely to be true. It takes half an hour just to complete a reboot cycle of the larger of the two. The counter problem is not behind our variant of the same symptom, it must be something else.

ZBX-12064 now open..

Comment by Théo Castelo N. de Araújo [ 2018 Jan 31 ]

Hi everyone, I have the same problem: after a host became unavailable, SNMP v2 checks on a router stayed unavailable. Testing from the server itself with snmpwalk, the data is returned normally, but zabbix reports a timeout error.

zabbix server 3.4.1
snmp v2

https://i.imgur.com/8O4NjGD.png

Any news about this issue?

Comment by Ali HBB [ 2019 Dec 04 ]

The problem is definitely in zabbix, because we have SolarWinds with the same access and IP range to our Cisco switch, and it shows everything is fine, but zabbix still considers our Cisco switch as SNMP timed out

 

Generated at Thu Apr 18 09:01:58 EEST 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.