[ZBX-2152] multiple SNMPv3 checks get unexpected unpredictable "network error" log messages - all about duplicated "SNMP EngineID" Created: 2010 Mar 12  Updated: 2020 Feb 04  Resolved: 2013 Jun 11

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Server (S)
Affects Version/s: 1.8.1
Fix Version/s: None

Type: Incident report Priority: Major
Reporter: Drozhdev Ivan Assignee: Oleksii Zagorskyi
Resolution: Won't fix Votes: 6
Labels: snmpv3
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Debian Etch 4.0, Zabbix 1.8.1


Issue Links:
Duplicate
is duplicated by ZBX-5028 Unusable SNMPv3 performance Closed
is duplicated by ZBX-832 SNMP v3 authentication problems after... Closed
is duplicated by ZBX-12064 Some SNMPv3 hosts don't become availa... Closed
is duplicated by ZBX-15391 SNMPv3 Checks report "Timeout while c... Closed

 Description   

Multiple SNMPv3 queries are not processed properly.

When Zabbix sends requests to SNMPv3-enabled devices, it incorrectly uses the msgAuthoritativeEngineID, msgAuthoritativeEngineBoots and msgAuthoritativeEngineTime parameters that those devices return during authentication at the authNoPriv or authPriv security level. The parameters are unique for each device and depend on:

msgAuthoritativeEngineID: depends on the device used
msgAuthoritativeEngineBoots: depends on the number of engine reboots
msgAuthoritativeEngineTime: depends on the device's uptime

SNMPv3 devices (at the authNoPriv or authPriv security level) respond to a get_snmp request only if all three parameters are set correctly and match the current values held by the device. Otherwise the device answers with a "usmStatsNotInTimeWindows" error, and Zabbix writes "Network Error" to its own logs.

Zabbix saves the msgAuthoritativeEngineID parameter correctly for each device during authentication. However, the two other parameters are not saved correctly: Zabbix uses the values saved for the first device to query the other devices in the network. As a result, only one SNMPv3 device is monitored correctly. It may be a bug in the software.

This problem is described here (in Russian): http://www.zabbix.com/forum/showthread.php?t=16093
A bug with similar symptoms is described here: https://support.zabbix.com/browse/ZBX-832
SNMPv3 query/response message exchange is described here: http://www.insanum.com/docs/usm.html

"As mentioned above, the USM requires that the snmpEngineID, snmpEngineBoots, and snmpEngineTime of the authoritative engine be placed in the msgSecurityParameters. This requires the non-authoritative engine (i.e. manager) to know these values for the authoritative engine (i.e. agent) before a GET, NEXT, or SET operation can be completed.

This is achieved by a discovery process. There are two discovery transactions that occur. The first is to discover the snmpEngineID of the agent. The second is to discover the snmpEngineBoots and snmpEngineTime. The second transaction is only needed if the manager wants to use a security level of authNoPriv or authPriv. This is because the msgAuthoritativeEngineBoots and msgAuthoritativeEngineTime are used by the timeliness module which is part of the authentication process.

The first discovery transaction is initiated by the manager sending an SNMPv3 packet with the msgAuthoritativeEngineID containing a bogus value. When the agent receives a packet where the msgAuthoritativeEngineID is different than its own, the packet is discarded and a discovery packet is returned to the manager. The returned discovery packet contains the correct snmpEngineID which must be used by the manager.

The second discovery transaction requires an authenticated packet be sent to the agent. This means that the authentication flag is set in the msgFlags, and the msgAuthenticationParameters contains the computed message digest for the packet. The secret key used for authenticating the packet is from the user specified in msgUserName. What makes this a discovery packet is that the msgAuthoritativeEngineBoots and msgAuthoritativeEngineTime contain bogus values. When the agent receives this packet, it is first authenticated. Once the authentication is completed, the msgAuthoritativeEngineBoots and msgAuthoritativeEngineTime values are checked. Since the values are bogus, the packet is discarded and a second discovery packet is returned to the manager. The returned discovery packet is authenticated, using the same user, and contains the correct values of the snmpEngineBoots and snmpEngineTime which must be used by the manager."

"Once a manager has learned the snmpEngineBoots and snmpEngineTime of an agent, the manager must maintain its own local notion of what these values are supposed to be. This requires the manager to increment the learned snmpEngineTime every second so the value will be very close to the master values maintained by the agent. If the snmpEngineTime rolls over, then the snmpEngineBoots must be incremented. A manager must keep local notions of these values for each agent in which it wishes to communicate.

The timeliness checks by an agent are considered part of the authentication process and are done right after the received packet has been authenticated. If the msgAuthoritativeEngineBoots is different than the agent's current value of the snmpEngineBoots, the packet is discarded and a discovery packet is sent back to the manager. If that check passes, then the msgAuthoritativeEngineTime is checked against the agent's current value of the snmpEngineTime. If the difference between the two is more or less than 150 seconds, the packet is discarded and a discovery packet is sent back to the manager. If both of the checks pass, then the packet is considered to have been received in a timely manner and processing continues.

The value of +/- 150 seconds for the comparison of the snmpEngineTime is the default value specified by the RFC. This value could be modified to something more suitable based on the speed and size of your network. "
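
The timeliness rule quoted above can be sketched in a few lines of shell. This is only an illustration of the check as the quoted text describes it, not code from Zabbix or NET-SNMP; the function names, the argument order and the USM_TIME_WINDOW variable are hypothetical.

```shell
#!/bin/sh
# Sketch of the USM timeliness check described in the quoted text.
# All names here are hypothetical; this is not Zabbix or NET-SNMP code.

USM_TIME_WINDOW=150   # seconds, the default window mentioned above

# Manager side: estimate the agent's current snmpEngineTime from the value
# learned during discovery plus the seconds elapsed since then.
# Args: learned_time learned_at_epoch now_epoch
estimate_engine_time() {
    echo $(( $1 + ($3 - $2) ))
}

# Agent side: accept a message only if the boots counter matches and the
# time is within the window.
# Args: msg_boots msg_time agent_boots agent_time
# Prints "ok" or "notInTimeWindow".
check_time_window() {
    if [ "$1" -ne "$3" ]; then
        echo "notInTimeWindow"
        return 1
    fi
    diff=$(( $2 - $4 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( 0 - $diff ))
    fi
    if [ "$diff" -gt "$USM_TIME_WINDOW" ]; then
        echo "notInTimeWindow"
        return 1
    fi
    echo "ok"
}
```

For example, if a manager sent the boots and time values it learned from one device (say boots 3, time 1030) to a different device whose own values are boots 7 and time 86400, `check_time_window 3 1030 7 86400` would print "notInTimeWindow", which corresponds to the usmStatsNotInTimeWindows report seen in this issue.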



 Comments   
Comment by Elvar [ 2010 Jun 05 ]

This bug is affecting me as well. I recently started a monitoring project that requires SNMPv3, and while I can manually snmpget data with no issues, Zabbix keeps reporting that the connection timed out. When I took some packet captures, I found similar results.

This also affects Zabbix version 1.8.2

Comment by Aleksandrs Saveljevs [ 2010 Jun 11 ]

We are looking into this problem and cannot seem to reproduce it in our test environment with NET-SNMP 5.4.1.

Does anyone who is having this problem have NET-SNMP 5.4.1 or newer? Some have mentioned that an upgrade to NET-SNMP 5.4 fixed the problem.

We are going to continue investigating the issue, but any additional information would be appreciated.

Comment by Aleksandrs Saveljevs [ 2010 Jun 11 ]

Additionally, I wish to mention that finding out snmpEngineID, snmpEngineBoots, and snmpEngineTime is entirely the business of NET-SNMP. Zabbix opens a new session for each SNMP item and never sees or caches these values.

Comment by Patrick Burns [ 2010 Jun 11 ]

Using 'NET-SNMP version: 5.4.2.1' with Ubuntu Server 10.04 64bit.

As for reproducing the issue: do you have more than one device being monitored via SNMPv3 using AuthNoPriv from the same template? That is when we encounter the issue. The first device of the SNMPv3 template seems OK. After we add a second device, the first device stops polling successfully and we see the following in the logs:

' 22265:20100611:082304.417 Item [Switch02:.1.3.6.1.2.1.2.2.1.10.2] error: Timeout while connecting to [172.16.253.27:161]'

Now, if that device is the only one we are trying to monitor it can poll data without an issue. It is as soon as we add a second or more devices of the same snmpv3 template that it causes problems. Also, we have no problems using 'snmpget' manually to successfully query the devices at any time which is why it seems like a zabbix issue to me.

Comment by Benjamin Coles [ 2010 Jun 12 ]

Worked with Patrick on this issue.

Wireshark shows the following packets:

snmpget
---- 
get-request
report engineids.0=113825
get-next-request IF-MIB 1
get-request IF-MIB 2
get-response IF-MIB 1

zabbix query
---- 
get-request IF-MIB 1
report engineids.0=113115
get-request IF-MIB 1
report usmStatsNotInTimeWindows.0

In the Zabbix capture, it doesn't seem that Zabbix gets a response from the server containing the msgAuthoritativeEngineTime. I don't see a cached user/pass method as described in earlier posts. I've suggested that the user downgrade from 5.4.2.1 to 5.4.1, per Aleksandrs' comment that it ran without problems. Another thing to try is to set up a mock environment of 1.6.5 and see whether this bug was newly introduced in the 1.8.x series.

Comment by Patrick Burns [ 2010 Jun 12 ]

I just finished downgrading to Net-SNMP-5.4.1 so as to be using the same version you devs are testing with and I'm still having the same issue. Devs, when testing please make sure you try using the same snmpv3 template for at least two devices and make sure to use AuthNoPriv or AuthPriv.

Again, I also tried copying the items from the template directly to the host and the issue is still the same.

Comment by Patrick Burns [ 2010 Jun 12 ]


One other thing which I don't think matters but just in case it does, the devices we are monitoring via snmpv3 are being monitored via Zabbix Proxy.

Comment by Patrick Burns [ 2010 Jun 13 ]

Since this issue is completely holding us up on a major project, we created a workaround: a shell script called by Zabbix, which gives us what we need until this bug (or whatever our problem is) is resolved. Below is the shell script we are using as an external check; it is working great so far.

#!/bin/bash
#
# Simple Zabbix snmpget wrapper for SNMPv3 devices.
#

snmpuser="USERNAME"
snmpauth="PASSWORD"

# Show usage and exit
function usage() {
    echo "Usage: $0 anything <ip> <oid>"
    exit 1
}

# Throw the first argument sent by Zabbix away
shift

# Check for host
if [ -z "$1" ]
then
    echo "Host or IP required."
    usage
fi

host="$1"

# Check for OID
if [ -z "$2" ]
then
    echo "OID required."
    usage
fi

oid="$2"

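# Run snmpget on its own and check its exit status; in a pipeline, $? would
# report the status of cut instead of snmpget.
raw=$(snmpget -v 3 -u "$snmpuser" -l AuthNoPriv -a MD5 -A "$snmpauth" -t 1 -r 3 -Oqv "$host" "$oid" 2>/dev/null)

if [ "$?" -eq 0 ]
then
    # Strip the surrounding double quotes from string values
    echo "$raw" | cut -d'"' -f2
else
    exit 1
fi

Comment by Patrick Burns [ 2010 Jun 17 ]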

Thanks to a very helpful Zabbix Support team, we found that what we thought was a bug was actually misconfigured switches on our client's network. They had configured their switches with duplicate msgAuthoritativeEngineIDs, which was the entire problem all along. After our client reconfigured their switches, the issues we were having were no longer present.

Drozhdev Ivan, I'm very curious whether you were dealing with duplicated msgAuthoritativeEngineIDs as well.

Comment by richlv [ 2010 Jun 17 ]

added a note at http://www.zabbix.com/documentation/1.8/manual/config/items#snmp_agent

Comment by Oleksii Zagorskyi [ 2013 Mar 20 ]

I'm reopening this issue to add more details, attachments and add some more notes to documentation.

Comment by Oleksii Zagorskyi [ 2013 Mar 20 ]

I have known about this issue report for a long time and remembered it well, but I did not expect that the same issue could happen to me.

I spent more effort and time figuring out what was going on than I would have if the Zabbix documentation contained more detail than it does at the moment.

That's why I want to describe how to notice this problem, how to classify it correctly, and how to avoid wasting a lot of time on troubleshooting.
I also want to describe how to find the duplicated EngineIDs and their IPs.

Comment by richlv [ 2013 Jun 11 ]

summary : this issue turned out to be about duplicate engineid.

split out desired improvements to documentation and error reporting as ZBXNEXT-1789

zalex_ua I closed it and finally described the topic with new discovered details in ZBX-8385

Comment by Nicola Canepa [ 2013 Jul 06 ]

This behaviour could be a problem with active-standby units with automatic config replication, since the engineID is replicated along with the config (Cisco ASA, for example).
Is there a way (beyond using an external script) to avoid this?

Comment by Michael Schwartzkopff [ 2014 Dec 06 ]

Hi,

the problem is documented in ZBX-4164. Since you write that Zabbix caches the credentials, it is the fault of Zabbix that SNMPv3 is not usable. The problem still exists in Zabbix 2.2 and probably in 2.4. Zabbix uses the wrong snmpTime.

When I restart the Zabbix server, the items get collected again. This is proof that the fault is located within the Zabbix server.

Michael.

Comment by Oleksii Zagorskyi [ 2014 Dec 06 ]

Michael, you are not right in saying "fault is located within the Zabbix server".
I already mentioned ZBX-8385, where you can find more details.

Comment by richlv [ 2017 Apr 26 ]

regarding zabbix detecting duplicate engineids and warning about them in a more useful way, quoting asaveljevs :

My research shows that NET-SNMP provides information about the discovered snmpEngineID. I propose to remember what snmpEngineID sits on what IP:port, and if we discover a collision, we provide a good log and NOTSUPPORTED message to help identify the problem. I do not think we should do anything beyond that.

Comment by gaj [ 2020 Feb 04 ]

i have some missing data with Template_Dell_iDRAC_SNMPv3 (https://github.com/endersonmaia/zabb...ter/dell/idrac),

in log file /var/log/zabbix_server.log i have

54713:20200204:150239.627 enabling SNMP agent checks on host "idrac15": host became available

all items work for one hour; after that they stop working: https://ibb.co/2FKQvL7

do you have any idea?

Generated at Wed Apr 24 16:35:44 EEST 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.