Type: Problem report
Status: Need info
Affects Version/s: 4.2.8
Fix Version/s: None
Sprint: Sprint 58 (Nov 2019), Sprint 59 (Dec 2019), Sprint 60 (Jan 2020), Sprint 61 (Feb 2020), Sprint 62 (Mar 2020), Sprint 63 (Apr 2020)
It behaved the same way in version 3.4. I tried upgrading, but it did not help.
Steps to reproduce:
- Set up a few SNMPv3 hosts
- After some time (a few to several hours), notice that all pollers (both unreachable and regular ones) are busy. Hosts are supposedly down.
See "Annotation 2019-11-15 133726.jpg"
Please bear with me...
I checked a few things, and this is what I found.
First I looked at the ps output and noticed that the process descriptions of the pollers and unreachable pollers do not update at all. See 'ps-pollers-getting-values.jpg'.
I tried running strace on the main Zabbix process (with child processes), but there was no activity there either. See 'strace-main-zabbix-server-with-child-processes-stuck.jpg'
Then I ran strace on the pollers. All of them were stuck in a select call with no timeout (I waited for a while), reading from descriptor 10, a UDP socket. See 'strace-poller-process-stuck-on-select.jpg' and 'lsof-udp-fd-10.jpg'
What is it doing? Here is a backtrace from gdb. See "gdb-poller-process-bt.jpg"
The Zabbix sources say that Net-SNMP has its own timeout values, so I checked there and found this piece of code (version 5.4.4). Note the /* block without timeout */ comment:
So all my pollers seem to be stuck waiting forever for a response on a UDP socket.
After a server restart, everything goes back to normal.
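To illustrate the behaviour seen in strace, here is a minimal standalone sketch in C. It is not code from Net-SNMP or Zabbix. The helper names open_silent_udp_socket and wait_readable are made up for this example, and the silent loopback socket only stands in for an SNMP agent that never answers. It shows that select() returns on its own only when it is given a timeout; with a NULL timeout it blocks until the descriptor becomes readable, which may never happen on a UDP socket.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: open a UDP socket bound to an ephemeral loopback
 * port. Nothing ever sends to it, which mimics an agent that never replies.
 * Returns the descriptor, or -1 on error. */
static int open_silent_udp_socket(void)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0; /* let the kernel pick a free port */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: wait until fd is readable.
 * timeout_ms >= 0: give up after that many milliseconds.
 * timeout_ms <  0: pass NULL to select(), i.e. block without timeout.
 * Returns 1 if readable, 0 on timeout, -1 on error. */
static int wait_readable(int fd, long timeout_ms)
{
    fd_set readfds;
    struct timeval tv;
    struct timeval *tvp = NULL; /* NULL means "wait forever" */
    int rc;

    assert(fd >= 0 && fd < FD_SETSIZE); /* required by FD_SET */

    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);

    if (timeout_ms >= 0) {
        tv.tv_sec = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;
        tvp = &tv;
    }

    rc = select(fd + 1, &readfds, NULL, NULL, tvp);
    if (rc < 0)
        return -1;
    return rc > 0 ? 1 : 0;
}
```

On the silent socket, wait_readable(fd, 200) comes back with 0 after roughly 200 ms. wait_readable(fd, -1) would never return, which matches what the stuck pollers look like in strace.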