Status: Need info
Sprint 58 (Nov 2019), Sprint 59 (Dec 2019), Sprint 60 (Jan 2020), Sprint 61 (Feb 2020), Sprint 62 (Mar 2020), Sprint 63 (Apr 2020), Sprint 64 (May 2020), Sprint 65 (Jun 2020), Sprint 66 (Jul 2020), Sprint 67 (Aug 2020), Sprint 68 (Sep 2020), Sprint 69 (Oct 2020), Sprint 70 (Nov 2020), Sprint 71 (Dec 2020), Sprint 72 (Jan 2021), Sprint 73 (Feb 2021), Sprint 74 (Mar 2021), Sprint 75 (Apr 2021), Sprint 76 (May 2021), Sprint 77 (Jun 2021), Sprint 78 (Jul 2021), Sprint 79 (Aug 2021), Sprint 80 (Sep 2021), Sprint 81 (Oct 2021), Sprint 82 (Nov 2021), Sprint 83 (Dec 2021)
It acted the same in version 3.4 too. Tried to upgrade but it did not help.
Steps to reproduce:
- Setup few SNMPv3 hosts
- After some time (few to several hours) notice all pollers (both unreachable and regular ones) are busy. Hosts are supposedly down.
see "Annotation 2019-11-15 133726.jpg"
Please bear with me...
I checked few things and this is what I found out.
First I went to see PS output and noticed that poller and unreachable pollers descriptions do not update at all. See 'ps-pollers-getting-values.jpg'.
Tried strace main zabbix process (with child processes), but there was no action there too. See 'strace-main-zabbix-server-with-child-processes-stuck.jpg'
Then went to strace the pollers. All of them were stuck on select call (tried waiting for a bit) without timeout reading from descriptor 10 - a UDP socket. See 'strace-poller-process-stuck-on-select.jpg' and 'lsof-udp-fd-10.jpg'
What's it doing? Here's a backtrace from gdb - see "gdb-poller-process-bt.jpg"
In zabbix sources it said NETSNMP has its own timeout values, I checked there and saw this piece of code (version 5.4.4) - notice / block without timeout / comment:
So all my pollers seem to be stuck waiting forever for a response from UDP socket.
After server restart it goes back to normal.