Incident report
Resolution: Fixed
Major
3.0.1
Linux (Debian/testing)
I updated to Zabbix 3.0.1 (from 2.4.7), and after a random amount of time Zabbix mostly stops working. All internal and simple checks still work, but no incoming data (i.e. trapper items) is processed.
netstat shows that trapper TCP connections never time out. strace shows that the call recvfrom(9<TCP:[server:10051->agent:55098]>, "Z", 1, MSG_PEEK, NULL, NULL) never returns (until a manual SIGALRM or similar).
trapper_thread() basically calls zbx_tcp_accept() followed by zbx_tcp_recv_to(), but only the latter calls zbx_socket_timeout_set().
Since TLS support was added, zbx_tcp_accept() peeks at the first byte with MSG_PEEK. This happens before the socket timeout is set, so the call waits indefinitely if the peer never sends anything.
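To illustrate the ordering problem, here is a minimal standalone sketch, not Zabbix code. It assumes the timeout is applied with SO_RCVTIMEO, which may differ from what zbx_socket_timeout_set() actually does. The helper names set_recv_timeout(), peek_first_byte() and demo_peek() are hypothetical, and a socketpair() stands in for an agent that connects and then stays silent. The point is only that the timeout has to be in place before the MSG_PEEK call so that the peek can give up.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not a Zabbix function): bound every blocking
 * receive on fd, including MSG_PEEK, to timeout_sec seconds. */
static int set_recv_timeout(int fd, int timeout_sec)
{
	struct timeval tv;

	tv.tv_sec = timeout_sec;
	tv.tv_usec = 0;

	return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}

/* Hypothetical helper: set the timeout FIRST, then peek at the first byte.
 * Returns 1 if a byte is available, 0 if the peek timed out,
 * -1 on any other error or if the peer closed the connection. */
static int peek_first_byte(int fd, int timeout_sec, char *first)
{
	ssize_t n;

	if (0 != set_recv_timeout(fd, timeout_sec))
		return -1;

	n = recv(fd, first, 1, MSG_PEEK);

	if (1 == n)
		return 1;

	if (-1 == n && (EAGAIN == errno || EWOULDBLOCK == errno))
		return 0;

	return -1;
}

/* Simulate an accepted connection with a socketpair. If peer_sends is 0,
 * the peer stays silent, like an agent on a broken mobile link. */
static int demo_peek(int peer_sends, char *first)
{
	int sv[2], ret;

	if (0 != socketpair(AF_UNIX, SOCK_STREAM, 0, sv))
		return -1;

	if (0 != peer_sends && 1 != write(sv[1], "Z", 1))
	{
		close(sv[0]);
		close(sv[1]);
		return -1;
	}

	ret = peek_first_byte(sv[0], 1, first);

	close(sv[0]);
	close(sv[1]);

	return ret;
}
```

With a silent peer, demo_peek(0, &c) returns 0 after about one second instead of blocking forever. If the set_recv_timeout() call were removed, the same recv() would hang just like the recvfrom() in the strace output above.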
In my case, we have many Zabbix agents on devices connected over mobile broadband networks, which tend to have unreliable connections. I get a new stuck connection about once or twice per hour, so the Zabbix server ends up "crashing" once or twice per day.