[ZBX-10530] zbx_tcp_accept() does read() without receive timeout Created: 2016 Mar 13  Updated: 2017 May 30  Resolved: 2016 Mar 31

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Server (S)
Affects Version/s: 3.0.1
Fix Version/s: 3.0.2rc1, 3.2.0alpha1

Type: Incident report
Priority: Major
Reporter: Anssi Kolehmainen
Assignee: Unassigned
Resolution: Fixed
Votes: 3
Labels: encryption, timeout
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Linux (Debian/testing)
Attachments: fix-timeout-in-accept.patch
Issue Links:
Duplicate
is duplicated by ZBX-10469 Zabbix connection with proxies hang Closed
is duplicated by ZBX-10586 Trapper Problem Closed
is duplicated by ZBX-10595 Zabbix server hungs Closed

 Description   

I updated to Zabbix 3.0.1 (from 2.4.7), and after a random amount of time Zabbix mostly stops working. Internal and simple checks still work, but no incoming data (i.e. trapper items) is processed.

Netstat shows that trapper TCP connections never time out. Strace shows that the recvfrom(9<TCP:[server:10051->agent:55098]>, "Z", 1, MSG_PEEK, NULL, NULL) call never finishes (until a manual SIGALRM or similar).

trapper_thread() basically calls zbx_tcp_accept() followed by zbx_tcp_recv_to(), but only the latter calls zbx_socket_timeout_set().

Since TLS support was added, zbx_tcp_accept() does a MSG_PEEK read of the first byte. That read happens before any socket timeout is set, so it can wait indefinitely.
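To illustrate the difference between an unbounded and a bounded peek, here is a minimal sketch using plain POSIX sockets. It is not Zabbix code: set_recv_timeout() and peek_first_byte() are hypothetical helpers, and SO_RCVTIMEO is only one possible way to bound the read (it is an assumption, not a statement about how zbx_socket_timeout_set() works internally).

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not a Zabbix function): bound every later recv()
 * on fd to timeout_sec seconds by setting SO_RCVTIMEO. */
static int	set_recv_timeout(int fd, int timeout_sec)
{
	struct timeval	tv;

	tv.tv_sec = timeout_sec;
	tv.tv_usec = 0;

	return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}

/* Hypothetical helper: look at the first byte without consuming it.
 * Returns 1 if a byte is available, 0 if the peer closed the connection,
 * -1 on error. With a receive timeout set, an idle peer yields -1 with
 * errno EAGAIN or EWOULDBLOCK. Without one, this call can block forever. */
static int	peek_first_byte(int fd, char *out)
{
	return (int)recv(fd, out, 1, MSG_PEEK);
}
```

The order is what matters: calling set_recv_timeout() on the accepted socket before peek_first_byte() means a peer that connects and then goes silent costs at most timeout_sec seconds instead of holding the trapper indefinitely.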

In my case we have lots of Zabbix agents on devices connected over mobile broadband networks, which tend to have unreliable connections. I get a new stuck connection maybe once or twice per hour, so this leads to Zabbix server "crashing" once or twice per day.



 Comments   
Comment by Anssi Kolehmainen [ 2016 Mar 13 ]

Trivial fix for this issue: move zbx_socket_timeout_set() a few lines earlier, before the blocking read() call. No hung sockets after 4 hours, and otherwise it seems to be working fine.

Comment by richlv [ 2016 Mar 14 ]

ZBX-10469 might be the same problem

Comment by orogor [ 2016 Mar 18 ]

got the same issue on proxy

Comment by Aleksandrs Saveljevs [ 2016 Mar 22 ]

Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBX-10530 .

Comment by Andris Zeila [ 2016 Mar 23 ]

Successfully tested.

(1) One note: maybe instead of the "from %s: recv() peek failed: %s" error message we should give a less technical one, such as "reading first byte from connection failed" or something like that. On the other hand, I'm not sure if this error message can reach the frontend.

asaveljevs Added your message in r59185. RESOLVED.

wiper Looks good. CLOSED.

Comment by richlv [ 2016 Mar 28 ]

user-friendlier messages are always a good idea, whether they end up in the frontend, server log or elsewhere

Comment by Aleksandrs Saveljevs [ 2016 Mar 30 ]

Fixed in pre-3.0.2rc1 r59193 and pre-3.1.0 (trunk) r59194.

Comment by Diego e Silva de Souza [ 2016 Jul 11 ]

How and where do I apply this patch?

Can I correct this issue in a zabbix proxy installed via rpm package?

Comment by orogor [ 2016 Jul 11 ]

most likely, just update proxies
Affects Version/s:
3.0.1
Fix Version/s:
3.0.2rc1, 3.1.0 (trunk)
