[ZBX-10851] Default timeout for active agent connect phase may result in a mild DoS of server/proxy over slow networks Created: 2016 May 27  Updated: 2024 Apr 10  Resolved: 2019 Apr 04

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Agent (G), Proxy (P), Server (S)
Affects Version/s: 3.2.0alpha1
Fix Version/s: 4.4 (plan)

Type: Problem report Priority: Major
Reporter: Glebs Ivanovskis (Inactive) Assignee: Andris Zeila
Resolution: Won't fix Votes: 1
Labels: network, tcp
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by ZBX-15657 Zabbix agent cannot connect to Zabbix... Closed
Team: Team A
Sprint: Sprint 50 (Mar 2019), Sprint 51 (Apr 2019)
Story Points: 0

 Description   

Here is how agent-server/proxy communication looks on TCP implementation level:

Time | Zabbix agent | client TCP layer | server TCP layer | Zabbix server/proxy
t=0 | sets alarm() and calls connect() | sends SYN, changes connection state to SYN_SENT | connection in LISTEN state | in accept() call
t=1s | ... | re-sends SYN | ... | ...
t=3s | ... | re-sends SYN | ... | ...
t=3+s | gets SIGALRM and aborts connect() | changes connection state to CLOSED | ... | ...
t=? | ... | ... | receives SYN, responds with SYN/ACK, changes connection status to SYN_RECV | ...
t=? | ... | ignores received SYN/ACK | attempts several SYN/ACK retransmissions and finally (after some time) changes connection status to CLOSED | ...

If the round-trip time is over 3 seconds (or the first SYN is lost and the RTT is over 2 seconds, or the second SYN is lost too), the server/proxy will never receive an ACK and will end up with a long-lived "half-open" connection. With enough active agents, the connection queue will fill up and make the server completely unreachable.
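The conditions above can be checked with a small model. This is only a sketch, not Zabbix code: `connect_outcome`, `SYN_TIMES` and the `lost` parameter are hypothetical names. The SYN send times 0 s, 1 s and 3 s come from the table, the 7 s entry assumes the retransmission interval keeps doubling, and SYN/ACK packets are assumed never to be lost.

```python
# Assumed SYN send times in seconds: 0, 1 and 3 come from the table above;
# 7 assumes the retransmission interval keeps doubling.
SYN_TIMES = (0.0, 1.0, 3.0, 7.0)


def connect_outcome(timeout, rtt, lost=()):
    """Classify one connect() attempt under a simplified model.

    timeout -- seconds until the agent gets SIGALRM and aborts connect()
    rtt     -- round-trip time in seconds (assumed constant)
    lost    -- indexes into SYN_TIMES of SYNs that never reach the server
    """
    # SYNs that were sent before the abort and actually reach the server.
    delivered = [
        sent_at
        for index, sent_at in enumerate(SYN_TIMES)
        if sent_at <= timeout and index not in lost
    ]
    if not delivered:
        return "no-contact"   # server never sees this attempt
    if min(delivered) + rtt < timeout:
        return "established"  # SYN/ACK arrives before the agent gives up
    return "half-open"        # server is left in SYN_RECV with no ACK coming


if __name__ == "__main__":
    print(connect_outcome(3, 3.5))               # RTT over 3 seconds
    print(connect_outcome(3, 2.5, lost={0}))     # first SYN lost, RTT over 2 seconds
    print(connect_outcome(3, 0.2, lost={0, 1}))  # first two SYNs lost
```

Under these assumptions all three calls report `half-open`, matching the three cases listed in the paragraph above.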

The problem is that the default 3-second timeout interacts with the TCP retransmission strategy in a very destructive fashion. When the "half-open" connection queue is full, incoming SYN packets are simply dropped, which makes it very likely that the third SYN becomes "the one". And since the server has virtually no time to respond to it before the agent aborts the connection, recovery is very difficult (if possible at all) even after the network gets back to normal.



 Comments   
Comment by Glebs Ivanovskis (Inactive) [ 2016 Aug 17 ]

ZBX-7142 may be distantly related.

Comment by Andris Zeila [ 2019 Mar 26 ]

I'm not sure what to do about it. In theory any timeout we pick might lead to the described problems with sufficiently large roundtrip time.

wiper: After more contemplation closing this as WONTFIX.

Comment by Glebs Ivanovskis [ 2019 Apr 08 ]

"I'm not sure what to do about it."

Hmm... Write a few formulas and choose a timeout which maximizes your chances of getting an established connection while minimizing the chances of ending up with a half-open one? My mathematical intuition suggests that 2 or 4 seconds instead of 3 would reduce the probability of ending up with a half-open connection.
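One possible way to "write a few formulas" is sketched below. It is a toy model, not a measurement: `outcome_probabilities` is a hypothetical helper, each SYN is assumed to be lost independently with probability `p_loss`, the RTT is assumed constant, SYN/ACK loss is ignored, and the SYN send times (0 s, 1 s, 3 s) are taken from the table in the description.

```python
from itertools import product

# SYN send times in seconds, as listed in the table in the description.
SYN_TIMES = (0.0, 1.0, 3.0)


def outcome_probabilities(timeout, rtt, p_loss):
    """Return the probability of each outcome for one connect() attempt.

    Assumes independent SYN loss with probability p_loss, a constant rtt,
    and no SYN/ACK loss. All times are in seconds.
    """
    sent = [s for s in SYN_TIMES if s <= timeout]
    probs = {"established": 0.0, "half-open": 0.0, "no-contact": 0.0}
    # Enumerate every lost/delivered pattern for the SYNs that were sent.
    for pattern in product((True, False), repeat=len(sent)):  # True = lost
        weight = 1.0
        for lost in pattern:
            weight *= p_loss if lost else 1.0 - p_loss
        delivered = [s for s, lost in zip(sent, pattern) if not lost]
        if not delivered:
            key = "no-contact"
        elif min(delivered) + rtt < timeout:
            key = "established"
        else:
            key = "half-open"
        probs[key] += weight
    return probs


if __name__ == "__main__":
    for timeout in (2, 3, 4):
        print(timeout, outcome_probabilities(timeout, rtt=0.2, p_loss=0.1))
```

With a 0.2 s RTT and 10% SYN loss, this model gives a half-open probability of about 0.009 for a 3-second timeout and zero for 2 or 4 seconds, because only the 3-second timeout sends a SYN it cannot wait for. Larger RTTs shift the picture, so the numbers illustrate the intuition under these assumptions and nothing more.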

Generated at Wed Jul 16 09:50:03 EEST 2025 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.