Here is how agent-server/proxy communication looks at the TCP implementation level:
|Time||Zabbix agent||client TCP layer||server TCP layer||Zabbix server/proxy|
|t=0||sets alarm() and calls connect()||sends SYN, changes connection state to SYN_SENT||connection in LISTEN state||in accept() call|
|t=3+s||gets SIGALRM and aborts connect()||changes connection state to CLOSED||...||...|
|t=?||...||...||receives SYN, responds with SYN/ACK, changes connection state to SYN_RECV||...|
|t=?||...||ignores received SYN/ACK||attempts several SYN/ACK retransmissions and finally (after some time) changes connection state to CLOSED||...|
If the round-trip time is over 3 seconds (or the first SYN is lost and the RTT is over 2 seconds, or the second SYN is lost too), the server/proxy will never get an ACK response and will end up with a long-lived "half-open" connection. If there are enough active agents, the connection queue will fill up and make the server completely unreachable.
The problem is that the default 3-second timeout interacts with the TCP retransmission strategy in a very destructive fashion. When the "half-open" connection queue is full, incoming SYN packets are simply dropped, which makes it very likely that the third SYN will become "the one". And since the server has virtually no time to respond to it before the agent aborts the connection, recovery is very difficult (if possible at all) even after the network gets back to normal.
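
The timing argument can be checked with a few lines of arithmetic. The sketch below is only a simplified model, not Zabbix code: it assumes the client retransmits SYNs with an initial retransmission timeout of 1 second that doubles each time (the value recommended by RFC 6298, which gives SYNs at t=0, 1 and 3 seconds and matches the numbers above), a constant RTT, and no other delays. The function names `syn_send_times` and `handshake_completes` are made up for illustration.

```python
def syn_send_times(max_syns=3, initial_rto=1.0):
    """Times (seconds after connect()) at which the client sends each SYN.

    Assumes exponential backoff: the retransmission timeout doubles
    after every unanswered SYN.
    """
    times, t, rto = [], 0.0, initial_rto
    for _ in range(max_syns):
        times.append(t)
        t += rto
        rto *= 2
    return times


def handshake_completes(rtt, lost_syns=0, timeout=3.0, initial_rto=1.0):
    """True if a SYN/ACK can reach the client before the agent's alarm fires.

    rtt       -- round-trip time in seconds (assumed constant)
    lost_syns -- how many leading SYNs are lost or dropped by the server
    timeout   -- agent timeout in seconds (3 is the default discussed above)
    """
    sends = syn_send_times(lost_syns + 1, initial_rto)
    first_delivered = sends[lost_syns]    # first SYN the server actually sees
    synack_arrival = first_delivered + rtt
    return synack_arrival < timeout


if __name__ == "__main__":
    print(syn_send_times())                          # SYNs at 0, 1 and 3 s
    print(handshake_completes(rtt=0.2))              # healthy network
    print(handshake_completes(rtt=3.5))              # RTT over 3 s
    print(handshake_completes(rtt=2.5, lost_syns=1)) # first SYN lost, RTT over 2 s
    print(handshake_completes(rtt=0.01, lost_syns=2))# only the third SYN gets through
```

In this model the last case fails for any positive RTT: the third SYN leaves at the same moment the agent gives up, so every SYN accepted from a full queue tends to produce another half-open connection rather than a completed handshake.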