  1. ZABBIX BUGS AND ISSUES
  2. ZBX-10851

Default timeout for active agent connect phase may result in a mild DoS of server/proxy over slow networks

    • Sprint 50 (Mar 2019), Sprint 51 (Apr 2019)

      Here is how agent-to-server/proxy communication looks at the TCP implementation level:

      Time | Zabbix agent | client TCP layer | server TCP layer | Zabbix server/proxy
      t=0 | sets alarm() and calls connect() | sends SYN, changes connection state to SYN_SENT | connection in LISTEN state | in accept() call
      t=1s | ... | re-sends SYN | ... | ...
      t=3s | ... | re-sends SYN | ... | ...
      t=3+s | gets SIGALRM and aborts connect() | changes connection state to CLOSED | ... | ...
      t=? | ... | ... | receives SYN, responds with SYN/ACK, changes connection status to SYN_RECV | ...
      t=? | ... | ignores received SYN/ACK | attempts several SYN/ACK retransmissions and finally (after some time) changes connection status to CLOSED | ...

      If the round-trip time is over 3 seconds (or the first SYN gets lost and the RTT is over 2 seconds, or the second SYN gets lost too), the server/proxy will never get an ACK response and will end up with a long-lived "half-open" connection. If the number of active agents is large enough, the connection queue will fill up and make the server completely unreachable.
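The timing conditions above can be checked with a small model. This is only a sketch, not Zabbix code: the function name and parameters are hypothetical, the SYN send times (0 s, 1 s, 3 s) are taken from the table in this report, and the RTT is assumed to be constant. A handshake is counted as successful if a SYN/ACK reaches the agent before connect() is aborted.

```python
def handshake_completes(rtt, timeout=3.0, syn_times=(0.0, 1.0, 3.0), lost=()):
    """Return True if some SYN/ACK reaches the agent before connect() is aborted.

    rtt       -- round-trip time in seconds (assumed constant)
    timeout   -- agent-side connect timeout in seconds (3 s in this report)
    syn_times -- when the client TCP layer sends each SYN (from the table)
    lost      -- indexes of SYN packets that never reach the server
    """
    for index, sent_at in enumerate(syn_times):
        if index in lost:
            continue  # this SYN was lost, the server never answers it
        # The SYN/ACK answering this SYN is back at the agent after one RTT.
        if sent_at + rtt < timeout:
            return True
    return False


if __name__ == "__main__":
    print(handshake_completes(rtt=0.5))                 # fast network
    print(handshake_completes(rtt=3.5))                 # RTT over 3 seconds
    print(handshake_completes(rtt=2.5, lost={0}))       # first SYN lost, RTT over 2 seconds
    print(handshake_completes(rtt=0.1, lost={0, 1}))    # first two SYNs lost
```

In every failing case the server still receives at least one SYN after the agent has given up, which is what leaves the "half-open" entry behind.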

      The problem is that the default 3-second timeout interacts with the TCP retransmission strategy in a very destructive fashion. When the "half-open" connection queue is full, incoming SYN packets are simply dropped, which makes it very likely that the third SYN becomes "the one" that is accepted. And since the server has virtually no time to respond to it before the agent aborts the connection, recovery is very difficult (if possible at all), even if the network gets back to normal.
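A rough estimate of how many agents are "enough" to fill the queue can be made with Little's law (occupancy = arrival rate × lifetime). The numbers below are assumptions for illustration, not values from this report: each failed attempt is assumed to leave exactly one half-open entry, the half-open lifetime of 63 seconds is assumed to match a server that retransmits SYN/ACK with exponential backoff five times, and the backlog size and agent retry interval are hypothetical.

```python
def half_open_occupancy(agents, attempt_interval_s, half_open_lifetime_s):
    """Steady-state number of half-open entries (Little's law: rate * lifetime)."""
    arrival_rate = agents / attempt_interval_s  # failed attempts per second
    return arrival_rate * half_open_lifetime_s


def queue_saturated(agents, attempt_interval_s, half_open_lifetime_s, backlog):
    """True if the expected number of half-open entries reaches the backlog size."""
    occupancy = half_open_occupancy(agents, attempt_interval_s, half_open_lifetime_s)
    return occupancy >= backlog


if __name__ == "__main__":
    # Hypothetical setup: 500 agents retrying once a minute, 63 s lifetime, backlog of 128.
    print(half_open_occupancy(500, 60, 63))
    print(queue_saturated(500, 60, 63, 128))
```

Under these assumed numbers a few hundred agents behind a slow link are already enough to keep the queue full, and a longer half-open lifetime or a shorter retry interval lowers that threshold further.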

            wiper Andris Zeila
            glebs.ivanovskis Glebs Ivanovskis (Inactive)
            Team A