Zabbix Proxy trappers fail to gracefully handle dropped inbound connections, causing ~300s process hangs and FIN_WAIT1 pileups

    • Type: Incident report
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version/s: 7.0.26
    • Component/s: None

      We are seeing a critical issue where Zabbix Active Proxy trapper processes hang for roughly 300 seconds (~300.xxx s) whenever an inbound connection is silently dropped by the network or load balancer (in our case, the Docker Swarm ingress mesh). Several trapper processes get tied up at the same time, which degrades monitoring capacity.

      Symptoms & Network Diagnosis: Inside the proxy container, netstat shows inbound connections (from the ingress endpoint to the proxy on port 10051) stuck in the FIN_WAIT1 state with Send-Q=1. The proxy's FIN packet is never ACKed. The outbound proxy-to-server flow, however, remains completely healthy.
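
The netstat check can also be scripted for repeated sampling. The following is a minimal sketch, assuming a Linux proxy container where /proc/net/tcp is readable and that its tx_queue column corresponds to the Send-Q value netstat prints. The function name stuck_fin_wait1 is hypothetical and not part of Zabbix.

```python
# State code 04 in /proc/net/tcp is FIN_WAIT1 on Linux.
FIN_WAIT1 = "04"


def stuck_fin_wait1(proc_net_tcp_text, local_port=10051):
    """Return (remote_addr_hex, tx_queue) for FIN_WAIT1 sockets on local_port.

    proc_net_tcp_text is the full text of /proc/net/tcp, header row included.
    Addresses are left in the kernel's hex form; only the port is decoded.
    """
    stuck = []
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue
        local, remote, state, queues = fields[1], fields[2], fields[3], fields[4]
        port = int(local.rsplit(":", 1)[1], 16)   # port is hex after the last ':'
        tx_queue = int(queues.split(":")[0], 16)  # field is tx_queue:rx_queue
        if state == FIN_WAIT1 and port == local_port:
            stuck.append((remote, tx_queue))
    return stuck
```

Feeding it the contents of /proc/net/tcp inside the proxy container at intervals should show whether the same remote endpoints stay in FIN_WAIT1 with a non-zero queue.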

      Essentially, the network mesh is blackholing the inbound connection, and the Zabbix trapper process does not handle the dead peer gracefully. It holds the connection open, waiting for bytes that will never arrive. The process stays blocked until it hits a ~300s timeout, at which point Zabbix attempts to close the socket, but the TCP state ends up stranded in FIN_WAIT1.

      Why we believe this is a Zabbix bug:

      1. Ineffective Timeout Enforcement: Even with TrapperTimeout verified as correctly configured in the environment, the trapper process still hangs for around 300 seconds. The application appears to ignore its own timeout logic when blocked in this specific socket read state, and the ~300s delay looks like a hardcoded or unhandled fallback.
      2. Lack of Robust Socket Handling (TCP Keepalives): Zabbix does not seem to use TCP Keepalives effectively on these inbound trapper sockets. If Keepalives were implemented and configurable for inbound proxy connections, Zabbix could detect the dead peer at the TCP level much sooner, terminate the connection, and free the trapper process instead of hanging for 5 minutes.
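
To illustrate the kind of keepalive setup we have in mind for point 2, here is a minimal sketch in Python rather than Zabbix's own C code. It assumes Linux socket option names. The helper enable_keepalive and the values 30/10/3 are hypothetical and are not existing Zabbix parameters.

```python
import socket


def enable_keepalive(sock, idle_s=30, interval_s=10, probes=3):
    """Enable TCP keepalive on a connected (for example, accepted) socket.

    With these illustrative values, a peer that stops answering is declared
    dead after roughly idle_s + interval_s * probes seconds of idleness.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The tuning options are platform-specific, so set only those that exist.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
```

Keepalive probes are only sent on an otherwise idle connection, so this would cover a trapper blocked on a read from a blackholed peer. Whether it also clears sockets already sitting in FIN_WAIT1 is something we have not verified.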

      Steps to Reproduce:

      1. Set up an Active Proxy behind a load balancer or ingress mesh that silently drops long-lived idle TCP connections (blackhole).
      2. Send inbound data to the proxy trapper.
      3. Observe the trapper processes in Zabbix internal checks or logs taking 300+ seconds to process.
      4. Run netstat on the proxy to observe sockets stuck in FIN_WAIT1.

      Expected Behavior: Zabbix trappers should not block indefinitely, or for 300 seconds, when an inbound TCP stream drops abruptly. The application should either enforce the configured TrapperTimeout strictly across all socket reads, or implement TCP Keepalives on inbound trapper sockets to detect and clear dead connections, preventing process exhaustion.
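
As an illustration of enforcing one timeout "strictly across all socket reads", the sketch below applies a single overall deadline to a multi-part read, so a peer that goes silent mid-message cannot hold the reader beyond the configured limit. It is written in Python for brevity and makes no claim about how Zabbix implements its reads. The name recv_with_deadline is hypothetical.

```python
import socket
import time


def recv_with_deadline(sock, nbytes, timeout_s):
    """Read exactly nbytes, or raise TimeoutError once timeout_s has elapsed in total."""
    deadline = time.monotonic() + timeout_s
    chunks = []
    remaining = nbytes
    while remaining > 0:
        left = deadline - time.monotonic()
        if left <= 0:
            raise TimeoutError("read deadline exceeded")
        # The per-call timeout shrinks with every loop, so partial reads
        # cannot restart the clock.
        sock.settimeout(left)
        try:
            chunk = sock.recv(remaining)
        except socket.timeout:
            raise TimeoutError("read deadline exceeded") from None
        if not chunk:
            raise ConnectionError("peer closed before sending all data")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)
```

The design point is that the deadline is computed once, before the first read, instead of being reapplied to each individual recv call.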

            Assignee:
            Zabbix Support Team
            Reporter:
            Lucas Frade
            Votes:
            0
            Watchers:
            2