ZABBIX BUGS AND ISSUES
ZBX-22306

zabbix_agent2 fails variously when reaching resource limits


    • Sprint 97 (Feb 2023), Sprint 98 (Mar 2023), Sprint 99 (Apr 2023)

      A while ago, when researching a bug in agent2, I reproduced it in a resource-constrained environment by simply running the agent under different ulimit configurations.

      After reproducing the original bug, I kept changing the resource configuration to see how else the agent would fail.

      The initial issue was with reaching the open file descriptor limit, so I kept going in that direction, and found that in environments with extremely low file descriptor limits, the agent fails to handle the conditions correctly and falls apart in various ways.

      The above test was run with file descriptor limits 3 < n < 1024, with 3 being the standard in-out-error triplet. When given only 10 file descriptors, the agent eventually panics inside bufio with a nil pointer dereference (SIGSEGV):

      $ ./src/go/bin/zabbix_agent2
      Starting Zabbix Agent 2 (6.4.0rc1)
      Zabbix Agent2 hostname: [Zabbix server]
      Press Ctrl+C to exit.
      panic: runtime error: invalid memory address or nil pointer dereference
      [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x6e9655]
      goroutine 34 [running]:
        /usr/lib/golang/src/bufio/scan.go:214 +0x855
        /home/jxl/scm/git/zabbix/zabbix/src/go/internal/agent/remotecontrol/remote.go:68 +0xe5
      created by zabbix.com/internal/agent/remotecontrol.(*Conn).Start
        /home/jxl/scm/git/zabbix/zabbix/src/go/internal/agent/remotecontrol/remote.go:80 +0x5c

      Similar unhandled failures occur with very low memory limits, but the socket failures usually occur sooner.

      I think we should fail consistently, and just report some resource allocation failure when any of these limits are hit.

      Also, in a few runs the process ran out of free fds and failed to open a socket, but kept retrying anyway, reporting the error each time. I don't think we do this in the server, proxy or the other agent: there, on a resource allocation failure, we report the issue and then terminate gracefully. It's anyone's guess when the resource may become available, if ever, so it's pointless to sit there spinning while waiting for it.

            vso Vladislavs Sokurenko
            jlambda Juris Lambda
            Team A