Type: Problem report
Resolution: Fixed
Priority: Major
Fix Version/s: 6.0.16rc1, 6.4.2rc1, 7.0.0alpha1, 7.0 (plan)
Affects Version/s: 6.4.0beta6
Component/s: Agent (G)
Labels:
None
Environment:
Fedora 37

Sprint:
Sprint 97 (Feb 2023), Sprint 98 (Mar 2023), Sprint 99 (Apr 2023)
Story Points:
1

A while ago, while researching a bug in agent2, I reproduced it in a resource-constrained environment by simply running the agent under different ulimit configurations.

After reproducing the original bug, I kept changing the resource configuration to see how else the agent would fail.

The initial issue was with reaching the open file descriptor limit, so I kept going in that direction, and found that in environments with extremely low file descriptor limits, the agent fails to handle the conditions correctly and falls apart in various ways.

The above test was run with file descriptor limits 3 < n < 1024, with 3 being the standard in-out-error triplet. When given only 10 file descriptors, we eventually let bufio read/write to a memory segment we don't own:

$ ./src/go/bin/zabbix_agent2
Starting Zabbix Agent 2 (6.4.0rc1)
Zabbix Agent2 hostname: [Zabbix server]
Press Ctrl+C to exit.

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x6e9655]
goroutine 34 [running]:
bufio.(*Scanner).Scan(0xc000398738)
  /usr/lib/golang/src/bufio/scan.go:214 +0x855
zabbix.com/internal/agent/remotecontrol.(*Conn).run(0xc000382000)
  /home/jxl/scm/git/zabbix/zabbix/src/go/internal/agent/remotecontrol/remote.go:68 +0xe5
created by zabbix.com/internal/agent/remotecontrol.(*Conn).Start
  /home/jxl/scm/git/zabbix/zabbix/src/go/internal/agent/remotecontrol/remote.go:80 +0x5c
$

Similar unhandled failures occur with very low memory limits, but the socket failures usually occur sooner.

I think we should fail consistently, and just report some resource allocation failure when any of these limits are hit.

Also, I had a few runs where the process would run out of free fds, thus failing to open a socket, but would repeatedly attempt to open one anyway, all the while reporting the error. I don't think we do this in the server, proxy or the other agent, where, in case of resource allocation failure, we report the issue and then gracefully terminate (as it's anyone's guess when the resource may become available, if ever, and it's pretty pointless to sit there spinning, waiting for that to happen).

Assignee: Vladislavs Sokurenko

Reporter: Juris Lambda (Inactive)

Team: Team A

Votes: 1

Watchers: 6

Created: 2023 Feb 06 15:07

Updated: 2024 Apr 10 16:53

Resolved: 2023 Apr 17 15:26
