[ZBX-22061] zabbix_agent2 crashes when hitting the open file descriptor limit Created: 2022 Dec 09  Updated: 2024 Apr 10  Resolved: 2023 Jan 23

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Agent (G)
Affects Version/s: 6.0.12, 6.2.6, 6.4.0beta4
Fix Version/s: 6.0.13rc1, 6.2.7rc1, 6.4.0beta6, 6.4 (plan)

Type: Problem report Priority: Critical
Reporter: Edgar Akhmetshin Assignee: Eriks Sneiders
Resolution: Fixed Votes: 2
Labels: systemd
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

RHEL 8.7
LTS 6.0
PgSQL 14.6
TSDB 2.8.1
OpenSSL 1.1.1 series
Agent2 6.0.14


Issue Links:
Duplicate
Sub-task
depends on ZBX-22069 zabbix_agent2 plugin socket changes o... Closed
Team: Team INT
Sprint: Sprint 96 (Jan 2023)
Story Points: 1

 Description   

Steps to reproduce:

  1. Follow: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/db/postgresql_agent2/README.md?at=refs%2Fheads%2Frelease%2F6.2
  2. Use mass update to change all Zabbix Agent items to Zabbix Agent (Active) items.

Result:
Problem 1: the open file limit is reached (the default for the systemd unit is 1024), which produces an obscure error in the agent log:

2022/12/09 11:58:45.245630 failed to read response for plugin PostgreSQL, failed to read type header, EOF

Problem 2: Zabbix Agent just keeps crashing with another weird error:

2022/12/09 12:25:04.260087 failed to clean up after plugins, operation not permitted

And when trying to start it again manually:

# zabbix_agent2 -c /etc/zabbix/zabbix_agent2.conf 
Starting Zabbix Agent 2 (6.0.12)
Zabbix Agent2 hostname: [Zabbix server]
Press Ctrl+C to exit.
panic: failed to obtain PID of dead child process: no child processes

goroutine 12 [running]:
main.listenOnPluginFail(0x0, {0xc0001798f0, 0x7})
    /tmp/build-rhel-8-x86_64.H7NajgV0/buildroot/BUILD/zabbix-6.0.12/src/go/cmd/zabbix_agent2/external_nix.go:96 +0x168
created by main.initExternalPlugin
    /tmp/build-rhel-8-x86_64.H7NajgV0/buildroot/BUILD/zabbix-6.0.12/src/go/cmd/zabbix_agent2/external.go:94 +0x115

The traceback also exposes the full build path of the official packages:

    /tmp/build-rhel-8-x86_64.H7NajgV0/buildroot/BUILD/zabbix-6.0.12/src/go/cmd/zabbix_agent2/

Expected:
No open file limit, or at least a note in the documentation that active checks require N file descriptors for X connections or X databases.

No crash when Zabbix Agent (Active) is used for the common items of the official template.



 Comments   
Comment by pfoo [ 2022 Dec 19 ]

I'm experiencing the same issue with the PostgreSQL and MongoDB plugins, even with no PostgreSQL/MongoDB templates configured:

failed to read response for plugin MongoDB, failed to read type header, EOF
failed to read response for plugin PostgreSQL, failed to read type header, EOF

 

However, when zabbix-agent2 refuses to (re)start, it always shows the same error (even with no plugin error):

monitor zabbix_agent2[3112]: Starting Zabbix Agent 2 (6.0.12)
monitor zabbix_agent2[3112]: Zabbix Agent2 hostname: [monitor]
monitor zabbix_agent2[3112]: Press Ctrl+C to exit.
monitor zabbix_agent2[3112]: panic: failed to obtain PID of dead child process: no child processes
monitor zabbix_agent2[3112]: goroutine 20 [running]:
monitor zabbix_agent2[3112]: main.listenOnPluginFail(0x0?, {0xc00017f970, 0xa})
monitor zabbix_agent2[3112]:         /tmp/build-debian-11-x86_64.4aNaMOyj/buildroot/zabbix-6.0.12/debian/tmp.build-sqlite3/src/go/cmd/zabbix_agent2/external_nix.go:96 +0x168
monitor zabbix_agent2[3112]: created by main.initExternalPlugin
monitor zabbix_agent2[3112]:         /tmp/build-debian-11-x86_64.4aNaMOyj/buildroot/zabbix-6.0.12/debian/tmp.build-sqlite3/src/go/cmd/zabbix_agent2/external.go:94 +0x110
monitor systemd[1]: zabbix-agent2.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
monitor systemd[1]: zabbix-agent2.service: Failed with result 'exit-code'. 

 

 

Comment by Andrey Tocko (Inactive) [ 2022 Dec 30 ]
  • The crash happens because runtime commands ("zabbix_agent2 -R [command]") rewrite the socket file with the permissions of the executing user, most likely root. When stopping, agent2 is then unable to close the socket file because of those permissions.
    Current workaround: run runtime commands as the user that runs agent2 (default: zabbix).
    If the socket file (default: /tmp/agent.plugin.sock) is already owned by root:
    chown zabbix:zabbix /tmp/agent.plugin.sock 

    Or change plugins socket location in agent2 config file.
    Watch ZBX-22069 for progress and details.

  • The open file limit error is handled by zabbix_agent2 and should not cause a crash once the agent has started, but it prevents data collection:
    2022/12/29 14:45:10.002406 check 'pgsql.replication.recovery_role["tcp://127.0.0.1:5432","zabbix","password"]' is not supported: Connection failed: failed to connect to host=127.0.0.1 user=zabbix database=postgres: dial error (dial tcp 127.0.0.1:5432: socket: too many open files). 

    Check the current limit of the process started by systemd:

    prlimit -n -p $(pidof zabbix_agent2)
    

    Count current file usage by process:

    lsof -p $(pidof zabbix_agent2) | sed 1d | wc -l
    

    Extend limits to 4096 or more:

    mkdir -p /etc/systemd/system/zabbix-agent2.service.d
    cat >/etc/systemd/system/zabbix-agent2.service.d/filelimit.conf <<EOF
    [Service]
    LimitNOFILE=4096
    EOF
    systemctl daemon-reload
    systemctl restart zabbix-agent2
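
The limit check and the usage count above can be combined into one report. This is a minimal sketch, assuming a Linux /proc filesystem and enough privileges to read the agent's /proc entries (root or the zabbix user); the function name fd_usage is hypothetical and not part of Zabbix:

```shell
#!/bin/sh
# fd_usage is a hypothetical helper, not part of Zabbix.
# It prints the number of open descriptors and the soft "open files"
# limit for one PID. Linux-only: relies on /proc/<pid>/fd and /proc/<pid>/limits.
fd_usage() {
    pid="$1"
    if [ ! -r "/proc/$pid/limits" ]; then
        echo "cannot read /proc/$pid/limits" >&2
        return 1
    fi
    # Each entry in /proc/<pid>/fd is one open descriptor.
    open=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
    # In the "Max open files" row, field 4 is the soft limit, field 5 the hard limit.
    soft=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits")
    echo "pid=$pid open=$open soft_limit=$soft"
}

# Intended use (pidof may return several PIDs, so loop over them):
#   for pid in $(pidof zabbix_agent2); do fd_usage "$pid"; done
```

If the open count is close to the soft limit, the "too many open files" errors shown above are to be expected.
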
Comment by Andrey Tocko (Inactive) [ 2022 Dec 30 ]

If systemd is not used, user limits are in effect.
Check the count of open files for the user running agent2. By default it is "zabbix":

lsof -u zabbix | wc -l

Check the open file limit set for the user. The default soft limit of open files for a user is 1024:

sudo -u zabbix sh -c 'ulimit -Sn'
sudo -u zabbix sh -c 'ulimit -Hn'

Most likely the zabbix user has already reached the allowed limit (agentd, server, proxy, java-gw), and starting an additional process gets stuck on this limit, which can result in a crash during agent startup.
To change the open file limit for a specific user, add a file with the following configuration, for example
/etc/security/limits.d/0-zabbix_agent2.conf

zabbix           soft    nofile          51200
zabbix           hard    nofile          51200 

It is better to adjust these values to the system/kernel, as values beyond the allowed range will be replaced with defaults.
A reboot is required.

Comment by Edgar Akhmetshin [ 2023 Jan 03 ]

Or change plugins socket location in agent2 config file.

Please modify the packages to use the default socket locations defined in the FHS guidelines:
https://refspecs.linuxfoundation.org/FHS_3.0/fhs/ch03s15.html

https://refspecs.linuxfoundation.org/FHS_3.0/fhs/ch03s18.html

Comment by Oleksii Zagorskyi [ 2023 Jan 17 ]

Just FYI, here is the more correct, official way to alter the unit file, which creates an additional override file.

Let's make sure what we have:

# systemctl cat zabbix-agent2
# /usr/lib/systemd/system/zabbix-agent2.service
[Unit]
Description=Zabbix Agent 2
After=syslog.target
After=network.target

[Service]
Environment="CONFFILE=/etc/zabbix/zabbix_agent2.conf"
EnvironmentFile=-/etc/sysconfig/zabbix-agent2
Type=simple
Restart=on-failure
PIDFile=/run/zabbix/zabbix_agent2.pid
KillMode=control-group
ExecStart=/usr/sbin/zabbix_agent2 -c $CONFFILE
ExecStop=/bin/kill -SIGTERM $MAINPID
RestartSec=10s
User=zabbix
Group=zabbix

[Install]
WantedBy=multi-user.target

Then we execute this command:

# systemctl edit zabbix-agent2

which will open the "vi" editor with empty content, where we should paste these 2 lines:

[Service]
LimitNOFILE=8192

Save the changes and exit.

Now let's check the unit file again:

# systemctl cat zabbix-agent2
# /usr/lib/systemd/system/zabbix-agent2.service
[Unit]
Description=Zabbix Agent 2
After=syslog.target
After=network.target

[Service]
Environment="CONFFILE=/etc/zabbix/zabbix_agent2.conf"
EnvironmentFile=-/etc/sysconfig/zabbix-agent2
Type=simple
Restart=on-failure
PIDFile=/run/zabbix/zabbix_agent2.pid
KillMode=control-group
ExecStart=/usr/sbin/zabbix_agent2 -c $CONFFILE
ExecStop=/bin/kill -SIGTERM $MAINPID
RestartSec=10s
User=zabbix
Group=zabbix

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/zabbix-agent2.service.d/override.conf
[Service]
LimitNOFILE=8192

Now, at the end of the output, we see that the override file has been created (with our 2 lines), and systemd will read it every time it works with the "zabbix-agent2" unit.
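
Note that systemctl edit reloads the unit configuration but does not restart the service, so an already running agent keeps its old limit until it is restarted. The following is a small sketch for confirming the configured value after the restart; the helper name check_nofile is hypothetical (not part of systemd or Zabbix), and 8192 is simply the value used in the example above:

```shell
#!/bin/sh
# check_nofile is a hypothetical helper, not part of systemd or Zabbix.
# It reads one "LimitNOFILE=<value>" line on stdin and compares the value
# with the expected limit passed as the first argument.
check_nofile() {
    expected="$1"
    read -r line || return 2
    # Strip the property name, keeping only the value.
    actual="${line#LimitNOFILE=}"
    if [ "$actual" = "$expected" ]; then
        echo "ok: LimitNOFILE=$actual"
    else
        echo "mismatch: LimitNOFILE=$actual, expected $expected"
        return 1
    fi
}

# Intended use on a systemd host:
#   systemctl restart zabbix-agent2
#   systemctl show zabbix-agent2 --property=LimitNOFILE | check_nofile 8192
```
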

Comment by Juris Lambda [ 2023 Jan 18 ]

Hey, zalex_ua!

Note though, that override.conf is a local override configuration that the administrator gets to write upon systemctl edit ... As we "own" the main systemd service configuration (the package does), we should be declaring the file descriptor limit in that.

However, both examples, from atocko and zalex_ua, are valid for a system administrator to use for raising the limit themselves, say, in the case of a deployment of a previous version of the package.

If done on a small scale or for a single system, I'd follow zalex_ua's approach and let systemd create an override.conf for me. On a larger scale, I'd follow atocko's approach, and probably deploy an additional configuration file (named anything other than override.conf) to the service configuration directory.

Just noting this here in case anyone runs into this, can't upgrade the package, and needs a workaround.

Comment by Eriks Sneiders [ 2023 Jan 20 ]

Fixed in:

Zabbix agent 2

Zabbix PostgreSQL plugin

Generated at Sat May 10 08:01:58 EEST 2025 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.