[ZBX-26650] Zabbix Agent 2 Corrupts RPM Database via system.sw.packages.get Created: 2025 Jul 03 Updated: 2025 Sep 10 |
|
| Status: | Confirmed |
| Project: | ZABBIX BUGS AND ISSUES |
| Component/s: | Agent2 (G) |
| Affects Version/s: | 7.0.15 |
| Fix Version/s: | None |
| Type: | Problem report | Priority: | Trivial |
| Reporter: | Jan Prusinowski (Inactive) | Assignee: | Zabbix Development Team |
| Resolution: | Unresolved | Votes: | 1 |
| Labels: | rhel, rpm | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Sprint: | Support backlog | ||||
| Description |
|
The client is facing a critical issue on multiple RHEL 8 systems where the RPM database becomes corrupted after the zabbix_agent2 process executes the system.sw.packages.get item. While investigating the root cause, the client followed a Red Hat knowledge base article. The kernel audit logs confirmed that zabbix_agent2 (running as root) sends SIGKILL signals to the rpm processes executing rpm -qa --queryformat ..., which leads to corruption of the RPM database (/var/lib/rpm).
{{sys.kill: zabbix_agent2(pid:1070317) called kill(2710675, SIGKILL)
sig.send: SIGKILL was sent to rpm (pid:2710675) by uid:0
kprocess.exit: rpm(pid:2710675) - Code 9 - "rpm -qa --queryformat ..."}}
Once the agent had killed the rpm processes, the client observed immediate database corruption:
{{error: db5 error(-30973) from dbenv->failchk: BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery
error: cannot open Packages database in /var/lib/rpm}}
Simultaneously, Zabbix Agent 2 logs showed:
{{check 'system.sw.packages.get' is not supported: Timeout occurred while gathering data.
[Sw] Failed to execute command 'rpm -qa', err: Command execution failed: context deadline exceeded.}}
Important context: The client has decided to run zabbix_agent2 as root in their environment to ensure access to all critical server components, including hardened or restricted filesystems. This was necessary for complete visibility.
Agent command line:
root 1070317 0.5 0.1 2229476 47392 ? Ssl Jun23 77:40 /usr/sbin/zabbix_agent2 -c /etc/zabbix/zabbix_agent2.conf
Other possibly helpful info:
[root@sldjde1221 zabbix]# zabbix_agent2 -V
zabbix_agent2 (Zabbix) 7.0.14
Revision ae76e5efee9 18 June 2025, compilation time: Jun 18 2025 12:12:38, built with: go1.24.1
Plugin communication protocol version is 6.4.0
Copyright (C) 2025 Zabbix SIA
License AGPLv3: GNU Affero General Public License version 3 <https://www.gnu.org/licenses/>.
This is free software: you are free to change and redistribute it according to the license.
There is NO WARRANTY, to the extent permitted by law.
This product includes software developed by the OpenSSL Project for use in the OpenSSL Toolkit (http://www.openssl.org/).
Compiled with OpenSSL 1.1.1k FIPS 25 Mar 2021
Running with OpenSSL 1.1.1k FIPS 25 Mar 2021
[root@sldjde1221 zabbix]# time rpm -qa --queryformat '%{NAME},%{VERSION}-%{RELEASE},%{ARCH},%{SIZE},%{BUILDTIME},%{INSTALLTIME}\n' > /dev/null
real 0m1.479s
user 0m1.369s
sys 0m0.105s
Agent configuration:
# Ansible managed: agent.conf.j2 modified on 2025-04-30 12:04:13 by root on automation-job-146954-ggqbb
#
# This is a configuration file for Zabbix Agent 2
# To get more information about Zabbix, visit http://www.zabbix.com
# This configuration file is "minimalized", which means all the original comments
# are removed. The full documentation for your Zabbix Agent 2 can be found here:
# https://www.zabbix.com/documentation/7.0/en/manual/appendix/config/zabbix_agent2
# Alias=
# AllowKey=
BufferSend=5
BufferSize=100
ControlSocket=/tmp/agent.sock
DebugLevel=3
# DenyKey=
EnablePersistentBuffer=0
ForceActiveChecksOnStart=0
HeartbeatFrequency=60
# HostInterface=
# HostInterfaceItem=
# HostMetadata=
# HostMetadataItem=
Hostname=sldjde1221
# HostnameItem=
Include=/etc/zabbix/zabbix_agent2.d/*.conf
Include=/etc/zabbix/zabbix_agent2.d/plugins.d/*.conf
ListenIP=172.21.14.57
ListenPort=10050
LogFile=/var/log/zabbix/zabbix_agent2.log
LogFileSize=100
LogType=file
# PersistentBufferFile=
PersistentBufferPeriod=1h
PidFile=/var/run/zabbix/zabbix_agent2.pid
PluginSocket=/tmp/agent.plugin.sock
PluginTimeout=3
RefreshActiveChecks=120
Server=172.16.238.0/24,172.16.19.0/24
ServerActive=zabbix-np.REDACTED,slqzbx0366.REDACTED,slqzbx0367.REDACTED,slqzbx0368.REDACTED
# SourceIP=
# StatusPort=
Timeout=3
TLSAccept=psk
# TLSCAFile=
# TLSCertFile=
TLSConnect=psk
# TLSCRLFile=
# TLSKeyFile=
TLSPSKFile=/etc/zabbix/zabbix_agent.psk
TLSPSKIdentity=saqzabbixagent
# TLSServerCertIssuer=
# TLSServerCertSubject=
UnsafeUserParameters=0
# UserParameter=
# UserParameterDir=0
The zabbix-agent2 package was installed using Ansible automation from the official Zabbix repository via dnf: zabbix-agent2-7.0.14-release1.el8.x86_64
The host is running Red Hat Enterprise Linux release 8.10 (Ootpa), kernel 4.18.0-553.56.1.el8_10.x86_64.
The client is not exactly sure when the issue began, as the agent was operating normally for some time.
However, the client recently started rolling out OS updates across several systems and noticed that the RPM database was corrupted on multiple hosts. After further investigation, including analysis with Red Hat Support, the client identified that the root cause was linked to zabbix-agent2 executing the system.sw.packages.get item. The agent times out (context deadline exceeded), sends SIGKILL to the rpm process, and this results in RPM database corruption (DB_RUNRECOVERY). |
| Comments |
| Comment by Jan Prusinowski (Inactive) [ 2025 Jul 09 ] |
|
Steps to reproduce:
Downloaded RHEL 8.0 at: https://access.redhat.com/downloads/content/479/ver=/rhel---8/8.0/x86_64/product-software
Agent settings:
Server=192.168.50.0/24
Hostname=Zabbix agent 2 RHEL
Allowed connection in firewall:
sudo firewall-cmd --add-port=10050/tcp --permanent
sudo firewall-cmd --reload
Zabbix server - 7.0.16 running on
After testing the item, everything seemed to be OK.
To speed up reproduction, I ran:
mv /usr/bin/rpm /usr/bin/rpm.real
echo -e '#!/bin/bash\nsleep 10\nexec /usr/bin/rpm.real "$@"' > /usr/bin/rpm
chmod +x /usr/bin/rpm
And from the Zabbix server, repeatedly:
watch -n 4 "zabbix_get -s 192.168.50.232 -k system.sw.packages.get"
On the host, repeatedly:
watch -n 5 "rpm -qa"
I'm leaving this running overnight to check whether the corruption occurs. |
| Comment by Jan Prusinowski (Inactive) [ 2025 Jul 10 ] |
|
After about 13,000 iterations, the corruption did not occur. I will try to force timeouts to see whether this makes any difference. |
| Comment by Edgar Akhmetshin [ 2025 Jul 17 ] |
|
Why not send SIGTERM first for a clean stop, and fall back to SIGKILL only if the process does not respond? Why is SIGKILL the first option? |