Problem report
Resolution: Unresolved
Trivial
None
7.2.5
Almalinux 9.5
S25-W18/19, S25-W20/21
Our environment:
2 dedicated servers, 16 cores/64 GB RAM, running Zabbix server.
1 VM for database replication.
We are running a MariaDB SQL server with a Patroni layer in front of it.
Problem:
We experienced a deadlock on node 1, and HA switched over to node 2.
/var/log/zabbix/zabbix_server.log:2888270:20250428:054630.667 [Z3005] query failed: [1213] Deadlock found when trying to get lock; try restarting transaction [commit;]
/var/log/zabbix/zabbix_server.log:2888270:20250428:054630.667 ERROR: rollback without transaction. Please report it to Zabbix Team.
/var/log/zabbix/zabbix_server.log:2888270:20250428:054630.667 === Backtrace: ===
/var/log/zabbix/zabbix_server.log:2888270:20250428:054630.668 10: /usr/sbin/zabbix_server: ha manager(zbx_backtrace+0x41) [0x55c5d4b5cb71]
/var/log/zabbix/zabbix_server.log:2888270:20250428:054630.668 9: /usr/sbin/zabbix_server: ha manager(zbx_dbconn_rollback+0x10b) [0x55c5d4b44a6b]
/var/log/zabbix/zabbix_server.log:2888270:20250428:054630.668 8: /usr/sbin/zabbix_server: ha manager(+0x258aa4) [0x55c5d4990aa4]
/var/log/zabbix/zabbix_server.log:2888270:20250428:054630.668 7: /usr/sbin/zabbix_server: ha manager(ha_manager_thread+0x42a) [0x55c5d4992faa]
/var/log/zabbix/zabbix_server.log:2888270:20250428:054630.668 6: /usr/sbin/zabbix_server: ha manager(zbx_ha_start+0x6d) [0x55c5d49949ed]
/var/log/zabbix/zabbix_server.log:2888270:20250428:054630.668 5: /usr/sbin/zabbix_server: ha manager(MAIN_ZABBIX_ENTRY+0x9e8) [0x55c5d480cb28]
/var/log/zabbix/zabbix_server.log:2888270:20250428:054630.668 4: /usr/sbin/zabbix_server: ha manager(zbx_daemon_start+0x145) [0x55c5d4b5dc75]
/var/log/zabbix/zabbix_server.log:2888270:20250428:054630.668 3: /usr/sbin/zabbix_server: ha manager(main+0x3f5) [0x55c5d4801bb5]
/var/log/zabbix/zabbix_server.log:2888270:20250428:054630.668 2: /lib64/libc.so.6(+0x29590) [0x7fb1a3629590]
/var/log/zabbix/zabbix_server.log:2888270:20250428:054630.668 1: /lib64/libc.so.6(__libc_start_main+0x80) [0x7fb1a3629640]
/var/log/zabbix/zabbix_server.log:2888270:20250428:054630.668 0: /usr/sbin/zabbix_server: ha manager(_start+0x25) [0x55c5d4808f55]
Once the second node started running, we observed duplicate primary key errors:
307243:20250428:151039.021 query [txnlev:1] [insert into events (eventid,source,object,objectid,clock,ns,value,name,severity) values (222916608,0,0,3518246,1745845835,718767,1,'/var/log: Disk space is low (used > 80%)',2);.]
307243:20250428:151039.021 In dbconn_get_cached_nextid() table:'event_tag' num:12
307243:20250428:151039.021 End of dbconn_get_cached_nextid() table:'event_tag' [33167419:33167430]
307243:20250428:151039.021 query [txnlev:1] [insert into event_tag (eventtagid,eventid,tag,value) values (33167419,222916608,'scope','availability'),(33167420,222916608,'scope','capacity'),(33167421,222916608,'component','storage'),(33167422,222916608,'filesystem','/var/log'),(33167423,222916608,'discovery','vmware_custom'),(33167424,222916608,'env','TEST'),(33167425,222916608,'contract','P0124001096'),(33167426,222916608,'SLA','N-00-00-00-00'),(33167427,222916608,'path','path/path/path'),(33167428,222916608,'vm_id','vm-xxxxxx'),(33167429,222916608,'class','os'),(33167430,222916608,'target','linux');.]
307243:20250428:151039.021 [Z3008] query failed due to primary key constraint: [1062] Duplicate entry '33167419' for key 'PRIMARY'
The Zabbix server holds an ID in its cache that is already in use in the database:
MariaDB [zabbix]> select * from event_tag where eventtagid = 33167419;
+------------+-----------+--------+-----------+
| eventtagid | eventid   | tag    | value     |
+------------+-----------+--------+-----------+
|   33167419 | 226515294 | target | cisco-ios |
+------------+-----------+--------+-----------+
1 row in set (0.001 sec)
Highest IDs in the database at the time of writing:
MariaDB [zabbix]> select * from event_tag order by eventtagid desc limit 10;
+------------+-----------+-----------+--------------+
| eventtagid | eventid   | tag       | value        |
+------------+-----------+-----------+--------------+
|   33175870 | 222949120 | target    | generic      |
|   33175869 | 222949120 | class     | network      |
|   33175868 | 222949120 | component | network      |
|   33175867 | 222949120 | component | health       |
|   33175866 | 222949120 | scope     | performance  |
|   33175865 | 222949120 | scope     | availability |
|   33175864 | 222949100 | target    | generic      |
|   33175863 | 222949100 | class     | network      |
|   33175862 | 222949100 | component | network      |
|   33175861 | 222949100 | component | health       |
+------------+-----------+-----------+--------------+
10 rows in set (0.000 sec)
I traced the log message back to the code (https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/src/libs/zbxdb/dbmisc.c#109).
As far as I can tell from the code, the cached value is not compared against the database again once it has been cached.
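To illustrate the suspected mechanism, here is a minimal Python sketch. It is not Zabbix code: `CachedIdAllocator`, `insert_rows` and `simulate_failover` are hypothetical names, the ID values are made up, and the assumption that the cache is seeded once and never re-checked is only my reading of the source. The `reserve_checked` method shows one possible mitigation, not a proposed patch.

```python
class CachedIdAllocator:
    """Hypothetical allocator: seeds its counter once from the table's max ID."""

    def __init__(self, table_ids):
        self.table_ids = table_ids                # stands in for the DB table
        self.last_id = max(table_ids, default=0)  # cached once at startup

    def reserve(self, num):
        """Hand out `num` IDs from the cache without looking at the table."""
        first = self.last_id + 1
        self.last_id += num
        return first, self.last_id

    def reserve_checked(self, num):
        """Possible mitigation: re-sync with the table's max ID first."""
        self.last_id = max(self.last_id, max(self.table_ids, default=0))
        return self.reserve(num)


def insert_rows(table_ids, first, last):
    """Insert the ID range, failing like a primary key constraint would."""
    new_ids = set(range(first, last + 1))
    clash = new_ids & table_ids
    if clash:
        raise ValueError(f"Duplicate entry '{min(clash)}' for key 'PRIMARY'")
    table_ids |= new_ids


def simulate_failover():
    table = set(range(1, 101))            # table holds IDs 1..100
    standby = CachedIdAllocator(table)    # standby caches last_id = 100
    table |= set(range(101, 151))         # active node inserts 101..150

    stale = standby.reserve(12)           # standby takes over with a stale cache
    try:
        insert_rows(table, *stale)
        stale_error = None
    except ValueError as exc:
        stale_error = str(exc)

    fresh = standby.reserve_checked(12)   # re-synced range no longer collides
    insert_rows(table, *fresh)
    return stale, stale_error, fresh
```

In this sketch the stale reservation overlaps IDs that the other node already wrote, which matches the shape of the [Z3008] error above, while the re-synced reservation starts past the table's current maximum.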
Steps to reproduce:
- Set up an HA environment
- Let node 1 fail ungracefully, in a way it is not designed to handle
- Observe the errors on node 2
- In this case we increased the log level of several components
- Switching back to the other node with systemctl fixes the problem
Result:
Duplicate primary key errors.
Expected:
See attached log/text file for more debugging info.