[ZBX-15556] Zabbix-Server causing Database Deadlocks Created: 2019 Jan 30  Updated: 2019 Jan 30  Resolved: 2019 Jan 30

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Server (S)
Affects Version/s: None
Fix Version/s: None

Type: Incident report Priority: Critical
Reporter: Tyler French Assignee: Unassigned
Resolution: Commercial support required Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

CentOS 7 minimal, 10.2.8-MariaDB + Galera, NSX Load Balancer(Peristance: Source IP)



 Description   

We recently have been plauged by Deadlocks the zabbix database that appear to be caused by the server it self. Currently we cant bring the server up without it deadlocking the database shortly after. 

 

We are running Zabbix 3.4.15

Steps to reproduce:

  1. Start Zabbix-Server
  2. Wait ~10-15 mins

Result:

In the logs the first thing that seems to could have started the issue was:

121472:20190128:041311.376 [Z3005] query failed: [1047] WSREP has not yet prepared node for application use [select globalmacroid,macro,value from globalmacro]

Before that message are several slow query mesages like this:

121542:20190128:041310.270 slow query: 13.773135 sec, "begin;"
121479:20190128:041310.271 slow query: 27.404880 sec, "select h.hostid,h.host,h.name,t.httptestid,t.name,t.agent,t.authentication,t.http_user,t.http_password,t.http_proxy,t.retries,t.ssl_cert_file,t.ssl_key_file,t.ssl_key_password,t.verify_peer,t.verify_host,t.delay from httptest t,hosts h where t.hostid=h.hostid and t.nextcheck<=1548666762 and mod(t.httptestid,5)=1 and t.status=0 and h.proxy_hostid is null and h.status=0 and (h.maintenance_status=0 or h.maintenance_type=0)"

After the WSREP messages go away we started to see these messages start to pop up every few seconds:

121542:20190128:050657.904 slow query: 6.517110 sec, "commit;"
121546:20190128:050657.907 slow query: 6.511969 sec, "commit;"
121543:20190128:050657.923 slow query: 6.424848 sec, "commit;"
121493:20190128:050702.745 slow query: 3.202610 sec, "commit;"
121546:20190128:050706.736 slow query: 8.791040 sec, "commit;"
121550:20190128:050711.866 slow query: 12.031754 sec, "commit;"
121553:20190128:050713.459 slow query: 11.659751 sec, "commit;"
121551:20190128:050713.758 slow query: 11.008923 sec, "commit;"
121542:20190128:050713.760 slow query: 7.358562 sec, "update hosts set lastaccess=1548670025 where hostid=10757;
update hosts set lastaccess=1548670025 where hostid=10759;
update hosts set lastaccess=1548670025 where hostid=10761;
update hosts set lastaccess=1548670025 where hostid=10762;
update hosts set lastaccess=1548670026 where hostid=10765;
update hosts set lastaccess=1548670025 where hostid=11947;
update hosts set lastaccess=1548670025 where hostid=12067;
update hosts set lastaccess=1548670025 where hostid=12103;
update hosts set lastaccess=1548670025 where hostid=12128;
update hosts set lastaccess=1548670025 where hostid=12316;
"

Then the first Deadlock message showed up:

121542:20190128:050729.163 [Z3005] query failed: [1213] Deadlock found when trying to get lock; try restarting transaction [commit;]

After that we start to see several queries fail cause of the lock and then run again right after that as a slow query:

121549:20190128:050833.275 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [update hosts set lastaccess=1548670059 where hostid=10118;
update hosts set lastaccess=1548670061 where hostid=10757;
update hosts set lastaccess=1548670061 where hostid=10759;
update hosts set lastaccess=1548670056 where hostid=10761;
update hosts set lastaccess=1548670062 where hostid=10762;
update hosts set lastaccess=1548670061 where hostid=11947;
update hosts set lastaccess=1548670061 where hostid=11955;
update hosts set lastaccess=1548670061 where hostid=12103;
update hosts set lastaccess=1548670061 where hostid=12316;

After letting that continue for a while we will eventually see:

121546:20190128:051217.189 [Z3005] query failed: [2013] Lost connection to MySQL server during query [commit;]
121546:20190128:051217.189 slow query: 299.999295 sec, "commit;"
121546:20190128:051217.189 [Z3005] query failed: [2006] MySQL server has gone away [rollback;]
121546:20190128:051217.189 cannot perform transaction rollback, connection will be reset
121548:20190128:051219.297 [Z3005] query failed: [2013] Lost connection to MySQL server during query [commit;]
121548:20190128:051219.297 slow query: 299.999652 sec, "commit;"
121548:20190128:051219.297 [Z3005] query failed: [2006] MySQL server has gone away [rollback;]
121548:20190128:051219.297 cannot perform transaction rollback, connection will be reset
121547:20190128:051222.185 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [update hosts set lastaccess=1548670282 where hostid=10759;
update hosts set lastaccess=1548670291 where hostid=10762;
]

Which then a few seconds later will lead to:

121496:20190128:054229.627 [Z3001] connection to database 'zabbixdb' failed: [1040] Too many connections
121496:20190128:054229.627 Cannot connect to the database. Exiting...
121493:20190128:054229.628 [Z3001] connection to database 'zabbixdb' failed: [1040] Too many connections
121493:20190128:054229.628 Cannot connect to the database. Exiting...
121470:20190128:054229.642 One child process died (PID:121496,exitcode/signal:1). Exiting ...
121470:20190128:054231.688 syncing history data...

 



 Comments   
Comment by Alexey Pustovalov [ 2019 Jan 30 ]

Actually everything looks like you have Galera cluster + MySQL database is not tuned.

Comment by Tyler French [ 2019 Jan 30 ]

Our environment has 1 Zabbix-Server, 10 proxies and does ~1.2k NVPS. 

Comment by Alexey Pustovalov [ 2019 Jan 30 ]

Please be advised that this section of the tracker is for bug reports only.

Anyway... [1040] Too many connections and other queries quite clear show that you have performance issues + mistuned MySQL. The case you have submitted can not be qualified as one, so please reach out to [email protected] for commercial support or consultancy services. Alternatively, you can also use our IRC channel or community forum (https://www.zabbix.com/forum) for assistance. With that said, we are closing this ticket. Thank you for understanding.

Generated at Thu Apr 25 02:07:28 EEST 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.