Uploaded image for project: 'ZABBIX BUGS AND ISSUES'
  1. ZABBIX BUGS AND ISSUES
  2. ZBX-15556

Zabbix-Server causing Database Deadlocks

XMLWordPrintable

    • Icon: Incident report Incident report
    • Resolution: Commercial support required
    • Icon: Critical Critical
    • None
    • None
    • Server (S)
    • None
    • CentOS 7 minimal, 10.2.8-MariaDB + Galera, NSX Load Balancer(Peristance: Source IP)

      We recently have been plauged by Deadlocks the zabbix database that appear to be caused by the server it self. Currently we cant bring the server up without it deadlocking the database shortly after. 

       

      We are running Zabbix 3.4.15

      Steps to reproduce:

      1. Start Zabbix-Server
      2. Wait ~10-15 mins

      Result:

      In the logs the first thing that seems to could have started the issue was:

      121472:20190128:041311.376 [Z3005] query failed: [1047] WSREP has not yet prepared node for application use [select globalmacroid,macro,value from globalmacro]
      

      Before that message are several slow query mesages like this:

      121542:20190128:041310.270 slow query: 13.773135 sec, "begin;"
      121479:20190128:041310.271 slow query: 27.404880 sec, "select h.hostid,h.host,h.name,t.httptestid,t.name,t.agent,t.authentication,t.http_user,t.http_password,t.http_proxy,t.retries,t.ssl_cert_file,t.ssl_key_file,t.ssl_key_password,t.verify_peer,t.verify_host,t.delay from httptest t,hosts h where t.hostid=h.hostid and t.nextcheck<=1548666762 and mod(t.httptestid,5)=1 and t.status=0 and h.proxy_hostid is null and h.status=0 and (h.maintenance_status=0 or h.maintenance_type=0)"
      

      After the WSREP messages go away we started to see these messages start to pop up every few seconds:

      121542:20190128:050657.904 slow query: 6.517110 sec, "commit;"
      121546:20190128:050657.907 slow query: 6.511969 sec, "commit;"
      121543:20190128:050657.923 slow query: 6.424848 sec, "commit;"
      121493:20190128:050702.745 slow query: 3.202610 sec, "commit;"
      121546:20190128:050706.736 slow query: 8.791040 sec, "commit;"
      121550:20190128:050711.866 slow query: 12.031754 sec, "commit;"
      121553:20190128:050713.459 slow query: 11.659751 sec, "commit;"
      121551:20190128:050713.758 slow query: 11.008923 sec, "commit;"
      121542:20190128:050713.760 slow query: 7.358562 sec, "update hosts set lastaccess=1548670025 where hostid=10757;
      update hosts set lastaccess=1548670025 where hostid=10759;
      update hosts set lastaccess=1548670025 where hostid=10761;
      update hosts set lastaccess=1548670025 where hostid=10762;
      update hosts set lastaccess=1548670026 where hostid=10765;
      update hosts set lastaccess=1548670025 where hostid=11947;
      update hosts set lastaccess=1548670025 where hostid=12067;
      update hosts set lastaccess=1548670025 where hostid=12103;
      update hosts set lastaccess=1548670025 where hostid=12128;
      update hosts set lastaccess=1548670025 where hostid=12316;
      "
      

      Then the first Deadlock message showed up:

      121542:20190128:050729.163 [Z3005] query failed: [1213] Deadlock found when trying to get lock; try restarting transaction [commit;]
      

      After that we start to see several queries fail cause of the lock and then run again right after that as a slow query:

      121549:20190128:050833.275 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [update hosts set lastaccess=1548670059 where hostid=10118;
      update hosts set lastaccess=1548670061 where hostid=10757;
      update hosts set lastaccess=1548670061 where hostid=10759;
      update hosts set lastaccess=1548670056 where hostid=10761;
      update hosts set lastaccess=1548670062 where hostid=10762;
      update hosts set lastaccess=1548670061 where hostid=11947;
      update hosts set lastaccess=1548670061 where hostid=11955;
      update hosts set lastaccess=1548670061 where hostid=12103;
      update hosts set lastaccess=1548670061 where hostid=12316;
      

      After letting that continue for a while we will eventually see:

      121546:20190128:051217.189 [Z3005] query failed: [2013] Lost connection to MySQL server during query [commit;]
      121546:20190128:051217.189 slow query: 299.999295 sec, "commit;"
      121546:20190128:051217.189 [Z3005] query failed: [2006] MySQL server has gone away [rollback;]
      121546:20190128:051217.189 cannot perform transaction rollback, connection will be reset
      121548:20190128:051219.297 [Z3005] query failed: [2013] Lost connection to MySQL server during query [commit;]
      121548:20190128:051219.297 slow query: 299.999652 sec, "commit;"
      121548:20190128:051219.297 [Z3005] query failed: [2006] MySQL server has gone away [rollback;]
      121548:20190128:051219.297 cannot perform transaction rollback, connection will be reset
      121547:20190128:051222.185 [Z3005] query failed: [1205] Lock wait timeout exceeded; try restarting transaction [update hosts set lastaccess=1548670282 where hostid=10759;
      update hosts set lastaccess=1548670291 where hostid=10762;
      ]
      

      Which then a few seconds later will lead to:

      121496:20190128:054229.627 [Z3001] connection to database 'zabbixdb' failed: [1040] Too many connections
      121496:20190128:054229.627 Cannot connect to the database. Exiting...
      121493:20190128:054229.628 [Z3001] connection to database 'zabbixdb' failed: [1040] Too many connections
      121493:20190128:054229.628 Cannot connect to the database. Exiting...
      121470:20190128:054229.642 One child process died (PID:121496,exitcode/signal:1). Exiting ...
      121470:20190128:054231.688 syncing history data...

       

            Unassigned Unassigned
            frenchtoasters Tyler French
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

              Created:
              Updated:
              Resolved: