ZABBIX BUGS AND ISSUES / ZBX-24574

HA node flipping between standby and active states



      When Native HA is used together with Galera Cluster, the HA manager crashes during a failover and stops running. The node that should become active then constantly restarts its processes.

      After some time the node starts in normal mode. This can happen after 1-2 iterations or after 20-30 minutes; I could not find any pattern.

      The problem seems to be related to a large volume of configuration: I was not able to reproduce it in a test environment with a new Zabbix database, but it occurs every time on a server with about 5 million records in the items table.

      The problem does not depend on the number of nodes in the cluster. Even if there is only one node in the cluster and both servers use it directly as the database (without ProxySQL or similar), the problem persists.

      Steps to reproduce:

      1. Start both nodes
      2. Stop the active node
      3. Observe the standby node entering a loop:


      346080:20240531:130404.728 starting HA manager
      346080:20240531:130404.850 HA manager started in standby mode
      346079:20240531:130404.850 "StandBy" node started in "standby" mode
      346391:20240531:130725.204 server #294 started [trigger housekeeper #1]
      346079:20240531:130725.204 "StandBy" node switched to "standby" mode
      346393:20240531:130725.644 starting HA manager
      346393:20240531:130725.644 HA manager started in standby mode
      346079:20240531:130739.735 "StandBy" node switched to "active" mode
      346079:20240531:130739.740 server #0 started [main process]
      346395:20240531:130739.741 server #1 started [service manager #1] 
      346709:20240531:130900.371 server #295 started [odbc poller #1]
      346079:20240531:130900.371 "StandBy" node switched to "standby" mode
      346710:20240531:130900.819 starting HA manager
      346710:20240531:130900.820 HA manager started in standby mode
      346079:20240531:130914.910 "StandBy" node switched to "active" mode
      346079:20240531:130914.916 server #0 started [main process]
      346711:20240531:130914.916 server #1 started [service manager #1] 
      325719:20240531:101215.912 "StandBy" node switched to "standby" mode
      326036:20240531:101215.912 server #295 started [odbc poller #1]
      326037:20240531:101216.385 starting HA manager
      326037:20240531:101216.385 HA manager started in standby mode
      325719:20240531:101230.476 "StandBy" node switched to "active" mode
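
To measure the flipping in a longer log, the mode changes can be extracted with a short script. The following is a minimal sketch, not part of Zabbix: the helper names `LINE_RE`, `ha_transitions` and `active_durations` are hypothetical, and the regular expression assumes the message wording seen in the excerpt above (`node started in "..." mode` and `node switched to "..." mode`).

```python
import re
from datetime import datetime

# Assumed line format, taken from the excerpt above:
#   346079:20240531:130739.735 "StandBy" node switched to "active" mode
LINE_RE = re.compile(
    r'^\s*(?P<pid>\d+):(?P<date>\d{8}):(?P<time>\d{6}\.\d{3})\s+'
    r'"(?P<node>[^"]+)" node (?:started in|switched to) "(?P<mode>[^"]+)" mode'
)


def ha_transitions(lines):
    """Return (timestamp, node, mode) for every HA mode message found."""
    result = []
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            ts = datetime.strptime(m["date"] + m["time"], "%Y%m%d%H%M%S.%f")
            result.append((ts, m["node"], m["mode"]))
    return result


def active_durations(transitions):
    """Seconds spent in "active" mode before each fallback to "standby"."""
    durations = []
    active_since = None
    for ts, _node, mode in transitions:
        if mode == "active":
            active_since = ts
        elif mode == "standby" and active_since is not None:
            durations.append((ts - active_since).total_seconds())
            active_since = None
    return durations
```

Passing the lines of the server log to `ha_transitions` and the result to `active_durations` gives the length of each active period. For the 13:07-13:09 part of the excerpt above, the node stays active from 13:07:39.735 to 13:09:00.371, about 80.6 seconds, before it falls back to standby.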

            wiper Andris Zeila
            mbuz Maksym Buz
            Team A