[ZBX-24574] HA node flipping between standby and active states Created: 2024 Jun 03  Updated: 2024 Dec 27  Resolved: 2024 Jun 08

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Server (S)
Affects Version/s: 6.4.15
Fix Version/s: 6.0.31rc1, 6.4.16rc1, 7.0.1rc1, 7.2.0alpha1

Type: Problem report Priority: Major
Reporter: Maksym Buz Assignee: Andris Zeila
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 20.04
Galera Cluster with MariaDB 10.5


Issue Links:
Duplicate
is duplicated by ZBX-24648 HA mode fails to run standalone after... Closed
Team: Team A
Sprint: S24-W24/25
Story Points: 1

 Description   

When native HA is used together with a Galera Cluster, the HA manager crashes during failover and stops running. The node that should become active keeps restarting its processes instead.

After some time the node starts in normal mode. This can happen after 1-2 iterations or only after 20-30 minutes; I could not find any pattern.

The problem seems to be related to a large configuration volume: I could not reproduce it in a test environment with a fresh Zabbix database, but it occurs every time on a server with about 5 million records in the items table.

The problem does not depend on the number of nodes in the Galera cluster. It persists even when the cluster has only one node and both Zabbix servers connect to it directly (without ProxySQL or similar).

Steps to reproduce:

  1. Start both nodes
  2. Stop the active node
  3. Observe the standby node going into the loop:

 

346080:20240531:130404.728 starting HA manager
346080:20240531:130404.850 HA manager started in standby mode
346079:20240531:130404.850 "StandBy" node started in "standby" mode
...
346391:20240531:130725.204 server #294 started [trigger housekeeper #1]
346079:20240531:130725.204 "StandBy" node switched to "standby" mode
346393:20240531:130725.644 starting HA manager
346393:20240531:130725.644 HA manager started in standby mode
346079:20240531:130739.735 "StandBy" node switched to "active" mode
346079:20240531:130739.740 server #0 started [main process]
346395:20240531:130739.741 server #1 started [service manager #1] 
...
346709:20240531:130900.371 server #295 started [odbc poller #1]
346079:20240531:130900.371 "StandBy" node switched to "standby" mode
346710:20240531:130900.819 starting HA manager
346710:20240531:130900.820 HA manager started in standby mode
346079:20240531:130914.910 "StandBy" node switched to "active" mode
346079:20240531:130914.916 server #0 started [main process]
346711:20240531:130914.916 server #1 started [service manager #1] 
...
325719:20240531:101215.912 "StandBy" node switched to "standby" mode
326036:20240531:101215.912 server #295 started [odbc poller #1]
326037:20240531:101216.385 starting HA manager
326037:20240531:101216.385 HA manager started in standby mode
325719:20240531:101230.476 "StandBy" node switched to "active" mode
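
To quantify the flipping from a server log, the mode-switch lines above can be extracted and the time spent in "active" mode before each drop back to "standby" measured. The following is a minimal sketch, not part of Zabbix: it assumes the `PID:YYYYMMDD:HHMMSS.mmm message` layout seen in the excerpt, and the names `parse_switches`, `active_periods` and `SAMPLE` are hypothetical helpers introduced only for illustration.

```python
import re
from datetime import datetime

# Matches lines such as:
# 346079:20240531:130739.735 "StandBy" node switched to "active" mode
# (layout assumed from the log excerpt in this report)
SWITCH_RE = re.compile(
    r'^\s*(?P<pid>\d+):(?P<date>\d{8}):(?P<time>\d{6}\.\d{3}) '
    r'"(?P<node>[^"]*)" node switched to "(?P<mode>[^"]+)" mode'
)


def parse_switches(lines):
    """Return (timestamp, node, mode) for every mode-switch line."""
    switches = []
    for line in lines:
        m = SWITCH_RE.match(line)
        if m:
            ts = datetime.strptime(m["date"] + m["time"], "%Y%m%d%H%M%S.%f")
            switches.append((ts, m["node"], m["mode"]))
    return switches


def active_periods(switches):
    """Return the length in seconds of each active -> standby period."""
    periods = []
    active_since = None
    for ts, _node, mode in switches:
        if mode == "active":
            active_since = ts
        elif mode == "standby" and active_since is not None:
            periods.append((ts - active_since).total_seconds())
            active_since = None
    return periods


# A few lines copied from the excerpt above; other lines are ignored.
SAMPLE = """\
346079:20240531:130725.204 "StandBy" node switched to "standby" mode
346393:20240531:130725.644 starting HA manager
346079:20240531:130739.735 "StandBy" node switched to "active" mode
346079:20240531:130900.371 "StandBy" node switched to "standby" mode
346079:20240531:130914.910 "StandBy" node switched to "active" mode
""".splitlines()

if __name__ == "__main__":
    for seconds in active_periods(parse_switches(SAMPLE)):
        print(f"active for {seconds:.3f} s before dropping to standby")
```

On the sample lines this reports a single active period of roughly 80 seconds, which matches the interval between the 130739.735 and 130900.371 entries in the excerpt. To run it against a real log, pass the open file object to `parse_switches` instead of `SAMPLE`.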


 Comments   
Comment by Andris Zeila [ 2024 Jun 07 ]

Released ZBX-24574 in:

  • pre-6.0.31rc1 31e1693de8b
  • pre-6.4.16rc1 d7be8fc02fa
  • pre-7.0.1rc1 7f28b7b142c
  • pre-7.2.0alpha1 bdd90f5aea9