[ZBXNEXT-6923] High availability cluster for Zabbix Server Created: 2021 Sep 20 Updated: 2024 Dec 10 Resolved: 2022 Apr 21 |
|
Status: | Closed |
Project: | ZABBIX FEATURE REQUESTS |
Component/s: | Server (S) |
Affects Version/s: | None |
Fix Version/s: | 6.0.0alpha5, 6.0 (plan) |
Type: | New Feature Request | Priority: | Minor |
Reporter: | Rostislav Palivoda | Assignee: | Andris Zeila |
Resolution: | Fixed | Votes: | 13 |
Labels: | None | ||
Σ Remaining Estimate: | Not Specified | Remaining Estimate: | Not Specified |
Σ Time Spent: | Not Specified | Time Spent: | Not Specified |
Σ Original Estimate: | Not Specified | Original Estimate: | Not Specified |
Attachments: |
Issue Links: |
|
Sub-Tasks: |
|
Team: |
Sprint: | Sprint 80 (Sep 2021), Sprint 81 (Oct 2021), Sprint 82 (Nov 2021), Sprint 83 (Dec 2021), Sprint 84 (Jan 2022), Sprint 85 (Feb 2022), Sprint 86 (Mar 2022), Sprint 87 (Apr 2022) |
Story Points: | 9 |
Description |
Currently, Zabbix does not support High Availability (HA) operations out of the box. Zabbix is known to work in HA mode with the help of third-party HA clustering solutions (such as Pacemaker) but the barrier to entry (high learning curve for HA software, complicated network settings, error-prone cluster configuration) can be too high for small to medium Zabbix installations. Zabbix should offer a simple but reliable native solution that would make HA installation easy to set up and operate. It's understood that the simplicity aspect of the native Zabbix HA solution has its trade-offs and may not cover all HA requirements, especially for very big and complex installations. Thus, the native solution should be opt-in and not interfere with the third-party clustering software, should the user decide to go their own HA route. |
Comments |
Comment by Arli [ 2021 Oct 06 ] |
I've been running both Zabbix Server and Zabbix Proxy in active-passive mode using pacemaker for years now, and this proposal does not provide much more value or solve any shortcomings of that setup. Zabbix Agents are already able to communicate with multiple Proxies. May I suggest instead focusing on ways to run several Server and Proxy processes simultaneously and on enabling multiple Zabbix Servers to collect data from the same Proxy. Each Zabbix Server could probably even store gathered data in its own separate database; redundant data deduplication and filling gaps in data could be taken care of on the frontend (see how it can already be done in a Prometheus + Thanos setup), and alert deduplication could be handled by some new component (take a look at the Prometheus Alertmanager concept).
|
Comment by Alexei Vladishev [ 2021 Oct 12 ] |
arli, thanks for your feedback. This solution is just a much simpler alternative to pacemaker-based HA setups. The next step is to introduce load balancing and HA for proxies in 6.2, and then true horizontal scalability, which would require significant changes to the way Zabbix server works. It is already on the roadmap: https://www.zabbix.com/roadmap |
Comment by Andris Zeila [ 2021 Oct 22 ] |
Released
Updated general documentation:
|
Comment by Alexei Vladishev [ 2021 Oct 27 ] |
All interested in the HA solution, please grab the latest alpha or beta of Zabbix 6.0 and share your feedback. |
Comment by Dimitri Bellini [ 2021 Oct 27 ] |
Hi Alexei,
Thanks so much for this new feature!! |
Comment by Alexei Vladishev [ 2021 Oct 27 ] |
dimitri.bellini, runtime commands are not supposed to give any output; issuing one is like sending a signal to the process to perform some task. It would be nice to extend this in the future. |
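To illustrate (a sketch, assuming the HA-related runtime control options shipped with the 6.0 alphas and an illustrative log path): the command below only signals the running server, so the node list ends up in the server log rather than on the terminal.

zabbix_server -R ha_status                      # ask the running server to report HA cluster status
tail -n 20 /var/log/zabbix/zabbix_server.log    # the node list is written here (path depends on the installation)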
Comment by Dimitri Bellini [ 2021 Oct 27 ] |
@Alexei, ok thanks so much. |
Comment by Vladislavs Sokurenko [ 2021 Oct 27 ] |
About "Ghost Cluster Node", not specifying "HANodeName" yields in empty node name. 70771:20211027:132840.862 # ID Name Address Status Last Access 70771:20211027:132840.862 1. ckv9bg6t90001llpv7rv52gr5 node 1 127.0.0.1:10051 active 2s 70771:20211027:132840.862 2. ckv9dk0se0001j0pvm47mtnh8 127.0.0.1:10051 stopped 9s Documentation should be improved to clarify that, something as: ## Option: HANodeName # The high availability cluster node name. # When empty, server is working in standalone mode and node with empty name is created. # # Mandatory: no # Default: # HANodeName= |
Comment by Brian van Baekel [ 2021 Oct 28 ] |
I've played quite a bit around with it: love it!

I have the feeling I am able to introduce a race situation.

Setup: zbx1
Build the environment and confirm everything is working, now on zbx1 change the server config to something ridiculous:
Now stop both Zabbix servers and start zbx1 with the stupid configuration (it will crash due to 'too many connections' to the database). Next, and this needs a bit of timing, start zbx2.
zbx1 is talking with the HA manager to the DB, claiming the active state and starting the processes. zbx2 is talking to the DB, sees that zbx1 is the active node, so it will be started as standby.
Due to the loop of crashes on zbx1 the timing might be just right and this situation is observed:
So far, after 5-10 of those cycles the moments of checking started to drift and zbx2 became active, but I have the feeling that if we tweak the number of processes that should be started (and thus the timing) we can get into an infinite loop.
An absolute edge case and, like I said, I love this feature!
|
Comment by Alexei Vladishev [ 2021 Oct 28 ] |
brian.baekel, I will pass this information to my colleagues, thanks! |
Comment by Vladislavs Sokurenko [ 2021 Oct 28 ] |
Indeed brian.baekel, it was not tailored to handle situations where a node crashes and restarts instantly because systemd immediately restarts it again without delay. I suggest that when such a situation happens (a crashed node with active status starts up), the node should switch itself into standby mode and sleep for 10 seconds so that other nodes have a chance to take over. |
Comment by Brian van Baekel [ 2021 Oct 28 ] |
A delay would indeed fix the possible issue in 99.9% of the cases. Thinking of how long that delay should be:

60907:20211028:140111.747 Zabbix Server stopped. Zabbix 6.0.0alpha5 (revision 6b9f1a4346).
62582:20211028:140121.973 Starting Zabbix Server. Zabbix 6.0.0alpha5 (revision 6b9f1a4346).
62582:20211028:140125.381 Zabbix Server stopped. Zabbix 6.0.0alpha5 (revision 6b9f1a4346).
64168:20211028:140135.469 Starting Zabbix Server. Zabbix 6.0.0alpha5 (revision 6b9f1a4346).
64168:20211028:140139.162 Zabbix Server stopped. Zabbix 6.0.0alpha5 (revision 6b9f1a4346).
65753:20211028:140149.217 Starting Zabbix Server. Zabbix 6.0.0alpha5 (revision 6b9f1a4346).

(CentOS 8)

So systemd is applying a 10 second delay, then 4 secs of start/crash, then a 10 sec delay, etc. I think a 5 or 10 second delay within Zabbix might not solve this, as it is the same as the systemd default. I would go for something like a 4, 6 or 7 second delay so that the timing starts drifting compared to the systemd delay and the race situation is ruled out as much as possible.
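For reference, the 10-second spacing in the log above is the service's restart delay on the systemd side. A drop-in override along these lines (a sketch with an illustrative path and value, assuming the packaged zabbix-server.service unit restarts on failure) is one way to shift that spacing so it drifts relative to whatever delay Zabbix itself applies:

# /etc/systemd/system/zabbix-server.service.d/restart-delay.conf  (hypothetical drop-in)
[Service]
RestartSec=7s

# apply the override
systemctl daemon-reload
systemctl restart zabbix-server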
Another idea (and honestly I have no idea if this one will work, or how hard it is to implement) is to let the HA manager check the DB during startup, set the state to 'active - starting', and, as the last process to be started during startup (+2 sec delay), let it change its state to 'active' to confirm the Zabbix server daemon is running rather than still starting its processes. I can imagine this is a much bigger change though, and I'm not sure how many other edge cases it might introduce. |
Comment by Vladislavs Sokurenko [ 2021 Oct 28 ] |
A standby node will become active within 5 seconds of detecting that the other node has stopped, so 10 seconds should be enough. A possible solution is for the HA manager to check on startup whether other standby nodes are already running and, if so, sleep for 10 seconds (we do not want to always sleep for 10 seconds during startup, as this would slow startup down). |
Comment by dimir [ 2021 Oct 28 ] |
Reported as |
Comment by Dimitri Bellini [ 2021 Oct 29 ] |
@Vladislavs & @Brian Why not implement some sort of "temporary node vacuum" in case of boot loop of the current "active node" and promote the available "standby node"? Maybe in the DB we need more detail step during the "Zabbix startup session", in this mode we can keep trace of how many boot repetition it's happen to that node in a specific time. |
Comment by Vladislavs Sokurenko [ 2021 Oct 29 ] |
Yes dimitri.bellini, the idea is that another standby node must be promoted; it will be implemented under |
Comment by Dimitri Bellini [ 2021 Oct 29 ] |
@Vladislavs Fantastic, thanks so much. |
Comment by Nathan Liefting [ 2021 Oct 31 ] |
Just took a look at the feature as well, looking very good so far! Solid work.
One small point I noticed concerns the System information widget and the frontend message stating that the Zabbix server isn't running. When your Zabbix server fails over to a different node, the frontend displays that warning and the system information can no longer be read.

Of course, this is not the most important issue, but I think it is an important one for the general perception of Zabbix users nonetheless. They might be confused about the state of their Zabbix server this way. Will there be a way to add multiple Zabbix server nodes to the System information (widget) and the warning notification? Or is there any plan for this, besides the already existing option to turn off the notification and simply ignore the system information on the user side? |
Comment by Vladislavs Sokurenko [ 2021 Nov 01 ] |
Please set ExternalAddress, nathan.liefting, and remove the server entry from zabbix.conf.php |
Comment by Nathan Liefting [ 2021 Nov 01 ] |
Hi Vladislavs, I wasn't sure what this parameter was for, but I see now. Thanks! I played around with it a bit and got it to work correctly, which is very nice. Simply set the ExternalAddress parameter on each Zabbix server to that server's own address, and the frontend will use it to tell which of the Zabbix servers is running and what IP to use in that case.
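For reference, a minimal sketch of the setup described here, with illustrative node names and addresses (in this alpha the parameter is ExternalAddress; the rename suggested below would make it NodeAddress):

# zabbix_server.conf on zbx1
HANodeName=node1
ExternalAddress=192.168.0.11

# zabbix_server.conf on zbx2
HANodeName=node2
ExternalAddress=192.168.0.12

# zabbix.conf.php on the frontend: leave the static server address unset
# (commented out), so the frontend picks the active node from the database
// $ZBX_SERVER      = 'localhost';
// $ZBX_SERVER_PORT = '10051';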
Forgot to add: we do not fill in an external address per se here (the name implies a VIP or public IP). How about calling this parameter NodeAddress or ClusterNodeAddress instead of ExternalAddress? Just a suggestion to make this clearer without needing to consult the documentation. |
Comment by Alexei Vladishev [ 2021 Nov 01 ] |
nathan.liefting, we just discussed it here. NodeAddress sounds really good; a new issue was registered to address this: |
Comment by Nathan Liefting [ 2021 Nov 01 ] |
Amazing! Thanks for addressing it so promptly |
Comment by Eric Anderson [ 2021 Dec 08 ] |
Wow, finally this can replace my KB article I wrote years ago!!! https://ericsysmin.com/2016/02/18/configuring-high-availability-ha-zabbix-server-on-centos-7/ |
Comment by Alexei Vladishev [ 2021 Dec 08 ] |
ericsysmin, sorry for this! Your article is great and has helped many Zabbixers. |
Comment by Eric Anderson [ 2021 Dec 08 ] |
@Alexei, I am glad this is finally a feature. It will simplify configuration and increase reliability. I didn't see you guys at Re:Invent this year; hopefully next year! |