[ZBXNEXT-6923] High availability cluster for Zabbix Server Created: 2021 Sep 20 Updated: 2024 Dec 10 Resolved: 2022 Apr 21 |
|
Status: | Closed |
Project: | ZABBIX FEATURE REQUESTS |
Component/s: | Server (S) |
Affects Version/s: | None |
Fix Version/s: | 6.0.0alpha5, 6.0 (plan) |
Type: | New Feature Request | Priority: | Minor |
Reporter: | Rostislav Palivoda | Assignee: | Andris Zeila |
Resolution: | Fixed | Votes: | 13 |
Labels: | None | ||
Σ Remaining Estimate: | Not Specified | Remaining Estimate: | Not Specified |
Σ Time Spent: | Not Specified | Time Spent: | Not Specified |
Σ Original Estimate: | Not Specified | Original Estimate: | Not Specified |
Attachments: |
Issue Links: |
|
Sub-Tasks: |
|
Team: |
Sprint: | Sprint 80 (Sep 2021), Sprint 81 (Oct 2021), Sprint 82 (Nov 2021), Sprint 83 (Dec 2021), Sprint 84 (Jan 2022), Sprint 85 (Feb 2022), Sprint 86 (Mar 2022), Sprint 87 (Apr 2022) |
Story Points: | 9 |
Description |
Currently, Zabbix does not support High Availability (HA) operations out of the box. Zabbix is known to work in HA mode with the help of third-party HA clustering solutions (such as Pacemaker) but the barrier to entry (high learning curve for HA software, complicated network settings, error-prone cluster configuration) can be too high for small to medium Zabbix installations. Zabbix should offer a simple but reliable native solution that would make HA installation easy to set up and operate. It's understood that the simplicity aspect of the native Zabbix HA solution has its trade-offs and may not cover all HA requirements, especially for very big and complex installations. Thus, the native solution should be opt-in and not interfere with the third-party clustering software, should the user decide to go their own HA route. |
Comments |
Comment by Arli [ 2021 Oct 06 ] |
I've been running both Zabbix Server and Zabbix Proxy in active-passive mode using pacemaker for years now, and this proposal does not provide much more value or solve any shortcomings of that setup. Zabbix Agents are already able to communicate with multiple Proxies. May I suggest instead focusing on ways to run several Server and Proxy processes simultaneously and on enabling multiple Zabbix Servers to collect data from the same Proxy. Each Zabbix Server could probably even store gathered data in its own separate database; redundant data deduplication and filling gaps in data could be taken care of on the frontend (see how it can already be done in a Prometheus + Thanos setup), and alert deduplication could be handled by some new component (take a look at the Prometheus Alertmanager concept).
|
Comment by Alexei Vladishev [ 2021 Oct 12 ] |
arli, thanks for your feedback. This solution is just a much simpler alternative to pacemaker-based HA setups. The next step is to introduce load balancing and HA for proxies in 6.2, and then true horizontal scalability, which would require significant changes to the way Zabbix server works. It is already on the roadmap: https://www.zabbix.com/roadmap |
Comment by Andris Zeila [ 2021 Oct 22 ] |
Released
Updated general documentation:
|
Comment by Alexei Vladishev [ 2021 Oct 27 ] |
All interested in the HA solution, please grab the latest alpha or beta of Zabbix 6.0 and share your feedback. |
Comment by Dimitri Bellini [ 2021 Oct 27 ] |
Hi Alexei,
Thanks so much for this new feature!! |
Comment by Alexei Vladishev [ 2021 Oct 27 ] |
dimitri.bellini, runtime commands are not supposed to give any output; issuing one is like sending a signal to the process to perform some task. It would be nice to extend this in the future. |
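To illustrate (a sketch, assuming the HA-related runtime control options shipped with the 6.0 alphas and an illustrative log path): the command below only signals the running server, so the node list ends up in the server log rather than on the terminal.

zabbix_server -R ha_status                      # ask the running server to report HA cluster status
tail -n 20 /var/log/zabbix/zabbix_server.log    # the node list is written here (path depends on the installation)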
Comment by Dimitri Bellini [ 2021 Oct 27 ] |
@Alexei, ok thanks so much. |
Comment by Vladislavs Sokurenko [ 2021 Oct 27 ] |
About "Ghost Cluster Node", not specifying "HANodeName" yields in empty node name. 70771:20211027:132840.862 # ID Name Address Status Last Access 70771:20211027:132840.862 1. ckv9bg6t90001llpv7rv52gr5 node 1 127.0.0.1:10051 active 2s 70771:20211027:132840.862 2. ckv9dk0se0001j0pvm47mtnh8 127.0.0.1:10051 stopped 9s Documentation should be improved to clarify that, something as: ## Option: HANodeName # The high availability cluster node name. # When empty, server is working in standalone mode and node with empty name is created. # # Mandatory: no # Default: # HANodeName= |
Comment by Brian van Baekel [ 2021 Oct 28 ] |
I've played quite a bit around with it: love it!

I have the feeling I am able to introduce a race situation.

Setup: zbx1
Build the environment and confirm everything is working, now on zbx1 change the server config to something ridiculous:
Now stop both Zabbix servers and start zbx1 with the stupid configuration (it will crash due to 'too many connections' to the database). Next, and this needs a bit of timing, start zbx2.
zbx1 is talking with the HA manager to the DB, claiming the active state and starting the processes. zbx2 is talking to the DB, sees that zbx1 is the active node, so it will be started as standby.
Due to the loop of crashes on zbx1 the timing might be just right and this situation is observed:
So far, after 5-10 of those cycles the moments of checking started to drift and zbx2 became active, but I have the feeling that if we tweak the number of processes that should be started (and thus the timing) we can get into an infinite loop.
An absolute edge case and, like I said, I love this feature!
|
Comment by Alexei Vladishev [ 2021 Oct 28 ] |
brian.baekel, I will pass this information to my colleagues, thanks! |
Comment by Vladislavs Sokurenko [ 2021 Oct 28 ] |
Indeed brian.baekel, it was not tailored to handle situations where a node crashes and restarts instantly because systemd immediately restarts it again without delay. I suggest that when such a situation happens (a crashed node with active status starts up), the node should switch itself into standby mode and sleep for 10 seconds so that other nodes have a chance to take over. |
Comment by Brian van Baekel [ 2021 Oct 28 ] |
A delay would indeed fix the possible issue in 99.9% of the cases. Thinking of how long that delay should be:

60907:20211028:140111.747 Zabbix Server stopped. Zabbix 6.0.0alpha5 (revision 6b9f1a4346).
62582:20211028:140121.973 Starting Zabbix Server. Zabbix 6.0.0alpha5 (revision 6b9f1a4346).
62582:20211028:140125.381 Zabbix Server stopped. Zabbix 6.0.0alpha5 (revision 6b9f1a4346).
64168:20211028:140135.469 Starting Zabbix Server. Zabbix 6.0.0alpha5 (revision 6b9f1a4346).
64168:20211028:140139.162 Zabbix Server stopped. Zabbix 6.0.0alpha5 (revision 6b9f1a4346).
65753:20211028:140149.217 Starting Zabbix Server. Zabbix 6.0.0alpha5 (revision 6b9f1a4346).

(CentOS 8)

So systemd is applying a 10 second delay, then 4 secs of start/crash, then a 10 sec delay, etc. I think a 5 or 10 second delay within Zabbix might not solve this, as it is the same as the systemd default. I would go for something like a 4, 6 or 7 second delay so that the timing starts drifting compared to the systemd delay and the race situation is ruled out as much as possible.
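For reference, the 10-second spacing in the log above is the service's restart delay on the systemd side. A drop-in override along these lines (a sketch with an illustrative path and value, assuming the packaged zabbix-server.service unit restarts on failure) is one way to shift that spacing so it drifts relative to whatever delay Zabbix itself applies:

# /etc/systemd/system/zabbix-server.service.d/restart-delay.conf  (hypothetical drop-in)
[Service]
RestartSec=7s

# apply the override
systemctl daemon-reload
systemctl restart zabbix-server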
Another idea (and honestly I have no idea if this one will work, or how hard it is to implement) is to let the HA manager check the DB during startup, set the state to 'active - starting', and, as the last process to be started during startup (+2 sec delay), let it change its state to 'active' to confirm the Zabbix server daemon is running rather than still starting its processes. I can imagine this is a much bigger change though, and I'm not sure how many other edge cases it might introduce. |
Comment by Vladislavs Sokurenko [ 2021 Oct 28 ] |
A standby node will become active within 5 seconds of detecting that the other node has stopped, so 10 seconds should be enough. A possible solution is for the HA manager to check on startup whether other standby nodes are already running and, if so, sleep for 10 seconds (we do not want to always sleep for 10 seconds during startup, as this would slow startup down). |
Comment by dimir [ 2021 Oct 28 ] |
Reported as |
Comment by Dimitri Bellini [ 2021 Oct 29 ] |
@Vladislavs & @Brian Why not implement some sort of "temporary node vacuum" in case of boot loop of the current "active node" and promote the available "standby node"? Maybe in the DB we need more detail step during the "Zabbix startup session", in this mode we can keep trace of how many boot repetition it's happen to that node in a specific time. |
Comment by Vladislavs Sokurenko [ 2021 Oct 29 ] |
Yes dimitri.bellini, the idea is that another standby node must be promoted; it will be implemented under |
Comment by Dimitri Bellini [ 2021 Oct 29 ] |
@Vladislavs Fantastic, thanks so much. |
Comment by Nathan Liefting [ 2021 Oct 31 ] |
Just took a look at the feature as well, looking very good so far! Solid work.
One small point I noticed concerns the System information widget and the frontend message stating that the Zabbix server isn't running. When your Zabbix server fails over to a different node, the frontend displays that warning and the system information can no longer be read.

Of course, this is not the most important issue, but I think it is an important one for the general perception of Zabbix users nonetheless. They might be confused about the state of their Zabbix server this way. Will there be a way to add multiple Zabbix server nodes to the System information (widget) and the warning notification? Or is there any plan for this, besides the already existing option to turn off the notification and simply ignore the system information on the user side? |
Comment by Vladislavs Sokurenko [ 2021 Nov 01 ] |
Please set ExternalAddress, nathan.liefting, and remove the server entry from zabbix.conf.php |
Comment by Nathan Liefting [ 2021 Nov 01 ] |
Hi Vladislavs, I wasn't sure what this parameter was for, but I see now. Thanks! I played around with it a bit and got it to work correctly, which is very nice. Simply set the ExternalAddress parameter on each Zabbix server to that server's own address, and the frontend will use it to tell which of the Zabbix servers is running and what IP to use in that case.
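For reference, a minimal sketch of the setup described here, with illustrative node names and addresses (in this alpha the parameter is ExternalAddress; the rename suggested below would make it NodeAddress):

# zabbix_server.conf on zbx1
HANodeName=node1
ExternalAddress=192.168.0.11

# zabbix_server.conf on zbx2
HANodeName=node2
ExternalAddress=192.168.0.12

# zabbix.conf.php on the frontend: leave the static server address unset
# (commented out), so the frontend picks the active node from the database
// $ZBX_SERVER      = 'localhost';
// $ZBX_SERVER_PORT = '10051';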
Forgot to add: we do not fill in an external address per se here (the name implies a VIP or public IP). How about calling this parameter NodeAddress or ClusterNodeAddress instead of ExternalAddress? Just a suggestion to make this clearer without needing to consult the documentation. |
Comment by Alexei Vladishev [ 2021 Nov 01 ] |
nathan.liefting, we just discussed it here. NodeAddress sounds really good; a new issue was registered to address this: |
Comment by Nathan Liefting [ 2021 Nov 01 ] |
Amazing! Thanks for addressing it so promptly |
Comment by Eric Anderson [ 2021 Dec 08 ] |
Wow, finally this can replace my KB article I wrote years ago!!! https://ericsysmin.com/2016/02/18/configuring-high-availability-ha-zabbix-server-on-centos-7/ |
Comment by Alexei Vladishev [ 2021 Dec 08 ] |
ericsysmin, sorry for this! Your article is great and has helped many Zabbixers. |
Comment by Eric Anderson [ 2021 Dec 08 ] |
@Alexei, I am glad this is finally a feature. It will simplify configuration and increase reliability. I didn't see you guys at Re:Invent this year; hopefully next year! |