[ZBXNEXT-1572] Active / Active High Availability Zabbix Created: 2013 Jan 14  Updated: 2022 May 25  Resolved: 2022 May 25

Status: Closed
Project: ZABBIX FEATURE REQUESTS
Component/s: Server (S)
Affects Version/s: 2.0.4, 3.0.29, 4.0.17, 4.4.5, 5.0.0alpha1
Fix Version/s: None

Type: New Feature Request Priority: Major
Reporter: Simon Tsang Assignee: Unassigned
Resolution: Workaround proposed Votes: 35
Labels: highavailability
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux



 Description   

Currently, Active/Active is not possible with Zabbix without many modifications (according to the Zabbix howto wiki). Is Active/Active HA on the roadmap for Zabbix in the near future? It would be very nice if Zabbix could support that. Is there any plan for Zabbix to support Postgres-XC as well?

Thank you.



 Comments   
Comment by richlv [ 2013 Jan 14 ]

To clarify, does this refer to the possibility of active/active Zabbix server clustering?

Comment by David Israel [ 2013 Mar 09 ]

What if the https://www.zabbix.com/documentation/2.0/manual/appendix/items/activepassive active agent configuration could be used to send the same data to two Zabbix servers? From the documentation it is not clear whether the checks for both active servers would be run twice. If not, this is already a reasonable HA solution.

Comment by Onno Steenbergen [ 2013 May 01 ]

My view of active/active is the possibility of running multiple servers against a single database. If there is one server, it behaves as normal. Running a second Zabbix server against the same database will distribute the load. Agents can report to both servers or to a roaming IP (which resolves to one server).
If one of these servers crashes, the other will have an increased load but the service will stay online.
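
To illustrate the load-sharing idea above, here is a minimal sketch of how hosts could be spread over whichever servers are currently alive. It uses rendezvous (highest-random-weight) hashing, which is a technique chosen for this example and not something Zabbix is known to implement. The server names (`zbx-a`, `zbx-b`) and the functions `owner` and `distribute` are hypothetical.

```python
import hashlib


def owner(host, live_servers):
    """Return the live server with the highest hash score for this host."""
    if not live_servers:
        raise ValueError("no live Zabbix servers")

    def score(server):
        # Deterministic per (server, host) pair, so every server computes
        # the same answer without coordinating with the others.
        digest = hashlib.sha256(f"{server}|{host}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(live_servers, key=score)


def distribute(hosts, live_servers):
    """Map each live server to the list of hosts it should monitor."""
    plan = {server: [] for server in live_servers}
    for host in hosts:
        plan[owner(host, live_servers)].append(host)
    return plan


if __name__ == "__main__":
    hosts = [f"host{i}" for i in range(10)]
    print(distribute(hosts, ["zbx-a", "zbx-b"]))  # both servers share the load
    print(distribute(hosts, ["zbx-a"]))           # zbx-b crashed: zbx-a takes all
```

With this scheme, a crash only moves the hosts of the failed server; hosts already owned by a surviving server stay where they are.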

It should also be possible to run a distributed monitoring setup in Active/Active. The result is similar to what is discussed above, except that instead of the load being distributed randomly, each host is monitored by a specific node.

My current situation:
3 node distributed setup (server - database separated so 6 servers are used)

  • node 1 checks network 1
  • node 2 checks network 2 and 3
  • node 3 is preproduction

Currently, if node 2 fails, I need to restore the virtual machine from a backup, which results in downtime. Also, if a device in preproduction is promoted, it needs to be removed from node 3 and added to node 1 or 2, losing all its data.

The ideal situation:
A 3-server database cluster which provides High Availability at the database level
A Zabbix cluster of 3 (or more) nodes that uses the database cluster to store all data

The database has a list of all hosts/items and a list of preferred nodes for each host. The Zabbix servers monitor each other with keepalives, and if all preferred nodes for a host are unavailable, another node tries to monitor it. A second list will be needed to define which nodes may replace another node (A can be replaced by B and C, but not by D and E).
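
The two lists described above can be sketched as follows. This is only an illustration of the proposal, not an existing Zabbix feature; the function `assign_hosts`, the node names, and the data layout (plain dictionaries instead of database tables) are assumptions made for the example.

```python
def assign_hosts(preferred, replacements, alive):
    """Choose a monitoring node for every host.

    preferred:    host -> ordered list of preferred nodes
    replacements: node -> ordered list of nodes allowed to replace it
    alive:        set of nodes currently answering keepalives
    """
    assignment = {}
    for host, nodes in preferred.items():
        # First choice: any preferred node that is still alive.
        chosen = next((n for n in nodes if n in alive), None)
        if chosen is None:
            # All preferred nodes are down: fall back to a node that is
            # explicitly allowed to replace one of them.
            for node in nodes:
                chosen = next(
                    (r for r in replacements.get(node, []) if r in alive),
                    None,
                )
                if chosen is not None:
                    break
        assignment[host] = chosen  # None means no eligible node is alive
    return assignment


if __name__ == "__main__":
    preferred = {"router1": ["A"], "db1": ["D"]}
    replacements = {"A": ["B", "C"], "D": ["E"]}
    # Node A is down, so router1 moves to B; db1 stays on D.
    print(assign_hosts(preferred, replacements, alive={"B", "C", "D"}))
```

A host that ends up with `None` has no eligible node left, which is the case the replacement list is meant to make explicit (A is never replaced by D or E, even if they are alive).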

On a side note: Some DB clusters, such as PostgreSQL, do not support Active/Active but support read-only backup nodes. Frontend could use read-only DB nodes to reduce the load on the DB server. Other tweaks to reduce the number of connections to the DB are probably necessary.

To summarize:
Nodes need to be able to replace other nodes without downtime and distribute the monitoring load among a group of nodes.

Comment by Murat Koç [ 2013 May 01 ]

Use Galera ( http://codership.com/content/using-galera-cluster ) with MySQL, or Oracle RAC (if you have enough money).

I also suggest using HAProxy in front of the Galera cluster, both to distribute the load and to expose a single database IP to the Zabbix servers.

We use Galera in different kinds of production systems and are happy with it.

Since you are using virtual machines, you can clone the VM and use the clone as a failover virtual machine.

I think that setup will solve all of your problems.

Comment by Onno Steenbergen [ 2013 May 15 ]

Replacing the DB with a master-master cluster doesn't solve all issues.

  • Not able to move hosts from one node to the other (Distributed Monitoring)
  • Need to use VM cloning, which results in a longer downtime (detect issue, deploy new vm, wait for start-up)
  • Active-Active database technology is likely to fail during split-brain scenarios (and in these situations having access to the monitoring system is crucial). A node must use its own database on the cluster and not share a single large database. If a split brain occurs, the database is always accessible, and as only one node writes to each DB there is no problem when the split brain is resolved.

I know active-active databases have mediators to resolve split brains, but if you have two data centers and there is no connection between them, the database is only active in the data center where the mediator is. You then have no monitoring in the other data center, as it cannot store any data.
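
The mediator problem can be shown with a simplified majority-quorum rule. This is a sketch of the general idea only, not the algorithm of Galera or any other specific product; the member names and the function `may_write` are hypothetical.

```python
def may_write(reachable, all_members):
    """A partition may accept writes only if it sees a strict majority."""
    return len(set(reachable) & set(all_members)) * 2 > len(all_members)


if __name__ == "__main__":
    # Two data centers with one database node each, mediator located in DC1.
    cluster = {"db-dc1", "db-dc2", "mediator"}

    # The link between the data centers is down.
    dc1_view = {"db-dc1", "mediator"}  # 2 of 3 members: majority
    dc2_view = {"db-dc2"}              # 1 of 3 members: minority

    print("DC1 may write:", may_write(dc1_view, cluster))
    print("DC2 may write:", may_write(dc2_view, cluster))
```

Under this rule the data center without the mediator can never form a majority on its own, so a Zabbix server there has nowhere to store data until the link is restored.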

Comment by Onno Steenbergen [ 2013 May 15 ]

Maybe it is easier to assume that each Zabbix node has its own DB and needs to be able to sync with the other nodes. In case of a failure, the remaining Zabbix nodes divide the work among themselves, provided they can reach the affected network.

Comment by Sol Arioto [ 2013 Nov 07 ]

Where are we with this? Not having an Active/Active solution is a major problem because of the downtime during upgrades and maintenance.

Comment by jagadeeswar Reddy [ 2017 Feb 06 ]

We need a high-availability Zabbix cluster in which every component of the system fails over when issues come up.

Comment by Oleksii Zagorskyi [ 2022 May 25 ]

In 6.0 we have a clustering solution implemented, so many of the things discussed here are no longer relevant. If needed, new requests should be created.

Let's close this one.

Generated at Fri Jun 06 19:33:53 EEST 2025 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.