ZABBIX BUGS AND ISSUES / ZBX-25417

Feasibility of Zero Downtime Upgrade

    • Type: Problem report
    • Resolution: Commercial support required
    • Priority: Trivial
    • Affects Version/s: 7.0.3

      Hi Zabbix Team,

      Our organization is planning to upgrade our Zabbix installs from 6.0.13 to 7.0.1 across 3 separate environments (Dev, Stage, and Prod).

      Our environment setup primarily consists of custom Docker images based on the Zabbix 7.0 Dockerfiles, as per the below, with each service running in a separate Docker container; a rough sketch of this layout follows the list below.

      • zabbix-proxy (proxy-sqlite3), single container instance
      • zabbix-server (server-mysql), single container instance/standalone, HA is not enabled
      • zabbix-frontend (web-nginx-mysql), single container instance
      • zabbix-agent (agent - not agent2), single container instance
      • AWS RDS Aurora/MySQL 8.0 DB backend (cluster: 1 writer in the Dev/Stage environments, 1 writer/1 reader in the Prod environment)

      NOTE: The Prod environment has multiple proxies across multiple AWS regions.
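
      For reference, here is a rough sketch of an equivalent stack started with plain docker run commands. Our actual images are custom builds, so the container names, network, endpoints, credentials, and image tags below are placeholders rather than our real configuration:

        docker network create zabbix-net

        # Zabbix server (standalone, MySQL backend on Aurora)
        docker run -d --name zabbix-server --network zabbix-net \
          -e DB_SERVER_HOST=<aurora-writer-endpoint> \
          -e MYSQL_USER=zabbix -e MYSQL_PASSWORD=<secret> -e MYSQL_DATABASE=zabbix \
          zabbix/zabbix-server-mysql:7.0-latest

        # Frontend (nginx), pointed at the same database and at the server
        docker run -d --name zabbix-web --network zabbix-net -p 443:8443 \
          -e DB_SERVER_HOST=<aurora-writer-endpoint> \
          -e MYSQL_USER=zabbix -e MYSQL_PASSWORD=<secret> -e MYSQL_DATABASE=zabbix \
          -e ZBX_SERVER_HOST=zabbix-server \
          zabbix/zabbix-web-nginx-mysql:7.0-latest

        # Proxy (SQLite3) and agent
        docker run -d --name zabbix-proxy --network zabbix-net \
          -e ZBX_SERVER_HOST=zabbix-server \
          zabbix/zabbix-proxy-sqlite3:7.0-latest
        docker run -d --name zabbix-agent --network zabbix-net \
          -e ZBX_SERVER_HOST=zabbix-server \
          zabbix/zabbix-agent:7.0-latest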

       

      The issue we are having is determining how best to perform zero-downtime upgrades across our environments.

      During our past upgrade cycle, in which we upgraded from Zabbix 4.4 to 6.0.13, we were able to leverage AWS blue/green deployments for the DB backend, which allowed us to do the following:

      1) Have an active blue cluster that the environment Zabbix server instance pointed to.

      2) Have a non-active green cluster with MySQL replication from the blue cluster, which allowed us to: 1) perform the DB schema changes required for the upgrade and 2) stand up an additional, isolated Zabbix server instance pointed at this cluster solely for the purpose of performing the DB upgrade (a rough sketch of this instance follows below). This instance had no proxy or frontend traffic directed towards it.

      3) Upon completion of the DB upgrade, we switched the green cluster over to become the active blue cluster, decommissioned the isolated Zabbix server instance, and upgraded the environment Zabbix server instance and all other components (frontend, proxy, and agent).
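
      As a rough illustration of step 2, the isolated instance was essentially just a second server container pointed at the green writer endpoint. A minimal sketch, assuming the stock image; the endpoint, credentials, and version tag are placeholders:

        # Temporary, isolated server used only to run the database upgrade
        # against the green cluster; no proxies or frontend point at it.
        docker run -d --name zabbix-server-upgrade \
          -e DB_SERVER_HOST=<green-cluster-writer-endpoint> \
          -e MYSQL_USER=zabbix -e MYSQL_PASSWORD=<secret> -e MYSQL_DATABASE=zabbix \
          zabbix/zabbix-server-mysql:<target-version>

        # Watch the schema upgrade progress in the server log
        docker logs -f zabbix-server-upgrade 2>&1 | grep -i "database upgrade"

        # After the upgrade completes and green is promoted to active (done on
        # the RDS side), retire the temporary instance.
        docker rm -f zabbix-server-upgrade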

       

      During this upgrade cycle from 6.0.13 to 7.0.1, we again looked to leverage AWS blue/green deployments but with differing results as explained below.

      When we attempted to stand up an isolated Zabbix server instance pointed at the non-active green cluster to perform the database upgrade as per step 2 above, we encountered the error listed below, as captured in the Zabbix server log.

        7:20241017:014653.609 Starting Zabbix Server. Zabbix 7.0.1 (revision 0543fbe).
           7:20241017:014653.609 ****** Enabled features ******
           7:20241017:014653.609 SNMP monitoring:           YES
           7:20241017:014653.609 IPMI monitoring:           YES
           7:20241017:014653.609 Web monitoring:            YES
           7:20241017:014653.609 VMware monitoring:         YES
           7:20241017:014653.609 SMTP authentication:       YES
           7:20241017:014653.609 ODBC:                      YES
           7:20241017:014653.609 SSH support:               YES
           7:20241017:014653.609 IPv6 support:              YES
           7:20241017:014653.609 TLS support:               YES
           7:20241017:014653.609 ******************************
           7:20241017:014653.609 using configuration file: /etc/zabbix/zabbix_server.conf
           7:20241017:014653.654 current database version (mandatory/optional): 06000000/06000018
           7:20241017:014653.654 required mandatory version: 07000000
           7:20241017:014653.654 mandatory patches were found
           7:20241017:014653.664 cannot perform database upgrade: node "<standalone server>" is still running, if node is unreachable it will be skipped in 58s
           7:20241017:014653.665 Zabbix Server stopped. Zabbix 7.0.1 (revision 0543fbe).
      • Upon investigation, we determined that this behavior was the result of the environment Zabbix server instance already having a connection to the DB. To resolve this, we shut down the environment Zabbix server instance and brought the isolated Zabbix server instance back up, which allowed the database upgrade to proceed successfully. Once the database upgrade completed, we shut down the isolated Zabbix server instance and brought the environment Zabbix server instance back online (see the check sketched after this list).
      • We also understand that with Zabbix 6.0.x and above, only a single Zabbix server instance can be active due to the HA (high availability) functionality, even if HA is disabled and the Zabbix server instance is operating in standalone mode. Regardless of whether HA is enabled, and whether the Zabbix server instance runs standalone or in an HA (active/standby) setup, the Zabbix server runs an HA manager process which detects whether another Zabbix server instance is active.
      • Even in HA mode, the active/standby Zabbix server instances must be stopped, and one and only one of them can be started back up in standalone mode to perform the database upgrade, so: 1) downtime would still be incurred while the database upgrade is in progress, and 2) even if you could keep the active/standby Zabbix server instances up and running, you wouldn't be able to leverage the standby instance to perform the database upgrade, as a standby node only runs the HA manager process.
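
      For anyone hitting the same message: the check appears to be driven by the ha_node table, which, like everything else, is replicated into the green cluster, so the isolated instance still sees a recently active node row from the blue-side server. A sketch of how to inspect this before starting the isolated upgrade instance, assuming a mysql client and the standard schema; the endpoint and credentials are placeholders:

        # Inspect the HA node registry on the green cluster. A recent lastaccess
        # on the standalone row (shown as "<standalone server>" in the log)
        # means that server is still considered running, and the upgrade will
        # refuse to start until the row goes stale (the failover delay, ~1 min).
        mysql -h <green-cluster-writer-endpoint> -u zabbix -p zabbix -e "
          SELECT name, address, port, status,
                 FROM_UNIXTIME(lastaccess) AS last_seen
            FROM ha_node;"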

       

      Overall, shutting down our environment Zabbix server instance, standing up an isolated Zabbix server instance to perform the database upgrade against the non-active green cluster, promoting that cluster to active, and then upgrading the remaining Zabbix components (frontend, proxy, and agent) may be viable in our Stage and Dev environments because the Zabbix DB is small (~10 GB in Stage, ~30 GB in Dev), so Zabbix server downtime would be minimal (estimated 5-20 minutes).

       

      However, the dataset in our Prod environment is much larger (175-200 GB), and this process may not be suitable there. Also, whether we went the AWS blue/green deployment route or simply upgraded the active Production Zabbix DB directly, the Zabbix server instance would still be unavailable until the database upgrade runs to completion, which we estimate at ~2-3 hours in our Prod environment.

       

      We do understand that the Zabbix proxies (6.0.13) in the environment will store data sent from our monitored devices until the Zabbix server is available again, which minimizes the risk of data loss. But with the Zabbix server unavailable during the database upgrade, what kind of degraded functionality can we expect: degraded web UI functionality, missing graph data until the Zabbix server is back and the proxies send their stored data, etc.?
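
      On the proxy side, one thing we plan to double-check before the maintenance window is the offline buffer sizing. A sketch, assuming the stock zabbix_proxy.conf parameters (ProxyOfflineBuffer is expressed in hours) and the ZBX_* environment variable mapping used by the official images; the values are illustrative:

        # Proxy settings that control how long collected values are kept
        # while the server is unreachable.
        grep -E '^(ProxyLocalBuffer|ProxyOfflineBuffer)' /etc/zabbix/zabbix_proxy.conf

        # With the Docker images the same settings can be passed as environment
        # variables when (re)creating the proxy container, e.g. a 4-hour buffer:
        docker run -d --name zabbix-proxy \
          -e ZBX_SERVER_HOST=zabbix-server \
          -e ZBX_PROXYOFFLINEBUFFER=4 \
          zabbix/zabbix-proxy-sqlite3:7.0-latest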

       

      Also, our Prod dataset is still relatively small (175-200 GB), so what do customers with terabytes/petabytes of data do? Do these customers leverage some form of multi-master setup for the DB to perform zero-downtime Zabbix upgrades?

       

      Any guidance that can be provided on how we can best achieve a zero-downtime upgrade (or close to it) would be appreciated.

       

      Thanks in advance!

            Assignee: Unassigned
            Reporter: alghani (Tariq Nuriddin)
            Votes: 0
            Watchers: 3