Unexpected behavior in Zabbix native HA when the master process is killed


    • Type: Problem report
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version/s: 7.0.22, 7.4.6, 8.0.0alpha1
    • Component/s: Server (S)
    • Support backlog

      Action

      Kill the zabbix_server main process on the active node.

      Expected 

      Zabbix server fails over to the standby node.

      Observed

      Zabbix server does not respond to runtime commands and does not fail over to the secondary node. At least some data continues to come in (how long this lasts was not tested).

      Replication:

      Two cleanly installed Ubuntu servers with the latest patches:
      192.168.196.23 ha1.local ha1
      192.168.196.24 ha2.local ha2

       

      # lsb_release -a
      No LSB modules are available.
      Distributor ID: Ubuntu
      Description:    Ubuntu 24.04.3 LTS
      Release:        24.04
      Codename:       noble
      

      Standard Zabbix 7.0.22 installation on Ubuntu; the frontend and DB run on ha1 for testing.

      Zabbix 7.0.22 is installed on both servers.
      ha1 packages

      zabbix-agent/zabbix,now 1:7.0.22-1+ubuntu24.04 amd64 [installed]
      zabbix-frontend-php/zabbix,now 1:7.0.22-1+ubuntu24.04 all [installed]
      zabbix-nginx-conf/zabbix,now 1:7.0.22-1+ubuntu24.04 all [installed]
      zabbix-release/zabbix,zabbix,now 1:7.0-2+ubuntu24.04 all [installed]
      zabbix-server-mysql/zabbix,now 1:7.0.22-1+ubuntu24.04 amd64 [installed]
      zabbix-sql-scripts/zabbix,now 1:7.0.22-1+ubuntu24.04 all [installed]

      ha2 packages

      zabbix-agent/zabbix,now 1:7.0.22-1+ubuntu24.04 amd64 [installed]
      zabbix-release/zabbix,zabbix,now 1:7.0-2+ubuntu24.04 all [installed]
      zabbix-server-mysql/zabbix,now 1:7.0.22-1+ubuntu24.04 amd64 [installed]
      

      Servers configured in HA mode
      ha1 

      HANodeName=ha1
      NodeAddress=192.168.196.23:10051
      

      ha2

      HANodeName=ha2
      NodeAddress=192.168.196.24:10051 
      

       

      Processes on ha1

      systemd-+-ModemManager---3*[{ModemManager}]
              |-agetty
              |-cron
              |-dbus-daemon
              |-mariadbd---79*[{mariadbd}]
              |-multipathd---6*[{multipathd}]
              |-nginx---7*[nginx]
              |-php-fpm8.3---9*[php-fpm8.3]
              |-polkitd---3*[{polkitd}]
              |-rsyslogd---3*[{rsyslogd}]
              |-snmpd
              |-sshd---sshd---sshd---bash---tmux: client
              |-systemd-+-(sd-pam)
              |         `-dbus-daemon
              |-systemd-journal
              |-systemd-logind
              |-systemd-network
              |-systemd-resolve
              |-systemd-timesyn---{systemd-timesyn}
              |-systemd-udevd
              |-tmux: server---bash---sudo---sudo---su---bash---pstree
              |-udisksd---5*[{udisksd}]
              |-unattended-upgr---{unattended-upgr}
              `-zabbix_server-+-45*[zabbix_server]
                              |-zabbix_server---16*[{zabbix_server}]
                              |-zabbix_server---5*[{zabbix_server}]
                              `-4*[zabbix_server---{zabbix_server}] 

      Processes on ha2

      systemd-+-ModemManager---3*[{ModemManager}]
              |-agetty
              |-cron
              |-dbus-daemon
              |-multipathd---6*[{multipathd}]
              |-polkitd---3*[{polkitd}]
              |-rsyslogd---3*[{rsyslogd}]
              |-snmpd
              |-sshd---sshd---sshd---bash---tmux: client
              |-systemd-+-(sd-pam)
              |         `-dbus-daemon
              |-systemd-journal
              |-systemd-logind
              |-systemd-network
              |-systemd-resolve
              |-systemd-timesyn---{systemd-timesyn}
              |-systemd-udevd
              |-tmux: server---bash---sudo---sudo---su---bash---pstree
              |-udisksd---5*[{udisksd}]
              |-unattended-upgr---{unattended-upgr}
              `-zabbix_server---zabbix_server 

      Zabbix server reports a working HA cluster with a failover delay of 60 seconds

      root@ha1:~# zabbix_server -R ha_status
      Failover delay: 60 seconds
      Cluster status:
         #  ID                        Name                      Address                        Status      Last Access
         1. cmlfdgn8f0001b8819kmay2tp ha1                       192.168.196.23:10051           active      1s
         2. cmlfdgub50001le82yj2ulgwr ha2                       192.168.196.24:10051           standby     5s
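
When repeating this test, it can help to read the active node name from the `ha_status` output with a script rather than by eye. The following is a minimal sketch, assuming the column layout shown above (number, ID, name, address, status, last access) and node names without spaces; `active_node` is a hypothetical helper name, not part of Zabbix.

```shell
# Print the name of the node whose Status column is "active".
# Reads the text output of `zabbix_server -R ha_status` on stdin.
# Assumes the layout seen in this report:
#   <n>. <ID> <Name> <Address> <Status> <Last Access>
active_node() {
    awk '$1 ~ /^[0-9]+\.$/ && $5 == "active" { print $3 }'
}
```

Usage on the active node would look like `zabbix_server -R ha_status | active_node`. On a standby node or a hung server the command prints an error instead of the table, so the helper prints nothing.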
      

      Find and kill the main process of the active Zabbix server

      root@ha1:~# ps -ef|grep "zabbix_server -c"
      zabbix      1992       1  0 14:51 ?        00:00:00 /usr/sbin/zabbix_server -c /etc/zabbix/zabbix_server.conf
      root        2110    1291  0 14:54 pts/2    00:00:00 grep --color=auto zabbix_server -c
      root@ha1:~# kill -9 1992
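
The PID lookup above can be scripted so that the `grep` process itself is never picked up. This is a sketch only: it assumes `ps -ef` column order (UID, PID, PPID, C, STIME, TTY, TIME, CMD) and that the main process is started with `-c` and has PPID 1, as in the listing above. `main_zabbix_pid` is a hypothetical helper name.

```shell
# Print the PID of the zabbix_server main process.
# Reads `ps -ef` output on stdin and keeps rows where:
#   - the parent PID (column 3) is 1,
#   - the command (column 8) is zabbix_server, with or without a path,
#   - the next argument (column 9) is -c, matching the grep used above.
main_zabbix_pid() {
    awk '$3 == 1 && $8 ~ /(^|\/)zabbix_server$/ && $9 == "-c" { print $2 }'
}
```

Usage: `ps -ef | main_zabbix_pid`. The PPID filter is only meaningful before the kill; afterwards the orphaned children are also reparented to PID 1.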
      

      The remaining zabbix_server child processes are now orphaned and reparented to systemd

      systemd-+-ModemManager---3*[{ModemManager}]
              |-agetty
              |-cron
              |-dbus-daemon
              |-mariadbd---79*[{mariadbd}]
              |-multipathd---6*[{multipathd}]
              |-nginx---7*[nginx]
              |-php-fpm8.3---10*[php-fpm8.3]
              |-polkitd---3*[{polkitd}]
              |-rsyslogd---3*[{rsyslogd}]
              |-snmpd
              |-sshd---sshd---sshd---bash---tmux: client
              |-systemd-+-(sd-pam)
              |         `-dbus-daemon
              |-systemd-journal
              |-systemd-logind
              |-systemd-network
              |-systemd-resolve
              |-systemd-timesyn---{systemd-timesyn}
              |-systemd-udevd
              |-tmux: server---bash---sudo---sudo---su---bash---pstree
              |-udisksd---5*[{udisksd}]
              |-unattended-upgr---{unattended-upgr}
              |-45*[zabbix_server]
              |-zabbix_server---16*[{zabbix_server}]
              |-zabbix_server---5*[{zabbix_server}]
              `-4*[zabbix_server---{zabbix_server}]
       

       

      Start of the waiting period: Zabbix server does not switch to the secondary node, which remains in standby

      root@ha1:~# date
      Tue Feb 10 02:55:06 PM UTC 2026
      root@ha1:~# zabbix_server -R ha_status
      zabbix_server [2125]: Cannot perform runtime control command: Timeout while waiting for response
      

      After waiting three minutes, nothing has changed

      root@ha1:~# date
      Tue Feb 10 02:58:06 PM UTC 2026
      root@ha1:~# zabbix_server -R ha_status
      zabbix_server [2145]: Cannot perform runtime control command: Timeout while waiting for response
      

      No change on the secondary node either

      root@ha2:~# date
      Tue Feb 10 03:00:14 PM UTC 2026
      root@ha2:~# zabbix_server -R ha_status
      Runtime commands can be executed only in active mode
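
The three kinds of `ha_status` responses seen in this report (cluster table, timeout, standby refusal) can be told apart mechanically, which is handy when polling both nodes during the waiting period. A minimal sketch, matching only on the message texts quoted above; `ha_state` and the state labels it prints are hypothetical names chosen for illustration.

```shell
# Classify the text printed by `zabbix_server -R ha_status`.
# $1: the captured output (stdout and stderr combined).
# Prints one of: active, standby, unresponsive, unknown.
ha_state() {
    case "$1" in
        *"Timeout while waiting for response"*) echo "unresponsive" ;;
        *"only in active mode"*)                echo "standby" ;;
        *"Cluster status:"*)                    echo "active" ;;
        *)                                      echo "unknown" ;;
    esac
}
```

Usage: `ha_state "$(zabbix_server -R ha_status 2>&1)"`. In the failure described here, ha1 would be classified as unresponsive and ha2 as standby, so no node reports itself as active.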

       

      Meanwhile, the Zabbix server on ha1 keeps writing messages like these to its log

      2001:20260210:150158.744 cannot send history syncer notification
      2001:20260210:150216.760 cannot write to IPC socket: Broken pipe
      2001:20260210:150216.760 cannot send history syncer notification
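
To see how often the orphaned processes hit these errors, the two message types can be counted from the log. This is a sketch that matches only the exact message texts quoted above; `ipc_error_summary` is a hypothetical helper name.

```shell
# Count the two IPC-related error messages seen after the main process
# was killed. Reads Zabbix server log lines on stdin and prints one
# summary line.
ipc_error_summary() {
    awk '
        /cannot send history syncer notification/ { notify++ }
        /cannot write to IPC socket: Broken pipe/ { pipe++ }
        END { printf "syncer_notification=%d broken_pipe=%d\n", notify, pipe }
    '
}
```

Usage: `ipc_error_summary < /var/log/zabbix/zabbix_server.log`, where the log path is an assumption (it is not given in this report) and should be adjusted to the `LogFile` setting in use.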
      

       

      The service does not react to a restart; the processes can only be killed.
      After killing the remaining processes on ha1 and starting Zabbix server on ha1 again, the cluster switched to the second node

      root@ha2:~# zabbix_server -R ha_status
      Failover delay: 60 seconds
      Cluster status:
         #  ID                        Name                      Address                        Status      Last Access
         1. cmlfdgn8f0001b8819kmay2tp ha1                       192.168.196.23:10051           standby     3s
         2. cmlfdgub50001le82yj2ulgwr ha2                       192.168.196.24:10051           active      2s

         
      The server log file contains the message

       857:20260210:150823.947 "ha2" node switched to "active" mode
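
For comparison with the 60-second failover delay, the time between the first failed `ha_status` call on ha1 (14:55:06) and the switch logged above (15:08:23) can be computed from the timestamps. A minimal sketch, assuming both timestamps fall on the same day and that the two hosts' clocks and time zones agree; the helper names are hypothetical.

```shell
# Convert an HHMMSS string (the time part of a Zabbix log timestamp,
# fractional seconds ignored) to seconds since midnight.
hhmmss_to_seconds() {
    echo "$1" | awk '{ print substr($0,1,2)*3600 + substr($0,3,2)*60 + substr($0,5,2) }'
}

# Print the number of seconds between two same-day HHMMSS timestamps.
# $1: start time, $2: end time.
elapsed_seconds() {
    start=$(hhmmss_to_seconds "$1")
    end=$(hhmmss_to_seconds "$2")
    echo $((end - start))
}
```

`elapsed_seconds 145506 150823` prints `797`, that is 13 minutes 17 seconds without an active node, and the switch happened only after the manual cleanup and restart on ha1.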

            Assignee:
            Zabbix Development Team
            Reporter:
            Guntis Liepins
