Incident Report: Zabbix Server transient DB loss during MySQL restart

    • Type: Problem report
    • Resolution: Unresolved
    • Priority: Trivial
    • None
    • Affects Version/s: 8.0.0beta1
    • Component/s: Server (S)
    • None

      🐞 Zabbix Server transient mass “not supported (database error)” state during MySQL restart

      Summary:
      During a short MySQL restart, Zabbix Server (8.0.0 beta1) temporarily marks a large number of items and discovery rules as “not supported (database error)”, even though database connectivity is restored automatically within seconds and the server fully recovers without a restart.

      This causes unnecessary monitoring noise and temporarily inconsistent item and discovery rule states during routine database maintenance.

      Environment:

      • Zabbix Server: 8.0.0 beta1
      • OS: Ubuntu Server 26.04 (systemd)
      • Database: MySQL 8.4.x
      • DB connection: local UNIX socket (/var/run/mysqld/mysqld.sock)
      • Deployment: single-node
      • Service manager: systemd

      Steps to reproduce:
      1. Start Zabbix Server under normal load
      2. Restart MySQL:
         systemctl restart mysql
      3. Observe Zabbix Server behavior during DB outage and recovery

      Actual behavior:

      • Immediate DB loss detected:
          [Z3001] connection to database 'zabbix' failed
          Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock'
          database is down: reconnecting in 10 seconds
      • During DB downtime:
          multiple items and discovery rules become “not supported (database error)”
      • After MySQL recovery:
          database connection re-established
          all items and discovery rules return to supported state automatically
      • No Zabbix Server restart is triggered; recovery is automatic
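
To put a number on the outage window, the log messages quoted above can be paired up and timed. The following is a minimal sketch, not a Zabbix tool: it assumes the server log uses the `<pid>:<YYYYMMDD>:<HHMMSS.mmm> <message>` line prefix and that recovery is logged as "database connection re-established". The function name `outage_windows` is hypothetical.

```python
import re
from datetime import datetime

# Assumed log prefix: "<pid>:<YYYYMMDD>:<HHMMSS.mmm> <message>"
LINE_RE = re.compile(r"^\s*(\d+):(\d{8}):(\d{6}\.\d{3})\s+(.*)$")

def outage_windows(lines):
    """Return (start, end, seconds) for each DB outage found in the log lines.

    Coarse by design: the window opens at the first [Z3001] message from any
    process and closes at the first "re-established" message from any process.
    """
    windows = []
    start = None
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines without the assumed prefix
        ts = datetime.strptime(m.group(2) + m.group(3), "%Y%m%d%H%M%S.%f")
        msg = m.group(4)
        if start is None and "[Z3001]" in msg:
            start = ts
        elif start is not None and "database connection re-established" in msg:
            windows.append((start, ts, (ts - start).total_seconds()))
            start = None
    return windows
```

Feeding it the server log (for example `outage_windows(open(path))`, where `path` is whatever `LogFile` points to) gives one tuple per outage, which can be compared with the window reported under Frequency.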

      Expected behavior:
      A short DB outage (such as a MySQL restart) should not cause mass transitions of items/discovery rules into the “not supported” state. Zabbix Server should apply a short grace period or otherwise suppress propagation of transient DB failures.
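
The grace-period idea can be illustrated with a small state holder. This is only a sketch of the requested behavior under stated assumptions, not how Zabbix Server is implemented: the class `OutageGrace`, its methods, and the 10-second default are all hypothetical.

```python
import time

class OutageGrace:
    """Delay propagation of DB errors until the outage outlasts a grace period."""

    def __init__(self, grace_seconds=10.0, clock=time.monotonic):
        self.grace = grace_seconds   # hypothetical tunable, not a Zabbix parameter
        self.clock = clock           # injectable clock, so the logic is testable
        self.down_since = None       # time of the first failure in this outage

    def db_failed(self):
        # Remember only the first failure; repeated failures keep the same start.
        if self.down_since is None:
            self.down_since = self.clock()

    def db_recovered(self):
        self.down_since = None

    def should_mark_unsupported(self):
        # Propagate the error only once the outage has lasted the full grace period.
        if self.down_since is None:
            return False
        return self.clock() - self.down_since >= self.grace
```

With such a gate, a restart that completes within the grace period would leave item and discovery rule states untouched, while a longer outage would still surface as an error.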

      Impact:

      • No data loss
      • No Zabbix server crash
      • No manual intervention required
      • However:
          - temporary large-scale monitoring noise
          - state flapping across many items
          - reduced clarity during maintenance windows

      Frequency:

      • 100% reproducible during MySQL restart
      • Window: ~10–20 seconds

      Severity:
      Medium (operational noise / observability instability, not functional failure)

      ──────────────────────────────────────────────
      Systemd configuration verification (mitigation test)
      ──────────────────────────────────────────────

      During investigation, systemd configuration was validated to ensure the issue is not caused by service dependency propagation or restart ordering.

      Final tested systemd unit used during reproduction:

      [Unit]
      Description=Zabbix Server
      After=network.target mysql.service mysqld.service mariadb.service
      Wants=mysql.service

      [Service]
      Environment="CONFFILE=/etc/zabbix/zabbix_server.conf"
      EnvironmentFile=-/etc/default/zabbix-server

      Type=forking
      PIDFile=/run/zabbix/zabbix_server.pid

      ExecStart=/usr/sbin/zabbix_server -c $CONFFILE
      ExecStop=/bin/sh -c '[ -n "$MAINPID" ] && kill -TERM "$MAINPID"'

      Restart=on-failure
      RestartSec=10s

      TimeoutStartSec=infinity
      TimeoutStopSec=infinity

      KillMode=control-group

      LimitNOFILE=65536:1048576

      StartLimitIntervalSec=30s
      StartLimitBurst=5

      Result of systemd verification:

      • Zabbix Server does NOT stop when MySQL restarts
      • No StopPropagatedFrom or dependency cascade behavior is involved
      • No systemd-triggered restart occurs
      • DB outage handling is fully internal to Zabbix process logic
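
The dependency check can be repeated on another host by inspecting the unit's properties. The sketch below parses the `Key=value` output of `systemctl show` and reports any dependency type that would propagate a stop or restart from MySQL. The helper names are hypothetical, and the unit name `zabbix-server.service` in the usage note is an assumption, since the report does not state it.

```python
import subprocess

# Dependency types through which a stop/restart of the other unit propagates.
# Wants= and After= do not propagate stops, so they are not listed here.
HARD_DEPS = ("Requires", "BindsTo", "PartOf")

def parse_show(text):
    """Parse `systemctl show` output into {property: [unit, ...]}."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key] = value.split()
    return props

def stop_propagating_deps(props, unit):
    """Return the hard dependency types that reference `unit`."""
    return [k for k in HARD_DEPS if unit in props.get(k, [])]

def query(unit):
    """Run `systemctl show` for the properties of interest (needs systemd)."""
    args = ["systemctl", "show", unit]
    for prop in HARD_DEPS + ("Wants", "After"):
        args += ["-p", prop]
    return subprocess.run(args, capture_output=True, text=True, check=True).stdout
```

For example, `stop_propagating_deps(parse_show(query("zabbix-server.service")), "mysql.service")` returning an empty list would be consistent with the result above: the unit only has `Wants=` and `After=` on MySQL, neither of which stops Zabbix Server when MySQL restarts.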

      Conclusion:
      The issue is independent of systemd configuration. It is caused by Zabbix's internal database reconnection handling, which propagates short DB outages into mass “not supported” state transitions.

      Suggested improvement:

      • Introduce a configurable DB outage grace period (e.g. 5–10 seconds)
      • Suppress transient unsupported state transitions during short DB outages
      • Improve DB reconnect state handling to reduce monitoring noise during maintenance windows

            Assignee:
            Zabbix Support Team
            Reporter:
            Alex
            Votes:
            0
            Watchers:
            1

              Created:
              Updated: