ZABBIX BUGS AND ISSUES / ZBX-5887

Housekeeper delete_history function does not respect MaxHousekeeperDelete - in slow databases results in broken DM sync


    • Type: Incident report
    • Resolution: Unsupported version
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: 2.0.3
    • Component/s: Server (S)

      We have a DM configuration with 1 master and 3 slave nodes, and I noticed that some data from one of the slave nodes was simply not being synced - it was consistently trying to send the same ~300K of history_sync data over and over again.

      After much investigation I discovered that the "INSERT INTO history ..." query of the sync process was unable to proceed because of row locks held for a long time by the housekeeper performing a "DELETE FROM history WHERE itemid = ___ AND clock < ____". These queries sometimes took thousands of seconds to complete (we have about 40G of data in our master's history table) and held the row locks much longer than the innodb_lock_wait_timeout default of 50 seconds - even increasing this to 10 or 15 minutes was not enough.

      I discovered the MaxHousekeeperDelete option of the Zabbix Server and set it to 1000, as our MySQL instance seems OK with LIMIT 1000 on the history table deletes (they complete in anywhere between 0 and 5 seconds).

      However, the housekeeper delete_history function does not respect MaxHousekeeperDelete, only the housekeeping_cleanup function checks for it.

      After applying the attached patch we now have working DM sync, and while the housekeeper is probably taking longer to clean up, it is no longer keeping rows locked for long periods and preventing other queries on history from running.
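For readers who want to see the idea behind a limited delete, here is a minimal sketch in Python. It is not the attached patch and not Zabbix's C code: the function name delete_history_batched, the simplified history table (itemid, clock, value) and the use of SQLite are all assumptions made so the example runs on its own. The point is only that each DELETE touches at most max_delete rows and is committed separately, so no single statement holds locks for long. Unlike this loop, a housekeeper could also delete just one batch per cycle and leave the rest for the next run.

```python
import sqlite3


def delete_history_batched(conn, itemid, min_clock, max_delete=1000):
    """Delete rows of one item older than min_clock, max_delete rows at a time.

    Hypothetical helper for illustration. Each batch is committed on its
    own, so locks are released between batches. Returns the total number
    of rows deleted.
    """
    if max_delete <= 0:
        raise ValueError("max_delete must be positive")
    total = 0
    while True:
        # SQLite lacks DELETE ... LIMIT by default, so the limit is
        # applied in a subquery that selects the rowids to remove.
        cur = conn.execute(
            "DELETE FROM history WHERE rowid IN ("
            " SELECT rowid FROM history"
            " WHERE itemid = ? AND clock < ? LIMIT ?)",
            (itemid, min_clock, max_delete),
        )
        conn.commit()  # end the short transaction for this batch
        total += cur.rowcount
        if cur.rowcount < max_delete:
            # A short (or empty) batch means nothing older is left.
            return total
```

On MySQL the same effect would come from appending LIMIT to the DELETE itself, as described above.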

      The patch works with MySQL but does not implement the alternative limiting queries for other database engines such as PostgreSQL or Oracle. It certainly solved our problem, though, and hopefully brings to light a potential issue in the housekeeping process for resolution in future versions.
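As a rough illustration of what "alternative limiting queries" could look like, the sketch below builds one limited DELETE per engine. The helper name limited_delete_sql is hypothetical, and the statements are common idioms for each engine, not what Zabbix or the attached patch actually emits: MySQL accepts LIMIT directly on DELETE, PostgreSQL needs the limit in a subquery (here on the ctid system column), and Oracle can filter on ROWNUM.

```python
def limited_delete_sql(dialect, itemid, min_clock, max_delete):
    """Return a DELETE statement that removes at most max_delete rows.

    Hypothetical helper for illustration. All values are cast to int
    before being placed in the SQL text, so no quoting is needed.
    """
    itemid, min_clock, n = int(itemid), int(min_clock), int(max_delete)
    if n <= 0:
        raise ValueError("max_delete must be positive")
    where = f"itemid = {itemid} AND clock < {min_clock}"
    if dialect == "mysql":
        # MySQL supports LIMIT directly on single-table DELETE.
        return f"DELETE FROM history WHERE {where} LIMIT {n}"
    if dialect == "postgresql":
        # PostgreSQL has no DELETE ... LIMIT; limit a subquery instead.
        return (
            "DELETE FROM history WHERE ctid IN "
            f"(SELECT ctid FROM history WHERE {where} LIMIT {n})"
        )
    if dialect == "oracle":
        # Oracle: ROWNUM caps the number of rows the statement touches.
        return f"DELETE FROM history WHERE {where} AND ROWNUM <= {n}"
    raise ValueError(f"unsupported dialect: {dialect}")
```

Whether these forms perform well on a large history table would need to be measured per engine.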

      The patch was made against 1.8.12, but the code does not look any different in 2.0.3.

            Assignee: Unassigned
            Reporter: Johan Venter (johan.mach)