  ZABBIX BUGS AND ISSUES / ZBX-17674

No way for history export module to gracefully handle backend downtime

Details

    • Team A
    • Sprint 79 (Aug 2021), Sprint 80 (Sep 2021), Sprint 81 (Oct 2021), Sprint 82 (Nov 2021), Sprint 83 (Dec 2021)

    Description

      Thanks to ZBXNEXT-3353, loadable modules can be used for history export.

      Let's imagine a setup where a loadable module is used to export history data from Zabbix to some sort of external storage backend in real time:
      Zabbix → module → storage

      Now let's imagine that the external storage is unavailable or has severe performance issues:
      Zabbix → module ↛ storage

      The issue is that a loadable module has no way to handle this situation gracefully. History export modules are not first-class citizens in Zabbix: there is no caching/buffering and no retry logic on the Zabbix side that would let a module pause sending data to an alternative backend and catch up later. The module has a single chance to send data to the storage backend. If the backend is having performance issues, the module is caught between a hammer and an anvil. The following options are available (and none of them is particularly good):

      1. Drop the data. A very simple solution with no performance impact. But it causes data loss, which would undermine the trust of module users, and eventually no one would use the module.
      2. Buffering. A viable solution for temporary backend performance problems, but it significantly increases the complexity of the module. Data loss could still happen if Zabbix stopped while the module held a lot of buffered data. To prevent that, the module would need to dump its buffers to disk, which increases complexity even further.
      3. Block and wait. Another simple option, and one that ensures consistency between the Zabbix database and the external backend. Timing out makes no sense with this strategy; the module has no choice but to commit fully. The downside is that when the external storage performs poorly, Zabbix performs poorly as well.
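
To make the trade-offs between the three options concrete, here is a minimal sketch of the logic only. Real history export modules are C shared libraries loaded by Zabbix; this Python model is not Zabbix code. `FlakyBackend`, `BackendDown`, `export_drop`, `BufferingExporter` and `export_block` are all hypothetical names invented for illustration, and the in-memory buffer with a size cap is just one possible design.

```python
import collections
import time


class BackendDown(Exception):
    """Raised by the hypothetical storage client when a write fails."""


class FlakyBackend:
    """Stand-in for external storage: fails the first `failures` writes."""

    def __init__(self, failures):
        self.failures = failures
        self.stored = []

    def write(self, batch):
        if self.failures > 0:
            self.failures -= 1
            raise BackendDown("storage unavailable")
        self.stored.extend(batch)


def export_drop(backend, batch):
    """Option 1: try once and discard the batch on failure (data loss)."""
    try:
        backend.write(batch)
        return True
    except BackendDown:
        return False


class BufferingExporter:
    """Option 2: keep unsent values in memory and retry on the next call."""

    def __init__(self, backend, max_values):
        self.backend = backend
        # Bounded buffer (assumption): oldest values are evicted when full.
        self.pending = collections.deque(maxlen=max_values)

    def export(self, batch):
        self.pending.extend(batch)
        try:
            self.backend.write(list(self.pending))
        except BackendDown:
            return False  # values stay buffered; lost if the process stops now
        self.pending.clear()
        return True


def export_block(backend, batch, retry_delay=0.01, sleep=time.sleep):
    """Option 3: retry with no timeout; the caller stalls until storage recovers."""
    attempts = 1
    while True:
        try:
            backend.write(batch)
            return attempts
        except BackendDown:
            attempts += 1
            sleep(retry_delay)
```

In this model, option 1 returns immediately but loses the batch, option 2 catches up on the next successful call at the cost of extra state, and option 3 never loses data but holds up its caller for as long as the backend is down.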

      I have personally chosen the last option. In a way, the module's behaviour mirrors the relationship between the Zabbix database and Zabbix server: when the database performs poorly, Zabbix may have trouble too. That's why it is very important to monitor the health of the Zabbix database, and anyone using a module that behaves like this should monitor the availability of the external storage as well, preferably with an independent monitoring setup.

      This is not ideal. I'm not saying that Zabbix should implement buffering and retries for history export; I understand that this simply moves complexity from modules to Zabbix. But since Zabbix and modules run into the same problem, maybe the same solution can be applied. I mean the "database watchdog" functionality, which in a modern Zabbix setup is a duty of the alert manager process (documented here, scroll down to "Other parameters" or search for "database down"). When the watchdog detects that the database is down, an alert is sent, bypassing the usual Zabbix pipeline (triggers → problems → actions → operations). A module could provide a similar "is external storage down" check, and Zabbix could send an alert based on it.
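
The proposed check could look roughly like the sketch below. It is a model of the idea only, not an existing Zabbix API: `StorageWatchdog`, the `check` probe supplied by the module, and the `notify` out-of-band alert sender are all hypothetical. Alerting only on state changes (down, then back up) is an assumption made here to keep the example short; it is not a description of how the real database watchdog schedules its messages.

```python
class StorageWatchdog:
    """Polls a module-provided health check and alerts on state changes."""

    def __init__(self, check, notify):
        self.check = check    # hypothetical probe: returns True if storage is reachable
        self.notify = notify  # hypothetical sender that bypasses triggers/actions
        self.was_up = True

    def poll(self):
        is_up = bool(self.check())
        if self.was_up and not is_up:
            self.notify("external storage is down")
        elif not self.was_up and is_up:
            self.notify("external storage is back up")
        self.was_up = is_up
        return is_up
```

A caller would invoke `poll()` periodically, for example `StorageWatchdog(check=module_probe, notify=send_alert).poll()`, where `module_probe` and `send_alert` are placeholders for whatever the module and Zabbix would actually provide.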

      P.S. Module users see this as a bug, and since I cannot resolve this issue on my own, I am reporting it as an incident and not a feature request. Who knows, maybe you guys will have a bright idea for resolving the issue, and then it will be just a documentation task.

          People

            mgeneralova Marina Generalova
            cyclone Glebs Ivanovskis