ZBX-26014: Zabbix passive agent checks fail randomly due to DNS resolution issues (libevent)


    • Type: Problem report
    • Resolution: Unresolved
    • Priority: Trivial
    • Fix Version/s: 7.2.5rc1, 7.4.0alpha2 (master)
    • Affects Version/s: 7.0.9, 7.2.2, 7.2.3
    • Component/s: Proxy (P), Server (S)
    • Labels: None
    • Sprint: S25-W8/9
    • Story Points: 3

      Steps to reproduce:

      Perform a clean installation of 7.2.x in any form (RPM packages or a container).

      Result:
      Hosts with passive agent checks go up and down randomly and do not collect items as expected.
      Expected:
      Stable host status and reliable collection of passive agent items.

       

      We attempted to upgrade from 6.4 to 7.2 but had to roll back due to this unfortunate experience. We run sites with ~2K hosts monitored this way.

      We were also able to narrow down what we believe to be the root cause of the problem:

      The problem shows up for both passive agent checks and SNMP checks; for SNMP checks, however, it resolves itself after a while, as those devices are usually added with static IPv4 assignments.

      For DHCP-allocated hosts the problem is very real, and we could not find a workaround either on the DNS side or by tweaking the OS/container setup.

      On the DNS side we observe A and AAAA queries sent in parallel (which is normal) and responses coming back in no particular order. On the Zabbix side we see loads of log entries like this:
      Zabbix agent item "system.cpu.util" on host "****" failed: first network error, wait for 15 seconds
      Querying is done over UDP, so we have no control over the order in which responses are delivered and processed on the Zabbix side.

      Using tcpdump we determined that the failure happens mostly when AAAA records are delivered before A records. This is where DHCP helped isolate the cause: AAAA responses arrive instantly because they are cached with long TTLs (the hosts have no IPv6 addresses), while IPv4 addresses take longer to resolve once the cached entry expires due to TTL expiration. At exactly that moment it is clearly visible how hosts toggle from online to unavailable. Otherwise the failures are fairly random, depending on which replies are delivered first.

      Rewriting AAAA responses to NXDOMAIN predictably takes the host down, in line with the standard: NXDOMAIN declares the whole name nonexistent, not just the AAAA record (the correct "no IPv6 address" answer is NOERROR with an empty answer section). Disabling IPv6 in the OS/container kernel has no effect, as libevent's implementation of getaddrinfo clearly disregards it. Adjusting the getaddrinfo-allow-skew option (libevent's invention) in /etc/resolv.conf to tolerate bigger delays between the AAAA and A responses has no effect either.
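
      As an illustration of where that decision is actually made (a sketch of ours, not Zabbix code; start_resolve is a hypothetical wrapper): libevent's evdns_getaddrinfo() appears to choose which queries to send based on the hints supplied by the caller rather than on kernel IPv6 state, which would explain why the OS-level toggles above change nothing:

#include <string.h>
#include <sys/socket.h>
#include <event2/dns.h>
#include <event2/util.h>

/* hypothetical wrapper, not Zabbix code: with AF_UNSPEC in the hints,
 * libevent issues A and AAAA queries in parallel; narrowing ai_family
 * to AF_INET would suppress the AAAA query altogether */
static void start_resolve(struct evdns_base *dns, const char *host,
		evdns_getaddrinfo_cb cb, void *arg)
{
	struct evutil_addrinfo	hints;

	memset(&hints, 0, sizeof(hints));
	hints.ai_family = AF_INET;	/* AF_UNSPEC -> parallel A + AAAA */
	hints.ai_socktype = SOCK_STREAM;

	(void)evdns_getaddrinfo(dns, host, NULL, &hints, cb, arg);
}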

      On top of that, inspecting the async_dns_event code (asyncpoller.c) and adding simple logging guards reveals that libevent always delivers only a single evutil_addrinfo in the ai parameter, instead of the potentially expected two: one for the AAAA record and another for the A record.
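
      For reference, a simplified sketch of such a logging guard (ours; the actual patches are attached to this issue): it walks the evutil_addrinfo chain handed to the evdns_getaddrinfo() callback and prints every entry, which is how the single-entry behaviour shows up:

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <event2/dns.h>
#include <event2/util.h>

/* sketch of a logging guard for the evdns_getaddrinfo() callback: walk the
 * returned chain and print every entry; per the observation above the chain
 * always has exactly one entry instead of the expected two (AAAA + A) */
static void log_addrinfo_chain(int result, struct evutil_addrinfo *ai, void *arg)
{
	int	n = 0;
	char	buf[INET6_ADDRSTRLEN];

	for (struct evutil_addrinfo *p = ai; NULL != p; p = p->ai_next)
	{
		/* assumes entries are AF_INET or AF_INET6, which is all
		 * evdns_getaddrinfo() produces */
		const void	*addr = AF_INET == p->ai_family
				? (const void *)&((struct sockaddr_in *)p->ai_addr)->sin_addr
				: (const void *)&((struct sockaddr_in6 *)p->ai_addr)->sin6_addr;

		evutil_inet_ntop(p->ai_family, addr, buf, sizeof(buf));
		printf("entry %d: family %d addr %s\n", ++n, p->ai_family, buf);
	}

	printf("result %d, %d entries in chain\n", result, n);

	if (NULL != ai)
		evutil_freeaddrinfo(ai);
	(void)arg;
}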

       

      Would it be possible either to have an option to avoid libevent's implementation of getaddrinfo (and use the OS implementation instead), or to implement resolution using evdns_base_resolve_ipv4/evdns_base_resolve_ipv6 in a manner similar to how the reverse lookup is done (as seen in the async_event implementation in the same source file)?
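
      To make the second option concrete, here is a rough sketch of ours (names such as dns_lookup_t are hypothetical, error handling omitted): both queries are issued explicitly via evdns_base_resolve_ipv4()/evdns_base_resolve_ipv6(), and the outcome is decided only once both callbacks have fired, so the AAAA-before-A delivery order can no longer matter:

#include <stdio.h>
#include <string.h>
#include <netinet/in.h>
#include <event2/dns.h>

/* hypothetical per-lookup state; names are ours, not Zabbix's */
typedef struct
{
	int		pending;	/* outstanding queries: 2 -> 0 */
	int		have_a;
	int		have_aaaa;
	struct in_addr	a;		/* first A answer, if any */
	struct in6_addr	aaaa;		/* first AAAA answer, if any */
}
dns_lookup_t;

/* one callback serves both record types, mirroring how the reverse lookup
 * in async_event already uses the evdns_base_resolve_* family */
static void resolve_cb(int result, char type, int count, int ttl, void *addresses, void *arg)
{
	dns_lookup_t	*lookup = (dns_lookup_t *)arg;

	if (DNS_ERR_NONE == result && 0 < count)
	{
		if (DNS_IPv4_A == type)
		{
			memcpy(&lookup->a, addresses, sizeof(lookup->a));
			lookup->have_a = 1;
		}
		else if (DNS_IPv6_AAAA == type)
		{
			memcpy(&lookup->aaaa, addresses, sizeof(lookup->aaaa));
			lookup->have_aaaa = 1;
		}
	}

	/* decide only after both callbacks have fired, so the order in which
	 * the A and AAAA responses arrive can no longer fail the check */
	if (0 == --lookup->pending)
		printf("lookup done: have_a=%d have_aaaa=%d\n", lookup->have_a, lookup->have_aaaa);

	(void)ttl;
}

static void start_lookup(struct evdns_base *dns, const char *host, dns_lookup_t *lookup)
{
	memset(lookup, 0, sizeof(*lookup));
	lookup->pending = 2;
	(void)evdns_base_resolve_ipv4(dns, host, 0, resolve_cb, lookup);
	(void)evdns_base_resolve_ipv6(dns, host, 0, resolve_cb, lookup);
}

      Since evdns invokes the callback for timeouts and other errors as well, the pending counter always reaches zero and the decision is made exactly once per lookup.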

      Attachments:

        1. rearm_timer.diff (3 kB)
        2. samplelog1.log (332 kB)
        3. samplelog2.log (566 kB)
        4. ZBX-26014-7.3.diff (1 kB)
        5. ZBX-26014-libevent-logging.diff (0.8 kB)
        6. ZBX-26014-libevent-logging-dns-only.diff (0.9 kB)

            Assignee: Vladislavs Sokurenko
            Reporter: Mike
            Team: Team A
            Votes: 0
            Watchers: 8
