-
Problem report
-
Resolution: Unresolved
-
Trivial
-
7.0.9, 7.2.2, 7.2.3
-
None
-
Container images of all flavors from https://github.com/zabbix/zabbix-docker, or installation from RPM (RHEL 8 and 9 servers).
Zabbix server with/without proxy involved.
Large corporate network. Hosts have dual stack enabled (IPv4/IPv6), but only IPv4 is used in practice. The DNS system replies to AAAA queries with NOERROR and an empty answer section. It replies to A queries with valid IPv4 addresses and a TTL between 10 and 15 minutes (depending on the site and the type of host), because DHCP is involved.
-
S25-W8/9
-
3
Steps to reproduce:
Clean installation of 7.2.x in any form (install from RPM or spin up a container)
Result:
Hosts with passive agent checks go up and down randomly, and their items are not collected as expected.
Expected:
Stable status of hosts and collection of passive agent items
We attempted to upgrade from 6.4 to 7.2 but had to roll back because of this problem. We run sites with ~2K hosts monitored in this way.
We were also able to narrow down what we believe is the root cause:
This problem shows up for both passive agent checks and SNMP checks. For SNMP checks, however, it resolves itself after a while, as these devices are usually added with static IPv4 assignments.
For DHCP-allocated hosts the problem is very real, and we were not able to find a workaround, either on the DNS side or by tweaking the OS/container setup.
On the DNS side we observe A and AAAA queries sent in parallel (which is normal) and responses sent back in no specific order. On the Zabbix side we see loads of log entries like this:
Zabbix agent item "system.cpu.util" on host "****" failed: first network error, wait for 15 seconds
Querying is done over UDP, so we have no control over the order in which responses are delivered and processed on the Zabbix side.
Using tcpdump, we found that the failure happens most of the time when the AAAA response is delivered before the A response. This is where DHCP helped isolate the cause: AAAA responses are delivered almost instantly because they are cached with long TTLs (no IPv6 addresses exist for these hosts), while A records take longer to resolve whenever the cache entry has been invalidated by TTL expiration. At exactly these moments hosts can clearly be seen toggling from available to unavailable. At other times it happens fairly randomly, depending on which reply is delivered first.
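To make the suspected race easier to discuss, here is a toy model in Python. It is only a sketch of our hypothesis, not libevent or Zabbix code: the Response type and both function names are made up for illustration, and 192.0.2.10 is a documentation address. It contrasts a resolver that acts on whichever response arrives first with one that waits for both families.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Response:
    """One DNS response as seen by the resolver (hypothetical type)."""
    qtype: str             # "A" or "AAAA"
    arrival_ms: float      # time at which the response arrived
    addresses: List[str]   # empty list models NOERROR with an empty answer section


def first_response_wins(responses: List[Response]) -> List[str]:
    """Hypothesised faulty behaviour: only the earliest response is used."""
    first = min(responses, key=lambda r: r.arrival_ms)
    return first.addresses


def wait_for_all(responses: List[Response]) -> List[str]:
    """Expected behaviour: merge the addresses from both responses."""
    merged: List[str] = []
    for response in sorted(responses, key=lambda r: r.arrival_ms):
        merged.extend(response.addresses)
    return merged


# Cached empty AAAA answer arrives first, the A answer arrives after a cache miss.
after_ttl_expiry = [
    Response("AAAA", arrival_ms=1.0, addresses=[]),
    Response("A", arrival_ms=40.0, addresses=["192.0.2.10"]),
]
print(first_response_wins(after_ttl_expiry))  # no address: host looks unavailable
print(wait_for_all(after_ttl_expiry))         # the IPv4 address is still found
```

In this model the first variant only returns an address when the A response happens to arrive first, which matches the toggling we see around TTL expiration.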
Changing the responses to AAAA queries to NXDOMAIN makes the host go down, which is expected per the standard (NXDOMAIN means the name does not exist for any record type). Disabling IPv6 in the kernel (OS or container) has no effect, as libevent's getaddrinfo implementation evidently disregards it. Adjusting the getaddrinfo-allow-skew option (a libevent invention) in /etc/resolv.conf to tolerate longer delays between AAAA and A responses has no effect either.
On top of that, adding simple logging guards to the async_dns_event code (asyncpoller.c) reveals that libevent always delivers only a single evutil_addrinfo in the ai parameter, rather than the two one might expect: one for the AAAA record and one for the A record.
Would it be possible either to have an option to bypass libevent's implementation of getaddrinfo (and use the OS implementation), or to use evdns_base_resolve_ipv4/ipv6, similarly to how it is done for reverse lookups (as seen in the async_event implementation in the same source file)?
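As an illustration of the behaviour we are asking for, here is a minimal Python sketch that uses the OS resolver and queries each address family separately, so that an empty or failed lookup for one family cannot mask a valid answer for the other. It is not Zabbix code and says nothing about how this would be done with libevent; resolve_family and resolve_host are hypothetical names, and 10050 is simply the default Zabbix agent port.

```python
import socket
from typing import List


def resolve_family(host: str, port: int, family: int) -> List[str]:
    """Resolve one address family with the OS resolver.

    A resolver error or an empty answer both yield an empty list, so the
    caller can treat "no addresses for this family" as a normal outcome.
    """
    try:
        infos = socket.getaddrinfo(host, port, family=family,
                                   type=socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    addresses: List[str] = []
    for info in infos:
        address = info[4][0]  # sockaddr tuple starts with the address string
        if address not in addresses:
            addresses.append(address)
    return addresses


def resolve_host(host: str, port: int = 10050) -> List[str]:
    """Collect IPv4 and IPv6 addresses independently, IPv4 first."""
    ipv4 = resolve_family(host, port, socket.AF_INET)
    ipv6 = resolve_family(host, port, socket.AF_INET6)
    return ipv4 + ipv6


if __name__ == "__main__":
    # A numeric IPv4 address has no IPv6 form, so only the IPv4 lookup succeeds.
    print(resolve_host("127.0.0.1"))
```

The host is considered unresolvable only when both lookups come back empty, regardless of the order in which the answers arrive.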
- duplicates
-
ZBX-25899 Monitoring issues with IPv6 when IPv4 is not available
-
- Closed
-
- is duplicated by
-
ZBX-24572 Host agent dont work with DNS name settings
-
- Open
-
- part of
-
ZBXNEXT-1002 dns caching by zabbix daemons
-
- Open
-
- related to
-
ZBXNEXT-1275 Use c-ares for DNS resolving
-
- Open
-
-
ZBX-24572 Host agent dont work with DNS name settings
-
- Open
-