ZBX-26014: Zabbix passive agent checks fail randomly due to DNS resolution issues (libevent)


    • Type: Problem report
    • Resolution: Unresolved
    • Priority: Trivial
    • Fix Version/s: 7.2.5rc1, 7.4.0alpha2 (master)
    • Affects Version/s: 7.0.9, 7.2.2, 7.2.3
    • Component/s: Proxy (P), Server (S)
    • Labels: None
    • Sprint: S25-W8/9
    • Story Points: 3

      Steps to reproduce:

      Perform a clean installation of 7.2.x in any form (RPM packages or a container).

      Result:
      Hosts with passive agent checks go up and down randomly and do not collect items as expected.
      Expected:
      Stable host status and reliable collection of passive agent items.

       

      We attempted to upgrade from 6.4 to 7.2 but had to roll back due to this unfortunate experience. We run sites with ~2K hosts monitored this way.

      We were also able to narrow down what we believe to be the root cause of the problem:

      The problem shows up for both passive agent checks and SNMP checks; for SNMP checks, however, it resolves itself after a while, as those devices are usually added with static IPv4 assignments.

      For DHCP-allocated hosts the problem is very real, and we could not find a workaround either on the DNS side or by tweaking the OS/container setup.

      On the DNS side we observe A and AAAA queries sent in parallel (which is normal) and responses coming back in no particular order. On the Zabbix side we see loads of log entries like this:
      Zabbix agent item "system.cpu.util" on host "****" failed: first network error, wait for 15 seconds
      Querying is done over UDP, so we have no control over the order in which responses are delivered and processed on the Zabbix side.

      Using tcpdump we determined that the failure happens mostly when AAAA records are delivered before A records. This is where DHCP helped isolate the cause: AAAA responses arrive instantly because they are cached with long TTLs (the hosts have no IPv6 addresses), while IPv4 addresses take longer to resolve once the cached entry expires due to TTL expiration. At exactly that moment it is clearly visible how hosts toggle from online to unavailable. Otherwise the failures are fairly random, depending on which replies are delivered first.

      Rewriting AAAA responses to NXDOMAIN predictably takes the host down, in line with the standard: NXDOMAIN declares the whole name nonexistent, not just the AAAA record (the correct "no IPv6 address" answer is NOERROR with an empty answer section). Disabling IPv6 in the OS/container kernel has no effect, as libevent's implementation of getaddrinfo clearly disregards it. Adjusting the getaddrinfo-allow-skew option (libevent's invention) in /etc/resolv.conf to tolerate bigger delays between the AAAA and A responses has no effect either.
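
      As an illustration of where that decision is actually made (a sketch of ours, not Zabbix code; start_resolve is a hypothetical wrapper): libevent's evdns_getaddrinfo() appears to choose which queries to send based on the hints supplied by the caller rather than on kernel IPv6 state, which would explain why the OS-level toggles above change nothing:

#include <string.h>
#include <sys/socket.h>
#include <event2/dns.h>
#include <event2/util.h>

/* hypothetical wrapper, not Zabbix code: with AF_UNSPEC in the hints,
 * libevent issues A and AAAA queries in parallel; narrowing ai_family
 * to AF_INET would suppress the AAAA query altogether */
static void start_resolve(struct evdns_base *dns, const char *host,
		evdns_getaddrinfo_cb cb, void *arg)
{
	struct evutil_addrinfo	hints;

	memset(&hints, 0, sizeof(hints));
	hints.ai_family = AF_INET;	/* AF_UNSPEC -> parallel A + AAAA */
	hints.ai_socktype = SOCK_STREAM;

	(void)evdns_getaddrinfo(dns, host, NULL, &hints, cb, arg);
}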

      On top of that, inspecting the async_dns_event code (asyncpoller.c) and adding simple logging guards reveals that libevent always delivers only a single evutil_addrinfo in the ai parameter, instead of the potentially expected two: one for the AAAA record and another for the A record.
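
      For reference, a simplified sketch of such a logging guard (ours; the actual patches are attached to this issue): it walks the evutil_addrinfo chain handed to the evdns_getaddrinfo() callback and prints every entry, which is how the single-entry behaviour shows up:

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <event2/dns.h>
#include <event2/util.h>

/* sketch of a logging guard for the evdns_getaddrinfo() callback: walk the
 * returned chain and print every entry; per the observation above the chain
 * always has exactly one entry instead of the expected two (AAAA + A) */
static void log_addrinfo_chain(int result, struct evutil_addrinfo *ai, void *arg)
{
	int	n = 0;
	char	buf[INET6_ADDRSTRLEN];

	for (struct evutil_addrinfo *p = ai; NULL != p; p = p->ai_next)
	{
		/* assumes entries are AF_INET or AF_INET6, which is all
		 * evdns_getaddrinfo() produces */
		const void	*addr = AF_INET == p->ai_family
				? (const void *)&((struct sockaddr_in *)p->ai_addr)->sin_addr
				: (const void *)&((struct sockaddr_in6 *)p->ai_addr)->sin6_addr;

		evutil_inet_ntop(p->ai_family, addr, buf, sizeof(buf));
		printf("entry %d: family %d addr %s\n", ++n, p->ai_family, buf);
	}

	printf("result %d, %d entries in chain\n", result, n);

	if (NULL != ai)
		evutil_freeaddrinfo(ai);
	(void)arg;
}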

       

      Would it be possible either to have an option to avoid libevent's implementation of getaddrinfo (and use the OS implementation instead), or to implement resolution using evdns_base_resolve_ipv4/evdns_base_resolve_ipv6 in a manner similar to how the reverse lookup is done (as seen in the async_event implementation in the same source file)?
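
      To make the second option concrete, here is a rough sketch of ours (names such as dns_lookup_t are hypothetical, error handling omitted): both queries are issued explicitly via evdns_base_resolve_ipv4()/evdns_base_resolve_ipv6(), and the outcome is decided only once both callbacks have fired, so the AAAA-before-A delivery order can no longer matter:

#include <stdio.h>
#include <string.h>
#include <netinet/in.h>
#include <event2/dns.h>

/* hypothetical per-lookup state; names are ours, not Zabbix's */
typedef struct
{
	int		pending;	/* outstanding queries: 2 -> 0 */
	int		have_a;
	int		have_aaaa;
	struct in_addr	a;		/* first A answer, if any */
	struct in6_addr	aaaa;		/* first AAAA answer, if any */
}
dns_lookup_t;

/* one callback serves both record types, mirroring how the reverse lookup
 * in async_event already uses the evdns_base_resolve_* family */
static void resolve_cb(int result, char type, int count, int ttl, void *addresses, void *arg)
{
	dns_lookup_t	*lookup = (dns_lookup_t *)arg;

	if (DNS_ERR_NONE == result && 0 < count)
	{
		if (DNS_IPv4_A == type)
		{
			memcpy(&lookup->a, addresses, sizeof(lookup->a));
			lookup->have_a = 1;
		}
		else if (DNS_IPv6_AAAA == type)
		{
			memcpy(&lookup->aaaa, addresses, sizeof(lookup->aaaa));
			lookup->have_aaaa = 1;
		}
	}

	/* decide only after both callbacks have fired, so the order in which
	 * the A and AAAA responses arrive can no longer fail the check */
	if (0 == --lookup->pending)
		printf("lookup done: have_a=%d have_aaaa=%d\n", lookup->have_a, lookup->have_aaaa);

	(void)ttl;
}

static void start_lookup(struct evdns_base *dns, const char *host, dns_lookup_t *lookup)
{
	memset(lookup, 0, sizeof(*lookup));
	lookup->pending = 2;
	(void)evdns_base_resolve_ipv4(dns, host, 0, resolve_cb, lookup);
	(void)evdns_base_resolve_ipv6(dns, host, 0, resolve_cb, lookup);
}

      Since evdns invokes the callback for timeouts and other errors as well, the pending counter always reaches zero and the decision is made exactly once per lookup.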

      Attachments:

        1. rearm_timer.diff (3 kB)
        2. samplelog1.log (332 kB)
        3. samplelog2.log (566 kB)
        4. ZBX-26014-7.3.diff (1 kB)
        5. ZBX-26014-libevent-logging.diff (0.8 kB)
        6. ZBX-26014-libevent-logging-dns-only.diff (0.9 kB)

            Assignee: Vladislavs Sokurenko
            Reporter: Mike
            Team: Team A
            Votes: 0
            Watchers: 8
