[ZBXNEXT-1002] dns caching by zabbix daemons Created: 2011 Oct 17  Updated: 2024 Apr 19

Status: Open
Project: ZABBIX FEATURE REQUESTS
Component/s: Agent (G)
Affects Version/s: 1.8.8, 2.0.0
Fix Version/s: None

Type: Change Request Priority: Minor
Reporter: Miguel Di Ciurcio Filho Assignee: Unassigned
Resolution: Unresolved Votes: 22
Labels: cache, dns, gethostbyname, ipv6, performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 10.04


Issue Links:
Duplicate
is duplicated by ZBX-17396 Non-existing DNS entries in Server= i... Closed

 Description   

I have configured agentd to connect to the Zabbix server by using a hostname instead of an IP. It seems to me that gethostbyname() is called for every item queried.

192.168.1.15 - DNS server in /etc/resolv.conf
192.168.1.13 - Host where the agent is running
zabbix.elabsis.com - Zabbix server

tcpdump sample from 192.168.1.13:

17:12:15.075198 IP 192.168.1.15.53 > 192.168.1.13.46016: 25749* 1/2/2 A[|domain]
17:12:15.083781 IP 192.168.1.13.45727 > 192.168.1.15.53: 14928+ A? zabbix.elabsis.com. (36)
17:12:15.084509 IP 192.168.1.15.53 > 192.168.1.13.45727: 14928* 1/2/2 A[|domain]
17:12:15.153713 IP 192.168.1.13.42149 > 192.168.1.15.53: 8538+ A? zabbix.elabsis.com. (36)
17:12:15.154379 IP 192.168.1.15.53 > 192.168.1.13.42149: 8538* 1/2/2 A[|domain]
17:12:15.172920 IP 192.168.1.13.47295 > 192.168.1.15.53: 19772+ A? zabbix.elabsis.com. (36)
17:12:15.173582 IP 192.168.1.15.53 > 192.168.1.13.47295: 19772* 1/2/2 A[|domain]
17:12:16.371257 IP 192.168.1.13.33089 > 192.168.1.15.53: 48187+ A? zabbix.elabsis.com. (36)
17:12:16.371979 IP 192.168.1.15.53 > 192.168.1.13.33089: 48187* 1/2/2 A[|domain]
17:12:16.374865 IP 192.168.1.13.36497 > 192.168.1.15.53: 9540+ A? zabbix.elabsis.com. (36)
17:12:16.375738 IP 192.168.1.15.53 > 192.168.1.13.36497: 9540* 1/2/2 A[|domain]
17:12:16.434436 IP 192.168.1.13.43697 > 192.168.1.15.53: 12163+ A? zabbix.elabsis.com. (36)
17:12:16.435174 IP 192.168.1.15.53 > 192.168.1.13.43697: 12163* 1/2/2 A[|domain]
17:12:17.662556 IP 192.168.1.13.43674 > 192.168.1.15.53: 50570+ A? zabbix.elabsis.com. (36)
17:12:17.663198 IP 192.168.1.15.53 > 192.168.1.13.43674: 50570* 1/2/2 A[|domain]
17:12:17.684247 IP 192.168.1.13.42368 > 192.168.1.15.53: 9102+ A? zabbix.elabsis.com. (36)
17:12:17.684994 IP 192.168.1.15.53 > 192.168.1.13.42368: 9102* 1/2/2 A[|domain]
17:12:17.698499 IP 192.168.1.13.60401 > 192.168.1.15.53: 960+ A? zabbix.elabsis.com. (36)
17:12:17.699178 IP 192.168.1.15.53 > 192.168.1.13.60401: 960* 1/2/2 A[|domain]
17:12:17.727855 IP 192.168.1.13.51960 > 192.168.1.15.53: 8929+ A? zabbix.elabsis.com. (36)
17:12:17.728542 IP 192.168.1.15.53 > 192.168.1.13.51960: 8929* 1/2/2 A[|domain]
17:12:17.751009 IP 192.168.1.13.38707 > 192.168.1.15.53: 10705+ A? zabbix.elabsis.com. (36)
17:12:17.751738 IP 192.168.1.15.53 > 192.168.1.13.38707: 10705* 1/2/2 A[|domain]
17:12:18.916101 IP 192.168.1.13.38377 > 192.168.1.15.53: 33673+ A? zabbix.elabsis.com. (36)
17:12:18.916865 IP 192.168.1.15.53 > 192.168.1.13.38377: 33673* 1/2/2 A[|domain]
17:12:19.976746 IP 192.168.1.13.52731 > 192.168.1.15.53: 37538+ A? zabbix.elabsis.com. (36)
17:12:19.977385 IP 192.168.1.15.53 > 192.168.1.13.52731: 37538* 1/2/2 A[|domain]
17:12:20.007473 IP 192.168.1.13.49112 > 192.168.1.15.53: 1608+ A? zabbix.elabsis.com. (36)
17:12:20.008169 IP 192.168.1.15.53 > 192.168.1.13.49112: 1608* 1/2/2 A[|domain]
17:12:20.054203 IP 192.168.1.13.56993 > 192.168.1.15.53: 36012+ A? zabbix.elabsis.com. (36)
17:12:20.054905 IP 192.168.1.15.53 > 192.168.1.13.56993: 36012* 1/2/2 A[|domain]
17:12:20.109898 IP 192.168.1.13.54451 > 192.168.1.15.53: 13610+ A? zabbix.elabsis.com. (36)
17:12:20.110585 IP 192.168.1.15.53 > 192.168.1.13.54451: 13610* 1/2/2 A[|domain]
17:12:20.194716 IP 192.168.1.13.34243 > 192.168.1.15.53: 14421+ A? zabbix.elabsis.com. (36)
17:12:20.195350 IP 192.168.1.15.53 > 192.168.1.13.34243: 14421* 1/2/2 A[|domain]
17:12:20.198975 IP 192.168.1.13.49746 > 192.168.1.15.53: 37948+ A? zabbix.elabsis.com. (36)
17:12:20.199711 IP 192.168.1.15.53 > 192.168.1.13.49746: 37948* 1/2/2 A[|domain]
17:12:20.271991 IP 192.168.1.13.43933 > 192.168.1.15.53: 60673+ A? zabbix.elabsis.com. (36)
17:12:20.272650 IP 192.168.1.15.53 > 192.168.1.13.43933: 60673* 1/2/2 A[|domain]

After configuring a dozen hosts with the server's hostname, my internal DNS server is being hammered by the same requests over and over.

Using the IP address of the Zabbix server does not create any DNS traffic.



 Comments   
Comment by richlv [ 2011 Oct 18 ]

Indeed, Zabbix daemons don't do any DNS caching, so using IP addresses is suggested. Alternatively, nscd or a similar name service caching daemon could be used to reduce the number of remote DNS queries.

Comment by Miguel Di Ciurcio Filho [ 2011 Oct 18 ]

Just a note in case someone tries to use nscd on Debian or Ubuntu: host lookups are not cached by default.

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=335476

http://sourceware.org/bugzilla/show_bug.cgi?id=4428

You must edit /etc/nscd.conf to enable it, but it is not that reliable.
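For reference, these are the /etc/nscd.conf directives involved. The keywords are standard nscd.conf ones, but the TTL values below (in seconds) are illustrative assumptions, not tested recommendations:

```
# Enable caching of host lookups (off by default on the Debian/Ubuntu releases mentioned above)
enable-cache            hosts   yes
# Seconds a successful lookup stays cached (illustrative value)
positive-time-to-live   hosts   300
# Seconds a failed lookup stays cached (illustrative value)
negative-time-to-live   hosts   20
```

nscd has to be restarted after the change; `nscd -i hosts` invalidates the hosts cache if stale entries cause trouble.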

Comment by Miguel Di Ciurcio Filho [ 2011 Oct 18 ]

I have just noticed that zabbix_server has the same issue. When connecting to hosts by DNS name, there is a name resolution call for every item queried.

Comment by Cisco Vila [ 2013 Feb 01 ]

Have there been any solutions to this? We are experiencing the same issues in Enterprise installations.

Comment by Marc [ 2013 Feb 07 ]

As mentioned before, use nscd.
Name service caching is a task of the operating system, or more precisely of services like nscd. Alternatively, one could use a local DNS server for caching.

Caching domain names has no place within ZABBIX applications.

Comment by Oleksii Zagorskyi [ 2013 Mar 29 ]

I've just checked the Zabbix agent (2.0.5) under Windows XP x32 and Windows 7 x64, capturing packets with Wireshark.
As expected, there are no problems: Windows caches resolved DNS names by default.
This holds both for requesting the list of active checks (ServerActive=DNSname) and for incoming passive TCP connections to the agent (Server=DNSname).

There are useful built-in Windows commands, which are self-descriptive:
> ipconfig /displaydns
> ipconfig /flushdns

Just a note: some third-party software can (optionally?) disable the caching.
I know of at least one: Kaspersky antivirus offers an option to disable it during installation.

Comment by Oleksii Zagorskyi [ 2013 Mar 30 ]

I've performed some additional experiments under Debian 6.0.7
Linux it0 3.2.0-3-amd64 #1 SMP Mon Jul 23 02:45:17 UTC 2012 x86_64 GNU/Linux
The agent is ~2.0.5, compiled with IPv6 support. I'm using only IPv4 addresses in the experiments.

Every time the agent needs to know the Server's IP, the agent host sends two DNS queries simultaneously, "Type: A (Host address)" and "Type: AAAA (IPv6 address)", and then receives the two corresponding DNS answers. (ZBX-4252 also mentions this, with some details.)

For active checks: with the default BufferSend of 5 seconds, the agent queries the DNS server 12 times per minute, plus one query for the list of active checks.
This is true regardless of the number of monitored items and their update intervals (I hope you don't overflow the agent's BufferSize here).
So it's not so bad.

But it is more critical for passive checks: for every incoming TCP connection from the Zabbix server, the agent performs a separate query to the DNS server.
So the number of queries from each agent host to the DNS server over a period is roughly agent_items * (period / update_interval).
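To put numbers on this, here is a rough estimate as a small C sketch. It assumes one lookup per incoming passive connection and one lookup per BufferSend interval for active checks, doubled when both A and AAAA queries are sent (IPv6-enabled build). The function names are hypothetical, and the periodic refresh of the active check list is ignored.

```c
#include <stdbool.h>

/* Active checks: one ServerActive lookup per data send, i.e. every
 * BufferSend seconds (active-check list refreshes are not counted). */
static double active_dns_queries_per_minute(unsigned buffer_send_s, bool ipv6_build)
{
    if (buffer_send_s == 0)
        return 0.0; /* nonsensical input; avoid division by zero */
    double lookups = 60.0 / (double)buffer_send_s;
    /* An IPv6-enabled build sends an A and an AAAA query per lookup. */
    return ipv6_build ? 2.0 * lookups : lookups;
}

/* Passive checks: one Server lookup per incoming connection, assumed to be
 * one connection per item poll. */
static double passive_dns_queries_per_minute(unsigned items, unsigned interval_s,
                                             bool ipv6_build)
{
    if (interval_s == 0)
        return 0.0; /* nonsensical input; avoid division by zero */
    double lookups = (double)items * (60.0 / (double)interval_s);
    return ipv6_build ? 2.0 * lookups : lookups;
}
```

For example, 100 passive items polled every 60 seconds give about 100 lookups per minute from one host, or 200 DNS queries per minute with an IPv6-enabled agent, while active checks stay at 12 (or 24) per minute no matter how many items there are.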

Comment by Oleksii Zagorskyi [ 2013 Mar 30 ]

ONLY about agent:

What if the agent cached only the resolved IPs of the "Server" and "ServerActive" parameters and refreshed them with some "hardcoded|configured|TTL from DNS response" period?

That would be maybe 1-5 values for the agent to handle (remember that multiple servers are supported for active checks), and it would resolve the current issue.
It should not be so hard to implement, IMO.

I don't see much sense in caching any other IPs possibly used in items.
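A minimal sketch of that idea in C, assuming a fixed (hardcoded or configured) refresh period rather than the TTL from the DNS response, which getaddrinfo() does not expose. The `dns_cache_entry` type and both functions are hypothetical, not Zabbix code; the current time is passed in explicitly so the expiry logic is easy to test.

```c
#define _POSIX_C_SOURCE 200112L /* for getaddrinfo() under strict C modes */
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>

#define MAX_CACHED_ADDRS 8

/* Hypothetical cache entry for one Server/ServerActive hostname. */
typedef struct {
    char host[256];
    struct sockaddr_storage addrs[MAX_CACHED_ADDRS];
    socklen_t addr_lens[MAX_CACHED_ADDRS];
    int count;        /* number of valid entries in addrs[] */
    time_t expires;   /* 0 = never resolved yet */
    int refresh_s;    /* hardcoded or configured refresh period, seconds */
} dns_cache_entry;

/* Re-resolve e->host. On failure the previous addresses are kept. In both
 * cases the expiry is pushed forward, so a failing DNS server is not asked
 * again on every connection. */
static void dns_cache_refresh(dns_cache_entry *e, time_t now)
{
    struct addrinfo hints, *res, *ai;
    int n = 0;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;     /* IPv4 and IPv6 */
    hints.ai_socktype = SOCK_STREAM; /* one result per address, not per socket type */

    if (getaddrinfo(e->host, NULL, &hints, &res) == 0) {
        for (ai = res; ai != NULL && n < MAX_CACHED_ADDRS; ai = ai->ai_next) {
            memcpy(&e->addrs[n], ai->ai_addr, ai->ai_addrlen);
            e->addr_lens[n] = ai->ai_addrlen;
            n++;
        }
        freeaddrinfo(res);
        e->count = n;
    }
    e->expires = now + e->refresh_s;
}

/* Return the number of cached addresses (or -1 if none), resolving only
 * when the entry has never been filled or has expired. */
static int dns_cache_get(dns_cache_entry *e, time_t now)
{
    if (e->expires == 0 || now >= e->expires)
        dns_cache_refresh(e, now);
    return e->count > 0 ? e->count : -1;
}
```

Keeping the old addresses when a refresh fails, while still moving the expiry forward, means a DNS outage neither breaks the agent nor turns into a query on every connection.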

Comment by Oleksii Zagorskyi [ 2013 Mar 30 ]

I've just tested agent on a FreeBSD 8.1 x32 host.
The agent is also compiled with IPv6 support.

Behavior is the same: a DNS query for every incoming TCP connection from the server.
But the query order differs a bit from Linux: here the IPv4 query and response come first, then the IPv6 ones:

01:50:16.930705 IP 10.20.0.20.20708 > 10.20.0.10.53: 21340+ A? mon2 (33)
01:50:16.930886 IP 10.20.0.10.53 > 10.20.0.20.20708: 21340* 1/0/0 A 10.20.0.32 (49)
01:50:16.930929 IP 10.20.0.20.13973 > 10.20.0.10.53: 21341+ AAAA? mon2 (33)
01:50:16.931051 IP 10.20.0.10.53 > 10.20.0.20.13973: 21341* 0/1/0 (113)

I don't think I need to test something else.

Comment by Oleksii Zagorskyi [ 2013 Mar 30 ]

Just an additional small test on Debian.
Zabbix server v2.0.5, compiled with and without IPv6 support, for a single passive item check where the host is monitored by DNS name:
with IPv6 - performs two queries to DNS;
without IPv6 - performs one query to DNS.

Comment by Oleksii Zagorskyi [ 2013 Mar 30 ]

Heh, one last comment about the Zabbix sources.

There is a difference: with IPv6, Zabbix daemons use the "getaddrinfo()" resolver function, but without IPv6 they use another one, "gethostbyname()". (Both are C library functions rather than system calls.)
See /src/libs/zbxcomms/comms.c lines 282 and 348

This is also mentioned in ZBX-6326
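A simplified sketch of the two code paths, for illustration only. `HAVE_IPV6` is assumed here to be the build-time switch and `resolve_count()` is a hypothetical helper; the actual comms.c code differs. It shows why the IPv6-enabled build sends both A and AAAA queries: getaddrinfo() with AF_UNSPEC asks for both address families, while gethostbyname() only returns IPv4 addresses.

```c
#define _DEFAULT_SOURCE /* expose getaddrinfo() and gethostbyname() on glibc */
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: count the addresses a name resolves to. */
static int resolve_count(const char *host)
{
#ifdef HAVE_IPV6
    /* AF_UNSPEC requests both IPv4 and IPv6 results, hence A + AAAA. */
    struct addrinfo hints, *res, *ai;
    int n = 0;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM; /* avoid one entry per socket type */
    if (getaddrinfo(host, NULL, &hints, &res) != 0)
        return -1;
    for (ai = res; ai != NULL; ai = ai->ai_next)
        n++;
    freeaddrinfo(res);
    return n;
#else
    /* gethostbyname() returns IPv4 (A) results only. */
    struct hostent *he = gethostbyname(host);
    int n = 0;

    if (he == NULL)
        return -1;
    while (he->h_addr_list[n] != NULL)
        n++;
    return n;
#endif
}
```

Neither path caches anything itself, so unless the OS or an nscd-like service caches, every call becomes DNS traffic.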

Comment by richlv [ 2014 Jan 14 ]

If we ever look into implementing this, it should be very well documented, and there must be a way to drop this internal DNS cache (like we can reload the config for the server). Lately I've hit some other software doing its own DNS caching, and that can be mighty confusing.

Comment by Volker Fröhlich [ 2014 Sep 30 ]

This issue is somewhat connected to ZBXNEXT-1862. One thing I find worth mentioning is that different caching solutions offer different feature sets. I remember comparing nscd, dnsmasq and bind at some point. I think I remember that one would not cache PTR records and another would not cache MX records, which can be relevant on the Zabbix server machine. My opinion matches Marc's: I don't consider it a Zabbix issue.

Comment by MightyDok [ 2015 Aug 07 ]

The Russian hosting provider RTCOMM has DNS rate-limit settings for domains, so with the standard Linux template the Zabbix server fails to query the agent with the error "ZBX_TCP_READ() failed: [4] Interrupted system call". I think we need this feature ASAP.

Comment by Dmitry [ 2017 Apr 18 ]

Lab:

  • LAN
  • ~100 agents
  • 2 DNS servers (master-slave)
  • 1 zabbix-server 3.0.2 + mysql

Works for me:
1. Zabbix server-related overload: solved with nscd
2. Zabbix client-related overload: solved by:

  • setting short hostname in Server= and ServerActive= (like Server=zabbix instead of Server=zabbix.mydomain)
  • ensure that /etc/resolv.conf has the full domain first in the 'search' parameter (like 'mydomain.myprovider myprovider', not 'myprovider mydomain.myprovider')
  • add 'options rotate' to resolv.conf

The agent config plus the correct 'search' parameter make the first Zabbix agent lookup succeed, so it doesn't try other suffixes (only 2 requests per connection).
'options rotate' balances the load between the master and slave DNS servers (it may not affect the Zabbix agent, but it balances other requests).
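For illustration, a resolv.conf following the steps above. The search domains come from the example in this comment; the nameserver addresses are placeholders from the documentation range. Note that the resolv.conf keyword is `options` (plural). With Server=zabbix in the agent config, the short name has fewer dots than the default ndots threshold, so the resolver tries zabbix.mydomain.myprovider first, and a hit there keeps each lookup to one A plus one AAAA query.

```
# /etc/resolv.conf sketch (nameserver addresses are placeholders)
search mydomain.myprovider myprovider
nameserver 192.0.2.1
nameserver 192.0.2.2
options rotate
```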

Another way: configure nscd on each agent host. But nscd may cause mismatches between the real network and the nscd cache.
On the Zabbix server, nscd's negative cache with a 20-minute TTL increases the load on the unreachable pollers.

Comment by amg1127 [ 2017 Oct 21 ]

I agree with caching the IP addresses of hostnames supplied in the "Server=" and "ServerActive=" parameters (Oleksiy Zagorskyi's suggestion above). For example:

  • When Zabbix agent is initializing and parsing its configuration file, it could resolve hostnames set by the administrator and keep IP addresses recorded in memory until the agent is killed by a TERM, HUP or USR1 signal.
  • When Zabbix agent accepts a connection request from a server or proxy that is willing to perform a passive check, it could match the client address to one of the cached IP addresses.
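The matching step in the second point could look roughly like this C sketch. `peer_is_allowed()` is a hypothetical name, not Zabbix code. It compares only the address (the peer's source port is ephemeral) and, as a simplification, does not handle IPv4-mapped IPv6 peers (::ffff:a.b.c.d) that appear on dual-stack listening sockets.

```c
#define _POSIX_C_SOURCE 200112L
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Return 1 if the connecting peer's address equals one of the cached
 * addresses (resolved once, e.g. from Server= at startup), else 0. */
static int peer_is_allowed(const struct sockaddr *peer,
                           const struct sockaddr_storage *cached, int count)
{
    for (int i = 0; i < count; i++) {
        const struct sockaddr *c = (const struct sockaddr *)&cached[i];

        if (c->sa_family != peer->sa_family)
            continue; /* IPv4 never matches IPv6 here (no mapped-address handling) */
        if (peer->sa_family == AF_INET) {
            const struct sockaddr_in *a = (const struct sockaddr_in *)peer;
            const struct sockaddr_in *b = (const struct sockaddr_in *)c;
            if (a->sin_addr.s_addr == b->sin_addr.s_addr)
                return 1;
        } else if (peer->sa_family == AF_INET6) {
            const struct sockaddr_in6 *a = (const struct sockaddr_in6 *)peer;
            const struct sockaddr_in6 *b = (const struct sockaddr_in6 *)c;
            if (memcmp(&a->sin6_addr, &b->sin6_addr, sizeof(a->sin6_addr)) == 0)
                return 1;
        }
    }
    return 0;
}
```

Because the comparison runs against addresses already in memory, accepting a passive check costs no DNS query at all.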

The network I manage has a Zabbix deployment composed of a Zabbix server, some geographically dispersed Zabbix proxies, more than 2700 monitored hosts and more than 400000 monitored items. Until my team implements mutual authentication in Zabbix, we are relying on hostnames specified in the "Server=" agentd parameter to provide monitoring access control. However, the Zabbix infrastructure is currently consuming a significant amount of resources on our internal DNS servers, because the absence of a cache produces one name resolution query for every passive metric collected.

Comment by Vitaly Zhuravlev [ 2018 Nov 30 ]

Another approach is to use systemd-resolved if you are on systemd

Generated at Fri Apr 19 16:26:49 EEST 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.