[ZBX-16643] Unavailable proxy might lead to excessive DNS requests from agents Created: 2019 Sep 16  Updated: 2019 Oct 07

Status: Confirmed
Project: ZABBIX BUGS AND ISSUES
Component/s: Agent (G)
Affects Version/s: 4.0.12
Fix Version/s: None

Type: Problem report Priority: Trivial
Reporter: Paal Braathen Assignee: DaneT
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

RHEL7.7 agent and proxy


Attachments: File tcpdump.log     File zabbix_agentd.log    

 Description   

Steps to reproduce:

  1. Zabbix with proxy
  2. Active agent, with the proxy referred to by FQDN/DNS name
  3. Observe outgoing DNS requests on agent host
    • E.g. tcpdump -i eth0 -l dst port 53
  4. Reject incoming TCP from agent host on proxy host
    • E.g. iptables -I INPUT 1 -s 1.2.3.4 -j REJECT

Result:

Agent will go into a failed state with the log:

active check data upload to [proxy.example.com:10051] started to fail ([connect] cannot connect to [[proxy.example.com]:10051]: [111] Connection refused)

Observe that the agent host sends an A and an AAAA DNS request for the proxy every second (at least it looks that way to me; I assume the agent tries to reconnect every second).

If you happen to get into a state where this happens on every agent, you might be dealing with thousands or tens of thousands of DNS requests each second. If every agent host uses the same DNS server(s), this could amount to a DDoS of that DNS server. This is of course setup dependent.
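To put rough numbers on that claim: each retry produces two lookups (A plus AAAA), so the load on a shared resolver scales linearly with fleet size. A back-of-the-envelope sketch (the fleet size below is an assumed figure, purely for illustration):

```python
agents = 10_000          # assumed fleet size, for illustration only
queries_per_retry = 2    # one A plus one AAAA lookup per reconnect attempt
retry_interval_s = 1     # retry cadence observed in the agent log

# Extra queries per second arriving at the shared resolver while all
# agents are in the failure state at once.
resolver_qps = agents * queries_per_retry / retry_interval_s
print(resolver_qps)  # 20000.0
```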

Expected:

The agent should somehow throttle its reconnection attempts, or apply some kind of cooldown when it enters this failure state. Perhaps this "reconnection throttle" could be exposed in the agent configuration (with a default much higher than one second).
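The kind of throttle the report asks for is commonly implemented as capped exponential backoff with jitter. A minimal sketch of the idea, as one possible design, not anything the Zabbix agent actually implements; the function name and parameters are hypothetical:

```python
import random

def next_retry_delay(attempt, base=1.0, cap=300.0):
    """Capped exponential backoff with full jitter for reconnect attempts.

    attempt: number of consecutive failures so far (0-based).
    base:    delay after the first failure, in seconds.
    cap:     upper bound on the delay (e.g. a hypothetical config option).
    """
    delay = min(cap, base * (2 ** attempt))
    # Full jitter spreads retries out so thousands of agents that failed
    # at the same moment do not query the DNS server in lockstep.
    return random.uniform(0, delay)
```

With this scheme the per-agent DNS rate drops geometrically after the first few failures, and the jitter desynchronizes agents that entered the failure state simultaneously.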

Additional:

This happens when the proxy host actively refuses the connection. Stopping the proxy service may lead to the same behavior as rejecting the connection in the firewall: as far as I know, both RHEL and Ubuntu actively refuse a connection if the port is open in the firewall but no service is listening on it. Other OSes/distros might behave differently.



 Comments   
Comment by DaneT [ 2019 Sep 17 ]

Hello pmbraat
I am sorry, I could not confirm such spamming. Maybe you have very active items which check the host every second?

Comment by Paal Braathen [ 2019 Sep 17 ]

What's your setup? In my production environment we use RHEL(7.7). I've been able to reproduce it in a test environment with CentOS 7.

This isn't related to items. I'm not sure if you even need a single item linked to the host to reproduce this.

I guess I could try to provide an ansible playbook or something that could be used as a full POC, but it'll take me a while.

PS. I can't mention other users apparently..

Comment by Vladislavs Sokurenko [ 2019 Sep 18 ]

I confirm that the agent will retry sending collected values every second if it has failed to send, and there is currently no way to increase this delay.

Example log

820862:20190918:142449.897 In send_buffer() host:'127.0.0.1' port:10051 entries:100/100
820862:20190918:142449.898 send value error: [connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused
820862:20190918:142449.898 End of send_buffer():FAIL
820862:20190918:142450.898 In send_buffer() host:'127.0.0.1' port:10051 entries:100/100
820862:20190918:142450.898 send value error: [connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused
Comment by DaneT [ 2019 Sep 18 ]

pmbraat
Currently the timeout/delay for those requests is not configurable. Unfortunately it is not that simple: a case which works for you might not be useful for another user.
You can create a new feature request and ask for this feature, but this issue cannot be considered a bug.

Comment by Glebs Ivanovskis [ 2019 Sep 19 ]

PS. I can't mention other users apparently..

Dear pmbraat, you can. It simply takes a bit more effort. First of all, you need to use the [~username] syntax instead of @. Secondly, there will be no auto-complete, so you need to know the username. Finally, you can find the username in the link to the user profile (simply hover the mouse over someone's name in JIRA).

Comment by Paal Braathen [ 2019 Oct 02 ]

Hi and sorry for the late reply.

It looks like the agent might have to be configured with some item after all.

I guess I should have provided some more log in the first entry. Here is a zabbix agent log (debug level 5) and a tcpdump from the same timespan.

zabbix_agentd.log
tcpdump.log

In the log the failure state starts at 11:32:00 and ends at 11:32:13. In this timespan you can see that many more DNS requests are going out. The tcpdump log covers the entire time the agent was running (the capture was started before and stopped after the agent).

My problem, and my point, is the excessive DNS requests while the agent is in a failure state. If you have a very large number of agents doing this at once you might experience problems (I did). I haven't experienced any problems with the number of TCP reconnects themselves, only with the DNS requests.

PS.
This time I configured a single server with zabbix-server and zabbix-agent (no proxy). The server of course needs to be referred to by DNS name to see these DNS requests. To provoke the failure state I again used a (local port 10051) REJECT in iptables.

PPS.
cyclone Thank you. I guess I can mention

Comment by bunkzilla [ 2019 Oct 07 ]

I found I had very strange behavior at times with my Zabbix proxies, and with the Zabbix server when not running proxies, if I wasn't using nscd. You may also want to look into implementing this.

 

Generated at Sat Jun 14 17:05:47 EEST 2025 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.