[ZBX-16643] Unavailable proxy might lead to excessive DNS requests from agents
Created: 2019 Sep 16  Updated: 2019 Oct 07

Status:              Confirmed
Project:             ZABBIX BUGS AND ISSUES
Component/s:         Agent (G)
Affects Version/s:   4.0.12
Fix Version/s:       None
Type:                Problem report
Priority:            Trivial
Reporter:            Paal Braathen
Assignee:            DaneT
Resolution:          Unresolved
Votes:               0
Labels:              None
Remaining Estimate:  Not Specified
Time Spent:          Not Specified
Original Estimate:   Not Specified
Environment:         RHEL7.7 agent and proxy
Attachments:
Description
Steps to reproduce:
Result: The agent goes into a failed state with the log:

active check data upload to [proxy.example.com:10051] started to fail ([connect] cannot connect to [[proxy.example.com]:10051]: [111] Connection refused)

Observe that the agent host sends an A and an AAAA DNS request for the proxy every second (that is how it looks to me; I assume the agent tries to reconnect every second). If you get into a state where this happens for every agent at once, you might be dealing with thousands or tens of thousands of DNS requests per second. If every agent host uses the same DNS server(s), this could amount to a DDoS of the DNS server. This is of course setup dependent.

Expected: The agent should throttle its reconnection attempts, or have some kind of cooldown when entering this failure state. Maybe this "reconnection throttle" could be added to the agent config, with a default much higher than one second.

Additional: This happens when the proxy host actively refuses the connection. Stopping the proxy service can lead to the same result as rejecting the connection in the firewall: as far as I know, both RHEL and Ubuntu actively refuse a connection if the port is open in the firewall but no service is listening on it. Other OSes/distros might behave differently.
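The throttling suggested above could take the form of a capped exponential backoff between reconnection attempts. Below is a minimal sketch in C; the names (next_reconnect_delay, RECONNECT_DELAY_MIN, RECONNECT_DELAY_MAX) are purely illustrative and do not exist in the Zabbix code base:

```c
/* Hypothetical sketch of a capped exponential backoff for the agent's
 * reconnect loop. All names here are illustrative, not Zabbix APIs. */

#define RECONNECT_DELAY_MIN 1   /* seconds; matches current behaviour  */
#define RECONNECT_DELAY_MAX 60  /* hypothetical configurable upper cap */

/* Return the number of seconds to wait before the next attempt:
 * doubles on every consecutive failure, capped at RECONNECT_DELAY_MAX. */
static int next_reconnect_delay(int consecutive_failures)
{
    int delay = RECONNECT_DELAY_MIN;

    while (consecutive_failures-- > 0 && delay < RECONNECT_DELAY_MAX)
        delay *= 2;

    return delay < RECONNECT_DELAY_MAX ? delay : RECONNECT_DELAY_MAX;
}
```

With these numbers a freshly failed agent would wait 1 s, 2 s, 4 s, ... up to a ceiling of 60 s between attempts, instead of retrying (and re-resolving the proxy name) every second.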
Comments
Comment by DaneT [ 2019 Sep 17 ]
Hello pmbraat
Comment by Paal Braathen [ 2019 Sep 17 ]
What's your setup? In my production environment we use RHEL 7.7. I've been able to reproduce it in a test environment with CentOS 7. This isn't related to items; I'm not sure you even need a single item linked to the host to reproduce this. I guess I could provide an Ansible playbook or something that could be used as a full POC, but it will take me a while.

PS. I can't mention other users, apparently.
Comment by Vladislavs Sokurenko [ 2019 Sep 18 ]
I confirm that the agent retries sending collected values every second after a failed send, and there is currently no way to increase this delay. Example log:

820862:20190918:142449.897 In send_buffer() host:'127.0.0.1' port:10051 entries:100/100
820862:20190918:142449.898 send value error: [connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused
820862:20190918:142449.898 End of send_buffer():FAIL
820862:20190918:142450.898 In send_buffer() host:'127.0.0.1' port:10051 entries:100/100
820862:20190918:142450.898 send value error: [connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused
Comment by DaneT [ 2019 Sep 18 ]
pmbraat
Comment by Glebs Ivanovskis [ 2019 Sep 19 ]
Dear pmbraat, you can. It simply takes a bit more effort. First of all, you need to use the [~username] syntax instead of @. Secondly, there is no auto-complete, so you need to know the username. Finally, you can find the username in the link to the user profile (simply hover the mouse over someone's name in JIRA).
Comment by Paal Braathen [ 2019 Oct 02 ]
Hi, and sorry for the late reply. It looks like the agent might have to be configured with some item after all. I guess I should have provided more of the log in the first entry. Here is a Zabbix agent log (debug level 5) and a tcpdump from the same timespan. In the log, the failure state starts at 11:32:00 and ends at 11:32:13. In this timespan you can see that many more DNS requests go out. The tcpdump log covers the whole time the agent was running (the capture was started/stopped before/after the agent).

My problem, and my point, is these excessive requests while the agent is in a failure state. If you have a very large number of agents doing this at once, you might experience problems (I did). I haven't experienced any problem with the number of TCP reconnects myself, only with the DNS requests.

PS.
PPS.
Comment by bunkzilla [ 2019 Oct 07 ]
I found I had very strange behavior at times with my Zabbix proxies and Zabbix server if I wasn't using nscd. You may also want to look into implementing this.
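nscd caches lookups system-wide; the same effect could be approximated inside the agent by caching the resolved address between retry attempts, so a per-second retry loop does not trigger a fresh A/AAAA lookup each time. A hypothetical sketch, assuming an illustrative dns_cache struct and resolve_cached helper (neither exists in Zabbix):

```c
/* Hypothetical sketch: resolve the proxy name once and reuse the cached
 * getaddrinfo() result for a fixed period. All names are illustrative. */
#include <netdb.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

struct dns_cache {
    char            host[256];      /* hostname the entry belongs to    */
    struct addrinfo *ai;            /* cached lookup result, or NULL    */
    time_t          resolved_at;    /* when the lookup was performed    */
    int             ttl;            /* seconds to trust the cached entry */
};

/* Return cached addresses if still fresh, otherwise do one real lookup. */
static struct addrinfo *resolve_cached(struct dns_cache *c, const char *host)
{
    time_t  now = time(NULL);

    if (NULL != c->ai && 0 == strcmp(c->host, host) &&
            now - c->resolved_at < c->ttl)
        return c->ai;       /* fresh entry: no DNS traffic generated */

    if (NULL != c->ai)
        freeaddrinfo(c->ai);

    if (0 != getaddrinfo(host, NULL, NULL, &c->ai))
    {
        c->ai = NULL;       /* lookup failed; caller must handle NULL */
        return NULL;
    }

    strncpy(c->host, host, sizeof(c->host) - 1);
    c->resolved_at = now;
    return c->ai;
}
```

Repeated calls within the TTL return the cached result directly, so even a one-second reconnect loop would only generate one DNS lookup per TTL window per hostname.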