[ZBX-16220] Add zabbix agent logging about active items queue not send under high network load Created: 2019 Jun 05 Updated: 2024 Apr 10 Resolved: 2019 Oct 28 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Agent (G) |
Affects Version/s: | 4.0.7, 4.0.8 |
Fix Version/s: | None |
Type: | Problem report | Priority: | Major |
Reporter: | patrik uytterhoeven | Assignee: | Andrejs Tumilovics |
Resolution: | Won't fix | Votes: | 0 |
Labels: | activeagent, logging, network | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified |
Attachments: |
![]() ![]() ![]() ![]() |
||||||||||||||||
Issue Links: |
|
||||||||||||||||
Team: | |||||||||||||||||
Sprint: | Sprint 56 (Sep 2019) | ||||||||||||||||
Story Points: | 0.25 |
Description |
Steps to reproduce:
Result: telnet to port 10051 will still work in zabbix agent no errors are visible zabbix-get will show items with data once the rsync is stopped everything will work
the solution was to run tuned-adm profile network-latency
the expected behaviour would be zabbix agent complaining that it cannot send information to the zabbix server |
Comments |
Comment by Edgar Akhmetshin [ 2019 Jun 11 ] |
Hello Patrik, Thank you for reporting the issue! Could you clarify:
Tuned-adm changes the following settings regarding network: [main] summary=Optimize for deterministic performance at the cost of increased power consumption, focused on low latency network performance include=latency-performance [vm] transparent_hugepages=never [sysctl] net.core.busy_read=50 net.core.busy_poll=50 net.ipv4.tcp_fastopen=3 kernel.numa_balancing=0 [bootloader] cmdline=skew_tick=1 Mostly this problem looks like poor network performance on high bandwidth workloads. Regards, |
Comment by patrik uytterhoeven [ 2019 Jun 11 ] |
Hi. On holiday ATM but this happend on a virtual machine bandwidth is not the issue as we have 10Gbit and the machine was reachable. I was running 4 rsyncs at that time with the latest centos 7. But barman was running also with a few hundred databases Will add more info later |
Comment by patrik uytterhoeven [ 2019 Jul 01 ] |
CentOS Linux release 7.6.1810 (Core) Linux 3.10.0-957.10.1.el7.x86_64 #1 SMP Mon Mar 18 15:06:45 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
i will do the other tests later when i need to sync data again |
Comment by patrik uytterhoeven [ 2019 Jul 17 ] |
Hi, same issue on a new setup barman server that started to take full backups server per server with 1 rsync active but many ssh sessions to retrieve postgres wal files kernel : 3.10.0-957.10.1.el7.x86_64 #1 SMP Mon Mar 18 15:06:45 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux CentOs: CentOS Linux release 7.6.1810 (Core) network: is kvm based, standard virtio driver
|
Comment by Edgar Akhmetshin [ 2019 Jul 19 ] |
Hello Patrik, I see buffer overruns, network IRQ handling only by one core without SR-IOV virtual TX/RX queue balancing supported by virtio, emulated QEMU Virtual CPU version 2.5+, and default network stack configuration (buffers, queues, etc...). Please, use CGROUP policy to assign enough resources for Zabbix Agent or optimise VM network performance. Be advised that this section of the tracker is for bug reports only. The case you have submitted can not be qualified as one, so please reach out to [email protected] for commercial support or consultancy services. Alternatively, you can also use our IRC channel or community forum (https://www.zabbix.com/forum) for assistance. With that said, we are closing this ticket. Thank you for understanding. Regards, |
Comment by patrik uytterhoeven [ 2019 Jul 19 ] |
i know that this is an issue but is this not the point of monitoring to get warnings when something goes wrong ? now i get nothing zabbix does not complain about not receiving any data from my agent and my agent behaves as nothing is going on either.
there are communication issues and the zabbix agent should notifiy this in the logs |
Comment by patrik uytterhoeven [ 2019 Jul 19 ] |
this is a bug
|
Comment by Edgar Akhmetshin [ 2019 Jul 22 ] |
Hello Patrik, nodata() trigger function can be used to avoid situation with Zabbix Agent (active) unable to send data, also log file will contain information about connection error problems. Please, provide Agent log file for the same time period with at least DebugLevel=3, while backup is in progress and you have no data from agent. Regards, |
Comment by patrik uytterhoeven [ 2019 Jul 23 ] |
Edgar
Like i mentioned in my initial post thats the problem there is nothing in the logs the agent behaves as there is no issue i know that there is a nodata trigger but thats not the ideal solution as you need it for active and passive and in the standard template for agent it is on passive not active a status icon is what is missing for active and a warning in the logs that data sent is not received by the zabbix server
i suppose that there is no verification on the data that is sent to the zabbix server if it was received or not |
Comment by Edgar Akhmetshin [ 2019 Jul 23 ] |
Hello
What is the debug level in the agent configuration and what timeout for active checks?
Looks like ZBXNEXT request. Found similar ZBXNEXT-3397 to be able to check Agent availability in active mode, please vote for it. Regards, |
Comment by patrik uytterhoeven [ 2019 Jul 23 ] |
Hi Edgar
de debug level is the standard DebugLevel=3
and the timeout was maxed to Timeout=30
|
Comment by patrik uytterhoeven [ 2019 Jul 26 ] |
https://support.zabbix.com/browse/ZBXNEXT-1686 this could also be a solution in the hope that zabbix agent collects the data for now i have increased buffers and added a nodata trigger for active items in the hope to bet some information from zabbix agent |
Comment by patrik uytterhoeven [ 2019 Jul 26 ] |
and after some testing its probably not under the high load but the problem imho is here https://support.zabbix.com/browse/ZBXNEXT-691 when removing active items and disabling the scripts on the agent everything seems to work fine but when more active checks come into play we lose item data
btw the original ticket has the code to add this to the logging of the agent something i can't find back in the zabbix agent code of 4.2.x |
Comment by Edgar Akhmetshin [ 2019 Aug 13 ] |
Hello Patrik, Agree with you, no errors if network has poor performance but connection is working. I'm able to reproduce with network conditioner for VirtualBOX and Parallels. Additional logging should be added to avoid such behaviour. Thank you. Regards, |
Comment by patrik uytterhoeven [ 2019 Aug 14 ] |
Hi Edgar
thx for the update
|
Comment by Vladislavs Sokurenko [ 2019 Oct 10 ] |
You should have seen following line in the agent log patrik but reported only once until it recovers: 14576:20191010:111638.042 active check configuration update from [127.0.0.1:10051] started to fail (cannot connect to [[127.0.0.1]:10051]: [111] Connection refused) If we are not missing something and there were such lines in the log then this should have been Won't Fix |
Comment by Vladislavs Sokurenko [ 2019 Oct 10 ] |
Description of changes with default debug level 3 Was: 38956:20191010:113141.317 active check data upload to [127.0.0.1:10051] started to fail ([connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused) 38956:20191010:113229.679 active check configuration update from [127.0.0.1:10051] started to fail (cannot connect to [[127.0.0.1]:10051]: [111] Connection refused) Now: 38956:20191010:113141.317 active check data upload to [127.0.0.1:10051] started to fail ([connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused) 38956:20191010:113141.317 send value error: [connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused 38956:20191010:113151.575 send value error: [connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused 38956:20191010:113201.841 send value error: [connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused 38956:20191010:113211.078 send value error: [connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused 38956:20191010:113221.423 send value error: [connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused 38956:20191010:113229.679 active check configuration update from [127.0.0.1:10051] started to fail (cannot connect to [[127.0.0.1]:10051]: [111] Connection refused) 38956:20191010:113231.758 send value error: [connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused 38956:20191010:113241.169 send value error: [connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused 38956:20191010:113251.534 send value error: [connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused 38956:20191010:113301.869 send value error: [connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused 38956:20191010:113311.131 send value error: [connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused 38956:20191010:113321.427 send value error: [connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused 38956:20191010:113331.789 send value error: [connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused 38956:20191010:113341.090 send value error: [connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused |
Comment by patrik uytterhoeven [ 2019 Oct 10 ] |
@vladislavs Sokurenko that error only happens if the connection is broken complete in case you have to many active checks (more then the agent can handle) you will not get data in zabbix server and you will not see errors in the logs |
Comment by Andrejs Tumilovics [ 2019 Oct 10 ] |
patrik We are asking about Zabbix agent logs specifically. 14576:20191010:111638.042 active check configuration update from [127.0.0.1:10051] started to fail (cannot connect to [[127.0.0.1]:10051]: [111] Connection refused) As vso mentioned, this line should appear only once, when active metric sending starts to fail. |
Comment by patrik uytterhoeven [ 2019 Oct 10 ] |
no nothing that was the problem there was nothing to be seen in the agent even after restarting the agent switching part of the items to passive was the only solution to solve the issue |
Comment by Vladislavs Sokurenko [ 2019 Oct 10 ] |
If there is no such error then it means that data was successfully transferred, maybe it was simply very slow and you would like to see warnings about that ? Have you also checked administration, queue it should have indicated that there are problems. Are you using default BufferSize parameter value ? |
Comment by patrik uytterhoeven [ 2019 Oct 10 ] |
data was only coming back after 2 day this happend on multiple systems even after rebooting the agent multiple times it would not work we had proxies between agent and server but even removing them would not give a solution now it works reliable after changing some LLD rules from Active to passive
we have about 2000 LLD items per host discovery rule runs every hour discovery item runs every hour |
Comment by Vladislavs Sokurenko [ 2019 Oct 10 ] |
Perhaps it is related to having so many checks that could potentially run for a long time, what was the BufferSize parameter ? |
Comment by patrik uytterhoeven [ 2019 Oct 10 ] |
we started with standard and increased it later but without success can't remember to what sizes we went but even then if the buffer was to small you would expect an error in the logs |
Comment by Andrejs Tumilovics [ 2019 Oct 14 ] |
To investigate this further Zabbix agent debug logs would be very helpful. Thank you! |
Comment by patrik uytterhoeven [ 2019 Oct 14 ] |
probably not as we changed the templates to passive so things work as it should and this week we will do the upgrade to 4.4
i will change it back to active in the hope to see some errors again but once agent 2 is available we will do the upgrade to v2 as there are some functionalities in v2 that we payed for. |
Comment by Andrejs Tumilovics [ 2019 Oct 14 ] |
The proper fix for this issue is planned in |
Comment by patrik uytterhoeven [ 2019 Oct 15 ] |
the Agent 2 at this moment is only available for RHEL 8 , ubuntu 18 will there be a package for RHEL 7 and ubuntu 16 ..... palivoda Please raise new ZBXNEXT. Agent2 just released. |
Comment by patrik uytterhoeven [ 2019 Oct 15 ] |
Comment by richlv [ 2019 Oct 28 ] |
Does this mean that the current agent will be deprecated and replaced by the Go agent? |
Comment by richlv [ 2019 Oct 29 ] |
Additionally, this issue is listed as "Fixed" in https://www.zabbix.com/rn/rn4.0.14rc1 - that seems incorrect. |
Comment by Alexander Vladishev [ 2019 Oct 29 ] |
richlv, Thanks a lot! Already fixed in ChangeLog files (4f75d38be55, 63634816456, 645c519afe5 and d7c40a741d4). Release notes will be updated soon. |
Comment by richlv [ 2019 Oct 29 ] |
Great, thank you Alexander. Any news on the agent deprecation? |