[ZBX-16220] Add zabbix agent logging about active items queue not send under high network load Created: 2019 Jun 05  Updated: 2024 Apr 10  Resolved: 2019 Oct 28

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Agent (G)
Affects Version/s: 4.0.7, 4.0.8
Fix Version/s: None

Type: Problem report Priority: Major
Reporter: patrik uytterhoeven Assignee: Andrejs Tumilovics
Resolution: Won't fix Votes: 0
Labels: activeagent, logging, network
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File ZBX-16220.txt     File generate_items.py     File slow_checks.diff     XML File zbx_export_hosts.xml    
Issue Links:
Causes
caused by ZBXNEXT-1686 Persistent queue for active items in ... Closed
caused by ZBXNEXT-691 Agent: parallel active checks Closed
caused by ZBXNEXT-3397 Ability to use agent.ping both as act... Open
Team: Team C
Sprint: Sprint 56 (Sep 2019)
Story Points: 0.25

 Description   

Steps to reproduce:

  1. on centos 7 run multiple rsyncs 

Result:
active data will not be populated in the zabbix server passive will still work.

telnet to port 10051 will still work

in zabbix agent no errors are visible 

zabbix-get will show items with data once the rsync is stopped everything will work 

 

the solution was to run tuned-adm profile network-latency

 

the expected behaviour would be zabbix agent complaining that it cannot send information to the zabbix server 



 Comments   
Comment by Edgar Akhmetshin [ 2019 Jun 11 ]

Hello Patrik,

Thank you for reporting the issue! Could you clarify:

  • exact CentOS version and kernel used;
  • network adapter model and brand, driver version and speed;
  • total amount of traffic needed with rsync to stop active checks from working;
  • information from the following commands:
    cat /proc/interrupts >> /tmp/ZBX-16220.txt;
    cat /proc/cpuinfo >> /tmp/ZBX-16220.txt;
    free -m >> /tmp/ZBX-16220.txt;
    sysctl -a >> /tmp/ZBX-16220.txt;
    netstat -ntu >> /tmp/ZBX-16220.txt;
    ip -s a >> /tmp/ZBX-16220.txt;
    ip -s l >> /tmp/ZBX-16220.txt;
    ip tcp_metrics >> /tmp/ZBX-16220.txt;
    netstat -s >> /tmp/ZBX-16220.txt;
    
  • maybe you have network interface performance graphs for the mentioned period of time with rsync load;

Tuned-adm changes the following settings regarding network:

[main]
summary=Optimize for deterministic performance at the cost of increased power consumption, focused on low latency network performance
include=latency-performance

[vm]
transparent_hugepages=never

[sysctl]
net.core.busy_read=50
net.core.busy_poll=50
net.ipv4.tcp_fastopen=3
kernel.numa_balancing=0

[bootloader]
cmdline=skew_tick=1

Mostly this problem looks like poor network performance on high bandwidth workloads.

Regards,
Edgar

Comment by patrik uytterhoeven [ 2019 Jun 11 ]

Hi.

On holiday ATM but this happend on a virtual machine bandwidth is not the issue as we have 10Gbit and the machine was reachable. I was running 4 rsyncs at that time with the latest centos 7. But barman was running also with a few hundred databases

Will add more info later

Comment by patrik uytterhoeven [ 2019 Jul 01 ]

CentOS Linux release 7.6.1810 (Core)

Linux  3.10.0-957.10.1.el7.x86_64 #1 SMP Mon Mar 18 15:06:45 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

 

i will do the other tests later when i need to sync data again 

Comment by patrik uytterhoeven [ 2019 Jul 17 ]

ZBX-16220.txt

 

Hi,  same issue on a new setup barman server that started to take full backups server per server with 1 rsync active but many ssh sessions to retrieve postgres wal files

kernel :  3.10.0-957.10.1.el7.x86_64 #1 SMP Mon Mar 18 15:06:45 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

CentOs: CentOS Linux release 7.6.1810 (Core)

network: is kvm based, standard virtio driver

 

 

Comment by Edgar Akhmetshin [ 2019 Jul 19 ]

Hello Patrik,

I see buffer overruns, network IRQ handling only by one core without SR-IOV virtual TX/RX queue balancing supported by virtio, emulated QEMU Virtual CPU version 2.5+, and default network stack configuration (buffers, queues, etc...). Please, use CGROUP policy to assign enough resources for Zabbix Agent or optimise VM network performance.

Be advised that this section of the tracker is for bug reports only. The case you have submitted can not be qualified as one, so please reach out to [email protected] for commercial support or consultancy services. Alternatively, you can also use our IRC channel or community forum (https://www.zabbix.com/forum) for assistance. With that said, we are closing this ticket. Thank you for understanding.

Regards,
Edgar

Comment by patrik uytterhoeven [ 2019 Jul 19 ]

i know that this is an issue

but is this not the point of monitoring to get warnings when something goes wrong ?

now i get nothing zabbix does not complain about not receiving any data from my agent 

and my agent behaves as nothing is going on either.

 

there are communication issues and the zabbix agent should notifiy this in the logs 

Comment by patrik uytterhoeven [ 2019 Jul 19 ]

this is a bug

 

Comment by Edgar Akhmetshin [ 2019 Jul 22 ]

Hello Patrik,

nodata() trigger function can be used to avoid situation with Zabbix Agent (active) unable to send data, also log file will contain information about connection error problems. Please, provide Agent log file for the same time period with at least DebugLevel=3, while backup is in progress and you have no data from agent.

Regards,
Edgar

Comment by patrik uytterhoeven [ 2019 Jul 23 ]

Edgar 

 

Like i mentioned in my initial post

thats the problem there is nothing in the logs the agent behaves as there is no issue

i know that there is a nodata trigger but thats not the ideal solution as you need it for active and passive and in the standard template for agent it is on passive not active

a status icon is what is missing for active 

and a warning in the logs that data sent is not received by the zabbix server

 

i suppose that there is no verification on the data that is sent to the zabbix server if it was received or not 

Comment by Edgar Akhmetshin [ 2019 Jul 23 ]

Hello

thats the problem there is nothing in the logs the agent behaves as there is no issue

What is the debug level in the agent configuration and what timeout for active checks?

a status icon is what is missing for active

and a warning in the logs that data sent is not received by the zabbix server

Looks like ZBXNEXT request. Found similar ZBXNEXT-3397 to be able to check Agent availability in active mode, please vote for it.

Regards,
Edgar

Comment by patrik uytterhoeven [ 2019 Jul 23 ]

Hi Edgar

 

de debug level is the standard 

DebugLevel=3

 

and the timeout was maxed to

Timeout=30

 

Comment by patrik uytterhoeven [ 2019 Jul 26 ]

https://support.zabbix.com/browse/ZBXNEXT-1686

this could also be a solution in the hope that zabbix agent collects the data 

for now i have increased buffers and added a nodata trigger for active items in the hope to bet some information from zabbix agent 

Comment by patrik uytterhoeven [ 2019 Jul 26 ]

and after some testing its probably not under the high load but the problem imho is here

https://support.zabbix.com/browse/ZBXNEXT-691

when removing active items and disabling the scripts on the agent everything seems to work fine 

but when more active checks come into play we lose item data

 

btw the original ticket has the code to add this to the logging of the agent

something i can't find back in the zabbix agent code of 4.2.x

Comment by Edgar Akhmetshin [ 2019 Aug 13 ]

Hello Patrik,

Agree with you, no errors if network has poor performance but connection is working. I'm able to reproduce with network conditioner for VirtualBOX and Parallels.

Additional logging should be added to avoid such behaviour. Thank you.

Regards,
Edgar

Comment by patrik uytterhoeven [ 2019 Aug 14 ]

Hi Edgar

 

thx for the update

 

Comment by Vladislavs Sokurenko [ 2019 Oct 10 ]

You should have seen following line in the agent log patrik but reported only once until it recovers:

14576:20191010:111638.042 active check configuration update from [127.0.0.1:10051] started to fail (cannot connect to [[127.0.0.1]:10051]: [111] Connection refused)

If we are not missing something and there were such lines in the log then this should have been Won't Fix

Comment by Vladislavs Sokurenko [ 2019 Oct 10 ]

Description of changes with default debug level 3

Was:

38956:20191010:113141.317 active check data upload to [127.0.0.1:10051] started to fail ([connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused)
 38956:20191010:113229.679 active check configuration update from [127.0.0.1:10051] started to fail (cannot connect to [[127.0.0.1]:10051]: [111] Connection refused)

Now:

 38956:20191010:113141.317 active check data upload to [127.0.0.1:10051] started to fail ([connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused)
 38956:20191010:113141.317 send value error: [connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused
 38956:20191010:113151.575 send value error: [connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused
 38956:20191010:113201.841 send value error: [connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused
 38956:20191010:113211.078 send value error: [connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused
 38956:20191010:113221.423 send value error: [connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused
 38956:20191010:113229.679 active check configuration update from [127.0.0.1:10051] started to fail (cannot connect to [[127.0.0.1]:10051]: [111] Connection refused)
 38956:20191010:113231.758 send value error: [connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused
 38956:20191010:113241.169 send value error: [connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused
 38956:20191010:113251.534 send value error: [connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused
 38956:20191010:113301.869 send value error: [connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused
 38956:20191010:113311.131 send value error: [connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused
 38956:20191010:113321.427 send value error: [connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused
 38956:20191010:113331.789 send value error: [connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused
 38956:20191010:113341.090 send value error: [connect] cannot connect to [[127.0.0.1]:10051]: [111] Connection refused
Comment by patrik uytterhoeven [ 2019 Oct 10 ]

@vladislavs Sokurenko 

that error only happens if the connection is broken complete

in case you have to many active checks (more then the agent can handle) you will not get data in zabbix server

and you will not see errors in the logs

Comment by Andrejs Tumilovics [ 2019 Oct 10 ]

patrik We are asking about Zabbix agent logs specifically.
When described problem happens, and Zabbix server does not receive active metrics, do you have below message in Zabbix agent logs?

14576:20191010:111638.042 active check configuration update from [127.0.0.1:10051] started to fail (cannot connect to [[127.0.0.1]:10051]: [111] Connection refused)

As vso mentioned, this line should appear only once, when active metric sending starts to fail.

Comment by patrik uytterhoeven [ 2019 Oct 10 ]

no nothing

that was the problem 

there was nothing to be seen in the agent even after restarting the agent

switching part of the items to passive was the only solution to solve the issue

Comment by Vladislavs Sokurenko [ 2019 Oct 10 ]

If there is no such error then it means that data was successfully transferred, maybe it was simply very slow and you would like to see warnings about that ? Have you also checked administration, queue it should have indicated that there are problems. Are you using default BufferSize parameter value ?

Comment by patrik uytterhoeven [ 2019 Oct 10 ]

data was only coming back after 2 day

this happend on multiple systems even after rebooting the agent multiple times it would not work

we had proxies between agent and server but even removing them would not give a solution

now it works reliable after changing some LLD rules from Active to passive 

 

we have about 2000 LLD items per host 

discovery rule runs every hour 

discovery item runs every hour

Comment by Vladislavs Sokurenko [ 2019 Oct 10 ]

Perhaps it is related to having so many checks that could potentially run for a long time, what was the BufferSize parameter ?

Comment by patrik uytterhoeven [ 2019 Oct 10 ]

we started with standard and increased it later but without success can't remember to what sizes we went

but even then if the buffer was to small you would expect an error in the logs 

Comment by Andrejs Tumilovics [ 2019 Oct 14 ]

To investigate this further Zabbix agent debug logs would be very helpful.
patrik could you please attach Zabbix agent debug logs please, when this issue appears again.
You can temporarily enable these by issuing "zabbix_agentd -R log_level_increase".

Thank you!

Comment by patrik uytterhoeven [ 2019 Oct 14 ]

probably not as we changed the templates to passive 

so things work as it should and this week we will do the upgrade to 4.4

 

i will change it back to active in the hope to see some errors again but once agent 2 is available we will do the upgrade to v2 as there are some functionalities in v2 that we payed for.

Comment by Andrejs Tumilovics [ 2019 Oct 14 ]

The proper fix for this issue is planned in ZBXNEXT-691.
All we can do here is add one more log line if agent active checks thread cannot handle all his checks right in time slow_checks.diff.
Agent2 (Go agent) doesn't have such problem, because it runs active checks in parallel.
So, my suggestion is to close this ticket without fix.

Comment by patrik uytterhoeven [ 2019 Oct 15 ]

the Agent 2 at this moment is only available for RHEL 8 , ubuntu 18

will there be a package for RHEL 7 and ubuntu 16 .....

palivoda Please raise new ZBXNEXT. Agent2 just released.

Comment by patrik uytterhoeven [ 2019 Oct 15 ]

https://support.zabbix.com/browse/ZBX-16772

Comment by richlv [ 2019 Oct 28 ]

Does this mean that the current agent will be deprecated and replaced by the Go agent?

Comment by richlv [ 2019 Oct 29 ]

Additionally, this issue is listed as "Fixed" in https://www.zabbix.com/rn/rn4.0.14rc1 - that seems incorrect.

Comment by Alexander Vladishev [ 2019 Oct 29 ]

richlv, Thanks a lot!

Already fixed in ChangeLog files (4f75d38be55, 63634816456, 645c519afe5 and d7c40a741d4). Release notes will be updated soon.

Comment by richlv [ 2019 Oct 29 ]

Great, thank you Alexander.

Any news on the agent deprecation?

Generated at Sat Apr 27 08:24:30 EEST 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.