[ZBX-15869] proc.cpu.util +AMD 4 Core 3.5 Ghz processor Created: 2019 Mar 25  Updated: 2024 Apr 10  Resolved: 2019 Oct 18

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Agent (G)
Affects Version/s: 3.0.25
Fix Version/s: None

Type: Incident report Priority: Minor
Reporter: Brian Gilbert Assignee: Alex Kalimulin
Resolution: Cannot Reproduce Votes: 0
Labels: agent, cpu, zabbix
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

HP DL385p Gen8 AMD 4 Core 3.5 Ghz


Attachments: PNG File image-2019-03-25-09-54-02-587.png     PNG File image-2019-03-25-10-29-53-572.png     PNG File image-2019-03-25-10-41-54-270.png     Text File strace_no_proc_cpu.log     Zip Archive strace_proc_cpu.zip     Text File tfabrtu4 cpu zabbixagent.log     File zabbix_agentd.log.coll_debug     File zabbix_agentd.log.debug    
Team: Team I
Sprint: Sprint 56 (Sep 2019), Sprint 55 (Aug 2019), Sprint 51 (Apr 2019), Sprint 52 (May 2019), Sprint 53 (Jun 2019), Sprint 54 (Jul 2019), Sprint 57 (Oct 2019)

 Description   

Steps to reproduce:

  1. Use "proc.cpu.util" to monitor the CPU used by a process. This causes a spike in CPU utilization by the Zabbix agent collector process on the monitored server.
  2. Even after the item is disabled, CPU usage does not return to normal until I restart the Zabbix agent.

CPU Normal - Item disabled

CPU Spiked - While Monitoring the process




 Comments   
Comment by Brian Gilbert [ 2019 Mar 26 ]

tfabrtu4 cpu zabbixagent.log

 

Debug level 5 logs for tfabrtu4.  Started around 12:18…you see this

26667:20190325:121801.571 Requested [proc.cpu.util[zabbix_agentd]]

 

Then this

 

[root@tfabrtu4 ~]# top |grep zabbix
26664 root      20   0 75132 1232  824 S  0.6  0.0   0:00.07 zabbix_agentd
26664 root      20   0 78500 4808  944 S 20.0  0.0   0:00.76 zabbix_agentd
26664 root      20   0 78500 4812  948 S 28.6  0.0   0:01.73 zabbix_agentd
26664 root      20   0 75332 1552  948 R 19.2  0.0   0:02.37 zabbix_agentd
26664 root      20   0 78500 4816  952 S 27.3  0.0   0:03.29 zabbix_agentd
[root@tfabrtu4 ~]# date
Mon Mar 25 12:18:20 EDT 2019
[root@tfabrtu4 ~]#

Comment by Brian Gilbert [ 2019 Apr 02 ]

Used this command 2 times (our default log level is 3)

zabbix_agentd -R log_level_increase=collector...

Attached is the log

zabbix_agentd.log.coll_debug

Comment by Vladislavs Sokurenko [ 2019 Apr 02 ]

How much CPU does top itself consume compared to the Zabbix agent?

Comment by Alex Kalimulin [ 2019 Apr 03 ]

zeb1026, the new log is interesting. Take a look at time around 10:13:03-05:

 30540:20190402:101303.220 In zbx_proc_get_processes()
 30540:20190402:101303.245 End of zbx_proc_get_processes(): SUCCEED, processes:3862

The agent spent 25ms in zbx_proc_get_processes(), which is reasonable given the number of processes you're running. Now this:

 30540:20190402:101305.609 In zbx_proc_get_processes()
 30540:20190402:101305.964 End of zbx_proc_get_processes(): SUCCEED, processes:3862

That's 355 ms (14 times more!) for the same number of processes. The only thing zbx_proc_get_processes() does is access process info in /proc, so it looks like the kernel has drastically slowed down access to procfs for some reason. Here are the stats from your log (the left column is the time as HHMMSSmmm, the right column is the time in ms spent in zbx_proc_get_processes()):

101252785 28 ms
101253833 26 ms
101254877 24 ms
101255918 28 ms
101256972 24 ms
101258013 25 ms
101259057 24 ms
101300098 23 ms
101301138 24 ms
101302179 24 ms
101303220 25 ms
101304264 334 ms
101305609 355 ms
101306984 353 ms
101308350 337 ms
101309702 380 ms
101311095 347 ms
101312456 485 ms
101313957 483 ms
101315459 398 ms
101316870 402 ms
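
The table above can be rebuilt from any debug-level agent log by pairing each "In zbx_proc_get_processes()" line with the following "End of zbx_proc_get_processes()" line from the same PID. The following is a minimal sketch, assuming the log line shape seen in this ticket (PID:YYYYMMDD:HHMMSS.mmm message); the function name scan_durations is hypothetical and is not part of Zabbix.

```python
import re
from datetime import datetime

# Debug log line shape seen in this ticket: PID:YYYYMMDD:HHMMSS.mmm message
LINE_RE = re.compile(r"^\s*(\d+):(\d{8}):(\d{6}\.\d{3}) (.*)$")
FUNC = "zbx_proc_get_processes()"


def scan_durations(lines, func=FUNC):
    """Return (start_time, elapsed_ms) for each In/End pair of `func`, per PID."""
    started = {}  # pid -> datetime of the pending "In ..." entry
    results = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # not a timestamped agent log line
        pid, day, clock, msg = m.groups()
        stamp = datetime.strptime(day + clock, "%Y%m%d%H%M%S.%f")
        if msg.startswith("In " + func):
            started[pid] = stamp
        elif msg.startswith("End of " + func) and pid in started:
            begin = started.pop(pid)
            elapsed_ms = round((stamp - begin).total_seconds() * 1000)
            # Format the start time as HHMMSSmmm, like the left column above
            results.append((begin.strftime("%H%M%S%f")[:-3], elapsed_ms))
    return results
```

Feeding it the four log lines quoted earlier in this comment yields 25 ms for the 10:13:03 call and 355 ms for the 10:13:05 call.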

Now the question is: do you have any idea what happened on your server at 10:13:04? Do you have any Zabbix stats or graphs showing CPU load, jumps, etc.?

Comment by Brian Gilbert [ 2019 Apr 04 ]

That is when I enabled the item proc.cpu.util[zabbix_agentd], collected every 2 minutes.

Comment by Alex Kalimulin [ 2019 Apr 05 ]

zeb1026, can you please do the following:

  1. Disable your proc.cpu.util[] items, restart the agent, increase log level of collector to 5, note the collector's pid (ps -fe|grep zabbix_agent|grep collector) and collect the info with strace:
    strace -r -T -p PID -o strace.log

    Leave strace running for 3-5 minutes.

  2. Enable the proc.cpu.util[] items, reload configuration cache and attach to this PID with strace once again (you can redirect strace output to another log file), run for another 3-5 minutes.

Then please attach zabbix_agentd.log and both strace logs here.
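
To compare the two strace logs, the time spent per system call can be totalled from the `<seconds>` suffix that `-T` appends to each completed call. The following is a minimal sketch, assuming the usual `strace -r -T` line shape without a PID prefix (a relative timestamp, the call, then the duration in angle brackets); the helper name summarize_strace is hypothetical.

```python
import re
from collections import defaultdict

# Assumed shape of a completed call in `strace -r -T` output:
#      0.000058 read(5, "..."..., 1024) = 312 <0.000021>
CALL_RE = re.compile(r"^\s*(\d+\.\d+)\s+(\w+)\(.*<(\d+\.\d+)>\s*$")


def summarize_strace(lines):
    """Return {syscall: (count, total_seconds)} from strace -r -T lines."""
    totals = defaultdict(lambda: [0, 0.0])
    for line in lines:
        m = CALL_RE.match(line)
        if not m:
            continue  # signals, unfinished calls, exit lines
        _, name, spent = m.groups()
        totals[name][0] += 1
        totals[name][1] += float(spent)
    return {name: (count, secs) for name, (count, secs) in totals.items()}
```

Running it over both logs and comparing the totals for the calls that touch /proc (for example open and read) would show whether the extra time is spent inside the kernel or between calls.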

Comment by Brian Gilbert [ 2019 Apr 11 ]

Attached are the Zabbix agent log and 2 strace logs...

Hmm, one of the traces is too big (78M). I will try to zip it first.

strace_proc_cpu.zip

Comment by Alex Kalimulin [ 2019 Apr 16 ]

zeb1026, thanks for the files. I don't see any anomalies here and cannot reproduce the problem locally so far. Even very slow configurations consistently give me 5 to 10 times better results in the environments with several thousand processes. The problem seems to be slow access to /proc/pid/stat entries.
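
To check procfs access speed independently of the agent, the time needed to read every /proc/pid/stat entry can be measured directly. The following is a minimal sketch and only an approximation of the access pattern, not the agent's actual code; the function name time_stat_scan and its proc_root parameter are hypothetical.

```python
import os
import time


def time_stat_scan(proc_root="/proc"):
    """Read <proc_root>/<pid>/stat for every numeric entry.

    Returns (number_of_entries_read, elapsed_milliseconds).
    """
    start = time.perf_counter()
    count = 0
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, name, "stat"), "rb") as fh:
                fh.read()
            count += 1
        except OSError:
            pass  # process exited between listdir() and open()
    return count, (time.perf_counter() - start) * 1000.0
```

Calling time_stat_scan() once a second for a few minutes, with and without the proc.cpu.util[] item enabled, would show whether the slowdown is visible outside the agent as well.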

What distro and kernel version are you running?

Comment by Brian Gilbert [ 2019 May 06 ]

Sorry for the delay in responding....

 

uname -a
Linux tfabrtu4 2.6.32-754.6.3.el6.x86_64 #1 SMP Tue Sep 18 10:29:08 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
[root@tfabrtu4 ~]#

Comment by Brian Gilbert [ 2019 Jun 05 ]

Hi Alexander

Is there anything else I can gather to help you?

Our Zabbix server is Zabbix 3.0.14

[root@k0005089 sbin]# zabbix_proxy -V
zabbix_proxy (Zabbix) 3.0.14
Revision 76338 27 December 2017, compilation time: Dec 27 2017 11:17:20

Agent Version is 3.0.4
[root@tfabrtu4 sbin]# zabbix_agentd -V
zabbix_agentd (daemon) (Zabbix) 3.0.4
Revision 61185 15 July 2016, compilation time: Jul 24 2016 02:57:44

Is there a more recent agent version that is compatible with our zabbix server / proxy?

Comment by Alex Kalimulin [ 2019 Oct 18 ]

zeb1026, I cannot reproduce this problem even though I've tested the exact same kernel version and OS. May I suggest upgrading to the latest version and checking whether the problem persists? If it does, please open a new ticket. As for version compatibility, please see here: https://www.zabbix.com/documentation/4.4/manual/appendix/compatibility

Generated at Wed Apr 24 21:24:58 EEST 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.