[ZBX-10710] system.cpu.util show incorrect utilization Created: 2016 Apr 26  Updated: 2024 Apr 10  Resolved: 2017 Nov 23

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Agent (G)
Affects Version/s: 3.0.2
Fix Version/s: 3.0.14rc1, 3.4.5rc1, 4.0.0alpha1, 4.0 (plan)

Type: Problem report Priority: Critical
Reporter: Dmitry Zykov Assignee: Valdis Kauķis (Inactive)
Resolution: Fixed Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

zabbix_agentd (daemon) (Zabbix) 3.0.2
Revision 59540 20 April 2016, compilation time: Apr 20 2016 14:42:06

CentOS Linux release 7.2.1511 (Core)

Linux 3.10.0-327.13.1.el7.x86_64 #1 SMP Thu Mar 31 16:04:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Intel(R) Xeon(R) CPU D-1540 @ 2.00GHz


Attachments: PNG File Screen Shot 2017-08-27 at 9.37.10 PM.png     PNG File cpu.util wit guest.png     PNG File zabbix-cpu.PNG    
Issue Links:
Duplicate
is duplicated by ZBX-12163 system.cpu.util computed incorrectly Closed
is duplicated by ZBX-11174 CPU graph total is not 100% since Zab... Closed
Team: Team A
Sprint: Sprint 19, Sprint 20, Sprint 21
Story Points: 6

 Description   

The Zabbix agent reports incorrect CPU utilization: about half of what the top command shows.

On agent server:

[root@xxx]# top|grep Cpu
%Cpu(s): 84.6 us,  8.3 sy,  0.0 ni,  5.5 id,  0.0 wa,  0.0 hi,  1.6 si,  0.0 st
%Cpu(s): 82.8 us,  8.7 sy,  0.0 ni,  6.9 id,  0.0 wa,  0.0 hi,  1.6 si,  0.0 st
%Cpu(s): 84.9 us,  9.4 sy,  0.0 ni,  4.0 id,  0.0 wa,  0.0 hi,  1.7 si,  0.0 st
%Cpu(s): 82.7 us,  9.1 sy,  0.0 ni,  6.5 id,  0.0 wa,  0.0 hi,  1.7 si,  0.0 st
%Cpu(s): 84.1 us,  8.1 sy,  0.0 ni,  6.0 id,  0.1 wa,  0.0 hi,  1.7 si,  0.0 st
%Cpu(s): 84.9 us,  8.1 sy,  0.0 ni,  5.6 id,  0.0 wa,  0.0 hi,  1.4 si,  0.0 st
%Cpu(s): 84.3 us,  7.9 sy,  0.0 ni,  6.2 id,  0.0 wa,  0.0 hi,  1.6 si,  0.0 st

At the same time on zabbix server:

[root@yyy bin]# zabbix_get -sxxx -p10050 -k system.cpu.util[]
49.139935
[root@yyy bin]# zabbix_get -sxxx -p10050 -k system.cpu.util[,user]
49.171261
[root@yyy bin]# zabbix_get -sxxx -p10050 -k system.cpu.util[,system]
4.926904
[root@yyy bin]# zabbix_get -sxxx -p10050 -k system.cpu.util[]
49.079076
[root@yyy bin]# zabbix_get -sxxx -p10050 -k system.cpu.util[,user]
49.082120
[root@yyy bin]# zabbix_get -sxxx -p10050 -k system.cpu.util[,system]
4.890176

And the same incorrect values for another server:

[root@xxx2 ~]# top|grep Cpu
%Cpu(s): 96.6 us,  2.7 sy,  0.0 ni,  0.1 id,  0.0 wa,  0.0 hi,  0.6 si,  0.0 st
%Cpu(s): 96.5 us,  2.8 sy,  0.0 ni,  0.2 id,  0.0 wa,  0.0 hi,  0.5 si,  0.0 st
%Cpu(s): 96.3 us,  3.0 sy,  0.0 ni,  0.1 id,  0.0 wa,  0.0 hi,  0.6 si,  0.0 st
%Cpu(s): 94.3 us,  4.6 sy,  0.0 ni,  0.3 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
%Cpu(s): 96.2 us,  2.7 sy,  0.0 ni,  0.2 id,  0.0 wa,  0.0 hi,  1.0 si,  0.0 st
%Cpu(s): 96.2 us,  2.7 sy,  0.0 ni,  0.4 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
%Cpu(s): 95.2 us,  2.9 sy,  0.0 ni,  1.1 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
%Cpu(s): 95.9 us,  2.7 sy,  0.0 ni,  0.7 id,  0.0 wa,  0.0 hi,  0.6 si,  0.0 st
%Cpu(s): 96.4 us,  2.6 sy,  0.0 ni,  0.5 id,  0.0 wa,  0.0 hi,  0.5 si,  0.0 st
[root@ bin]# zabbix_get -sxxx2 -p10050 -k system.cpu.util[]
49.773875
[root@yyy bin]# zabbix_get -sxxx2 -p10050 -k system.cpu.util[,user]
49.734167
[root@yyy bin]# zabbix_get -sxxx2 -p10050 -k system.cpu.util[,system]
1.461466
[root@yyy bin]# zabbix_get -sxxx2 -p10050 -k system.cpu.util[]
49.714295
[root@yyy bin]# zabbix_get -sxxx2 -p10050 -k system.cpu.util[,user]
49.711913
[root@yyy bin]# zabbix_get -sxxx2 -p10050 -k system.cpu.util[,system]
1.471862

Attached is a screenshot of cpu.util from these 2 servers.
I can also reproduce this on other servers, not only on these 2.



 Comments   
Comment by Aleksandrs Saveljevs [ 2016 Apr 27 ]

Did the issue start after an upgrade? Did it work correctly with a previous version of Zabbix?

Comment by Dmitry Zykov [ 2016 Apr 27 ]

This is a new server with the newest agent. I have tried older versions of the agent on this server; the bug appears starting from 3.0.0.

zabbix agent 3.0.1

[root@xxx /]# zabbix_agentd -V
zabbix_agentd (daemon) (Zabbix) 3.0.1
Revision 58734 26 February 2016, compilation time: Feb 28 2016 02:15:42
...
[root@xxx /]# top|grep Cpu
%Cpu(s): 77.6 us,  7.0 sy,  0.0 ni, 14.0 id,  0.0 wa,  0.0 hi,  1.3 si,  0.0 st
%Cpu(s): 79.1 us,  8.1 sy,  0.0 ni, 11.2 id,  0.1 wa,  0.0 hi,  1.4 si,  0.0 st
%Cpu(s): 78.2 us,  7.9 sy,  0.0 ni, 12.5 id,  0.1 wa,  0.0 hi,  1.3 si,  0.0 st
%Cpu(s): 79.1 us,  8.1 sy,  0.0 ni, 11.3 id,  0.1 wa,  0.0 hi,  1.4 si,  0.0 st
%Cpu(s): 76.1 us,  8.2 sy,  0.0 ni, 14.4 id,  0.0 wa,  0.0 hi,  1.2 si,  0.0 st

On the Zabbix server at the same time:

[root@yyy bin]# zabbix_get -sxxx -p10050 -k system.cpu.util[]
45.330975
[root@yyy bin]# zabbix_get -sxxx -p10050 -k system.cpu.util[,user]
45.231920
[root@yyy bin]# zabbix_get -sxxx -p10050 -k system.cpu.util[,system]
5.140191

zabbix agent 3.0.0

[root@xxx /]# zabbix_agentd -V
zabbix_agentd (daemon) (Zabbix) 3.0.0
Revision 58460 15 February 2016, compilation time: Feb 20 2016 04:32:59

[root@xxx /]# top|grep Cpu
%Cpu(s): 24.0 us,  5.3 sy,  0.0 ni, 69.9 id,  0.3 wa,  0.0 hi,  0.6 si,  0.0 st
%Cpu(s): 83.0 us,  8.4 sy,  0.0 ni,  6.8 id,  0.2 wa,  0.0 hi,  1.6 si,  0.0 st
%Cpu(s): 81.9 us,  9.2 sy,  0.0 ni,  6.8 id,  0.4 wa,  0.0 hi,  1.7 si,  0.0 st
%Cpu(s): 81.2 us,  8.0 sy,  0.0 ni,  8.8 id,  0.4 wa,  0.0 hi,  1.6 si,  0.0 st
%Cpu(s): 80.3 us,  8.3 sy,  0.0 ni,  9.6 id,  0.3 wa,  0.0 hi,  1.6 si,  0.0 st
%Cpu(s): 74.5 us,  8.5 sy,  0.0 ni, 14.8 id,  0.5 wa,  0.0 hi,  1.7 si,  0.0 st
%Cpu(s): 78.8 us,  9.7 sy,  0.0 ni,  9.4 id,  0.3 wa,  0.0 hi,  1.8 si,  0.0 st

On the Zabbix server at the same time:

[root@yyy bin]# zabbix_get -sxxx. -p10050 -k system.cpu.util[]
45.510419
[root@yyy bin]# zabbix_get -sxxx. -p10050 -k system.cpu.util[,user]
45.407713
[root@yyy bin]# zabbix_get -sxxx. -p10050 -k system.cpu.util[,system]
4.840851

zabbix agent 2.4.7

[root@xxx /]# zabbix_agentd -V
Zabbix Agent (daemon) v2.4.7 (revision 56694) (12 November 2015)
Compilation time: Nov 13 2015 10:42:17

%Cpu(s): 79.4 us,  8.6 sy,  0.0 ni, 10.0 id,  0.0 wa,  0.0 hi,  2.0 si,  0.0 st
%Cpu(s): 79.7 us,  8.6 sy,  0.0 ni,  9.6 id,  0.1 wa,  0.0 hi,  2.0 si,  0.0 st
%Cpu(s): 80.4 us,  8.1 sy,  0.0 ni,  9.4 id,  0.1 wa,  0.0 hi,  2.0 si,  0.0 st
%Cpu(s): 75.5 us,  9.0 sy,  0.0 ni, 13.4 id,  0.1 wa,  0.0 hi,  2.0 si,  0.0 st
%Cpu(s): 71.3 us,  8.0 sy,  0.0 ni, 18.8 id,  0.1 wa,  0.0 hi,  1.7 si,  0.0 st
%Cpu(s): 76.6 us,  8.4 sy,  0.0 ni, 13.0 id,  0.1 wa,  0.0 hi,  2.0 si,  0.0 st

On the Zabbix server at the same time:

[root@yyy bin]# zabbix_get -sxxx. -p10050 -k system.cpu.util[]
67.133790
[root@yyy bin]# zabbix_get -sxxx. -p10050 -k system.cpu.util[,user]
67.072900
[root@yyy bin]# zabbix_get -sxxx. -p10050 -k system.cpu.util[,system]
8.109336
[root@yyy bin]# zabbix_get -sxxx. -p10050 -k system.cpu.util[]
67.281369
[root@yyy bin]# zabbix_get -sxxx. -p10050 -k system.cpu.util[,user]
67.043161
[root@yyy bin]# zabbix_get -sxxx. -p10050 -k system.cpu.util[,system]
Comment by Aleksandrs Saveljevs [ 2016 Apr 27 ]

Thank you for the information! Do you use any virtualization technologies and are CPUs allocated dynamically?

Comment by Dmitry Zykov [ 2016 Apr 27 ]

Yes, the main role of these servers is KVM virtualization, where CPUs are allocated dynamically.

Comment by Aleksandrs Saveljevs [ 2016 Apr 27 ]

Could you please try to monitor "system.cpu.util[,guest]" and "system.cpu.util[,guest_nice]" items (see https://www.zabbix.com/documentation/3.0/manual/config/items/itemtypes/zabbix_agent )? If you include these items, do the CPU items add up to 100% then?

Comment by Aleksandrs Saveljevs [ 2016 Apr 27 ]

Somewhat related issue: ZBX-9786.

Comment by Dmitry Zykov [ 2016 Apr 28 ]
zabbix_agentd (daemon) (Zabbix) 3.0.2
Revision 59540 20 April 2016, compilation time: Apr 20 2016 14:42:06
# top|grep Cpu
%Cpu(s): 95.1 us,  3.6 sy,  0.0 ni,  0.6 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
%Cpu(s): 95.7 us,  3.3 sy,  0.0 ni,  0.4 id,  0.0 wa,  0.0 hi,  0.6 si,  0.0 st
%Cpu(s): 96.0 us,  3.1 sy,  0.0 ni,  0.2 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
%Cpu(s): 96.1 us,  2.9 sy,  0.0 ni,  0.4 id,  0.0 wa,  0.0 hi,  0.6 si,  0.0 st
%Cpu(s): 94.0 us,  3.8 sy,  0.0 ni,  1.5 id,  0.0 wa,  0.0 hi,  0.6 si,  0.0 st
%Cpu(s): 95.3 us,  3.2 sy,  0.0 ni,  0.8 id,  0.1 wa,  0.0 hi,  0.7 si,  0.0 st
%Cpu(s): 92.9 us,  4.7 sy,  0.0 ni,  1.3 id,  0.0 wa,  0.0 hi,  1.1 si,  0.0 st
%Cpu(s): 92.0 us,  5.9 sy,  0.0 ni,  1.0 id,  0.0 wa,  0.0 hi,  1.0 si,  0.0 st
%Cpu(s): 95.4 us,  3.5 sy,  0.0 ni,  0.2 id,  0.0 wa,  0.0 hi,  1.0 si,  0.0 st
while true; do
    echo -n 'total: '      && zabbix_get -sxxx -p10050 -k 'system.cpu.util[]'
    echo -n 'user: '       && zabbix_get -sxxx -p10050 -k 'system.cpu.util[,user]'
    echo -n 'system: '     && zabbix_get -sxxx -p10050 -k 'system.cpu.util[,system]'
    echo -n 'guest: '      && zabbix_get -sxxx -p10050 -k 'system.cpu.util[,guest]'
    echo -n 'guest_nice: ' && zabbix_get -sxxx -p10050 -k 'system.cpu.util[,guest_nice]'
    echo '------------'
    sleep 1
done

total: 51.249574
user: 51.249574
system: 2.003069
guest: 45.486419
guest_nice: 0.000000
------------
total: 51.242526
user: 51.242526
system: 1.999636
guest: 45.477230
guest_nice: 0.000000
------------
total: 51.212707
user: 51.212707
system: 2.002614
guest: 45.485026
guest_nice: 0.000000
------------
Comment by Aleksandrs Saveljevs [ 2016 Apr 28 ]

Great! So it seems that everything is correct - Zabbix counts "guest" time separately, while "top" seems to add it to "user" time.

A fragment from "man proc":

      /proc/stat
	      kernel/system statistics.  Varies with architecture.  Common entries include:

	      cpu  3357 0 4313 1362393
		     The  amount  of time, measured in units of USER_HZ (1/100ths of a second on most architectures, use sysconf(_SC_CLK_TCK) to obtain the
		     right value), that the system spent in various states:

		     user   (1) Time spent in user mode.

		     nice   (2) Time spent in user mode with low priority (nice).

		     system (3) Time spent in system mode.

		     idle   (4) Time spent in the idle task.  This value should be USER_HZ times the second entry in the /proc/uptime pseudo-file.

		     iowait (since Linux 2.5.41)
			    (5) Time waiting for I/O to complete.

		     irq (since Linux 2.6.0-test4)
			    (6) Time servicing interrupts.

		     softirq (since Linux 2.6.0-test4)
			    (7) Time servicing softirqs.

		     steal (since Linux 2.6.11)
			    (8) Stolen time, which is the time spent in other operating systems when running in a virtualized environment

		     guest (since Linux 2.6.24)
			    (9) Time spent running a virtual CPU for guest operating systems under the control of the Linux kernel.

		     guest_nice (since Linux 2.6.33)
			    (10) Time spent running a niced guest (virtual CPU for guest operating systems under the control of the Linux kernel).
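
As an illustration of the field order in the man page fragment above, here is a minimal Python sketch that maps the aggregate "cpu" line of /proc/stat to named counters. The function name and dictionary keys are illustrative assumptions, not part of Zabbix or procps.

```python
# Field order follows the "man proc" fragment above; the key names are illustrative.
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest", "guest_nice")


def parse_cpu_line(line):
    """Map the aggregate "cpu" line of /proc/stat to named tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("not an aggregate cpu line")
    values = [int(p) for p in parts[1:len(FIELDS) + 1]]
    # Older kernels print fewer columns; treat the missing ones as zero.
    values += [0] * (len(FIELDS) - len(values))
    return dict(zip(FIELDS, values))


# Sample line from the man page fragment (only four columns present).
sample = parse_cpu_line("cpu  3357 0 4313 1362393")
print(sample["user"], sample["idle"], sample["guest"])  # prints: 3357 1362393 0
```

On a live Linux host the same function could be fed the first line of /proc/stat.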
Comment by Dmitry Zykov [ 2016 Apr 28 ]

I added system.cpu.util[,guest] to the graph, and now it seems OK. I have attached a screenshot of this.

But the agent's CPU utilisation type "total" (the default) is still buggy: it does not include "guest" time.

Comment by Dmitry Zykov [ 2016 Apr 28 ]

And the system time is still about half of what top shows.

Comment by Aleksandrs Saveljevs [ 2016 Apr 28 ]

Note that according to https://www.zabbix.com/documentation/3.0/manual/config/items/itemtypes/zabbix_agent the default is not "total" (there is no such value), but "user".

Comment by Aleksandrs Saveljevs [ 2016 Apr 28 ]

Great! So it seems that everything is correct - Zabbix counts "guest" time separately, while "top" seems to add it to "user" time.

The last part seems to be wrong - if "top" shows the same results as Zabbix 2.4, then "top" simply ignores "guest" time, not adds it to "user" time.

Taking a brief look at "top" source code at http://procps.sourceforge.net/index.html seems to confirm it. The program only reads 8 values from /proc/stat:

   num = sscanf(buf, "cpu %Lu %Lu %Lu %Lu %Lu %Lu %Lu %Lu",
      &cpus[Cpu_tot].u,
      &cpus[Cpu_tot].n,
      &cpus[Cpu_tot].s,
      &cpus[Cpu_tot].i,
      &cpus[Cpu_tot].w,
      &cpus[Cpu_tot].x,
      &cpus[Cpu_tot].y,
      &cpus[Cpu_tot].z
   );

Zabbix, since version 3.0, reads 10 values:

sscanf(line, "%*s " ZBX_FS_UI64 " " ZBX_FS_UI64 " " ZBX_FS_UI64 " " ZBX_FS_UI64
		" " ZBX_FS_UI64 " " ZBX_FS_UI64 " " ZBX_FS_UI64 " " ZBX_FS_UI64
		" " ZBX_FS_UI64 " " ZBX_FS_UI64,
		&counter[ZBX_CPU_STATE_USER], &counter[ZBX_CPU_STATE_NICE],
		&counter[ZBX_CPU_STATE_SYSTEM], &counter[ZBX_CPU_STATE_IDLE],
		&counter[ZBX_CPU_STATE_IOWAIT], &counter[ZBX_CPU_STATE_INTERRUPT],
		&counter[ZBX_CPU_STATE_SOFTIRQ], &counter[ZBX_CPU_STATE_STEAL],
		&counter[ZBX_CPU_STATE_GCPU], &counter[ZBX_CPU_STATE_GNICE]);

So it currently seems like there is nothing to fix on Zabbix side.

Comment by Dmitry Zykov [ 2016 Apr 28 ]

Thank you for the help! I'm closing the issue.

Comment by Sergei Turchanov [ 2017 May 24 ]

Your interpretation of man(5) of /proc/stat is incorrect:
user includes guest time
nice includes guest_nice time

This is done for compatibility with legacy software, which reads all fields except the 'guest' ones.
You can verify that in kernel sources kernel/sched/cputime.c:account_guest_time or, for example, in mpstat (from sysstat package) pr_stats.c: print_cpu_stats.

So when Zabbix computes the percentage of 'user', 'sys', 'guest', 'idle', etc., it accounts guest time (and guest nice) TWICE.

First of all, a PROOF:

user, guest, idle queried by zabbix agent

$ while sleep 30; do echo `date` " user: " `zabbix_get -s vserver6 -k 'system.cpu.util[,user,]'`", guest: " `zabbix_get -s vserver6 -k 'system.cpu.util[,guest,]'` ", idle: " `zabbix_get -s vserver6 -k 'system.cpu.util[,idle,]'`; done
Wed May 24 12:01:25 +10 2017  user:  34.077793, guest:  33.849551 , idle:  26.761020
Wed May 24 12:01:55 +10 2017  user:  32.455763, guest:  32.267145 , idle:  29.687551
Wed May 24 12:02:25 +10 2017  user:  32.534687, guest:  32.313557 , idle:  29.268319
Wed May 24 12:02:55 +10 2017  user:  33.791551, guest:  33.507177 , idle:  26.607283
Wed May 24 12:03:25 +10 2017  user:  33.584713, guest:  33.310416 , idle:  27.115921
Wed May 24 12:03:55 +10 2017  user:  33.054221, guest:  32.804419 , idle:  28.190625
Wed May 24 12:04:25 +10 2017  user:  34.233825, guest:  33.984393 , idle:  25.854251
Wed May 24 12:04:55 +10 2017  user:  33.676474, guest:  33.417607 , idle:  27.031996

user, guest, idle queried by mpstat

NOTE: mpstat subtracts guest time from the user time read from /proc/stat (same for guest nice), so it reports real user time.

$ mpstat 30
Linux 3.10.0-229.14.1.el7.x86_64 (vserver6.akod.loc) 	05/24/2017 	_x86_64_	(32 CPU)

12:00:55 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
12:01:25 PM  all    0.31    0.03    6.29    0.75    0.00    1.32    0.00   49.12    0.00   42.18
12:01:55 PM  all    0.25    0.03    6.34    0.53    0.00    1.30    0.00   46.52    0.00   45.03
12:02:25 PM  all    0.41    0.03    6.96    0.85    0.00    1.43    0.00   49.45    0.00   40.86
12:02:55 PM  all    0.45    0.03    6.94    0.60    0.00    1.51    0.00   51.24    0.00   39.23
12:03:25 PM  all    0.36    0.03    6.64    0.74    0.00    1.44    0.00   48.44    0.00   42.34
12:03:55 PM  all    0.38    0.02    6.95    0.49    0.00    1.42    0.00   49.53    0.00   41.20
12:04:25 PM  all    0.38    0.03    6.95    0.64    0.00    1.51    0.00   53.71    0.00   36.78
12:04:55 PM  all    0.40    0.03    6.50    0.60    0.00    1.35    0.00   46.18    0.00   44.95

As you can see:

  • idle is far below the real idle (about 27% vs. 42% in the first sample)
  • guest is off by about 1/3 (about 34% vs. 49% in the first sample)
  • user shows ... who knows what it shows.... (I do actually, see below)

Explanation

When zabbix-agent computes the percentage of a requested metric (user, guest, etc.) in src/zabbix_agent/cpustat.c:get_cpustat, it divides the counter value for that metric by a total computed from all values read from /proc/stat. For example:

idle_pct = IDLE / (USER + SYS + IDLE + GUEST + ...) = idle / ([user + guest] + sys + idle + guest + ...) = idle / (user + sys + idle + 2 * guest + ...)

UPPERCASE - values read from /proc/stat
lowercase - real values accounted by kernel

(... same for the guest nice btw)
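
As a rough cross-check of the formula above, the following Python sketch takes true (mpstat-style) percentages and predicts what a double-counting agent would report. The function name is hypothetical, and the match with the zabbix_get values can only be approximate because the two tools were not necessarily averaging over the same window.

```python
def double_counted(true_pct):
    """Predict pre-fix agent readings from true (mpstat-style) percentages.

    true_pct maps state name -> percentage, with "user" and "nice" already
    excluding guest time, so the values sum to about 100.
    """
    guest = true_pct.get("guest", 0.0)
    gnice = true_pct.get("guest_nice", 0.0)
    # Raw /proc/stat counters: USER = user + guest, NICE = nice + guest_nice.
    raw = dict(true_pct)
    raw["user"] = true_pct.get("user", 0.0) + guest
    raw["nice"] = true_pct.get("nice", 0.0) + gnice
    total = sum(raw.values())  # guest and guest_nice are counted twice here
    return {k: 100.0 * v / total for k, v in raw.items()}


# First mpstat sample from the output above (12:01:25 PM).
mpstat = {"user": 0.31, "nice": 0.03, "system": 6.29, "iowait": 0.75,
          "irq": 0.00, "softirq": 1.32, "steal": 0.00,
          "guest": 49.12, "guest_nice": 0.00, "idle": 42.18}
predicted = double_counted(mpstat)
for key in ("user", "guest", "idle"):
    print(f"{key}: {predicted[key]:.2f}")
```

The predicted values come out at roughly 33 (user), 33 (guest) and 28 (idle), in the same range as the 34.08, 33.85 and 26.76 returned by zabbix_get for that timestamp.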

Epilogue

Core developers have to decide whether to retain compatibility with older clients (and thus report system.cpu.util[,user,] with guest time included) or be more like mpstat (which reports user time without guest time). I would prefer the latter.

Comment by Glebs Ivanovskis (Inactive) [ 2017 May 24 ]

Dear plumber, thank you for reviving this discussion.

You claim that:

Your interpretation of man(5) of /proc/stat is incorrect:

Can you point out the place where man 5 proc states the following information?

user includes guest time
nice includes guest_nice time

This is what I meant by references. I wouldn't really want to dive into kernel sources, I believe such information should be available somewhere in the documentation.

I believe that the problem is not with our math skills, but with information we base our calculations on.

Comment by Glebs Ivanovskis (Inactive) [ 2017 May 24 ]

Reopening. Seems this issue needs some more investigation (as a minimum).

Comment by Sergei Turchanov [ 2017 May 24 ]

Can you point out the place where man 5 proc states the following information?

No, it is not stated in the man page; otherwise we wouldn't be having this conversation at all. But you may consult the sources of mpstat / sar (from sysstat), htop, etc.

mpstat/sar:

use common code in pr_stats.c:

__print_funct_t print_cpu_stats(struct activity *a, int prev, int curr,
                                unsigned long long g_itv)
{
..
                        printf("\n%-11s     CPU      %%usr     %%nice      %%sys   %%iowait    %%steal      %%irq     %%soft"
                               "    %%guest    %%gnice     %%idle\n",  timestamp[!curr]);
...
                                /*
                                 * If the CPU is offline then it is omited from /proc/stat:
                                 * All the fields couldn't have been read and the sum of them is zero.
                                 * (Remember that guest/guest_nice times are already included in
                                 * user/nice modes.)
                                 */
...
                                printf("    %6.2f    %6.2f    %6.2f    %6.2f    %6.2f    %6.2f"
                                       "    %6.2f    %6.2f    %6.2f    %6.2f\n",
                                       (scc->cpu_user - scc->cpu_guest) < (scp->cpu_user - scp->cpu_guest) ?
                                       0.0 :
                                       ll_sp_value(scp->cpu_user - scp->cpu_guest,
                                                   scc->cpu_user - scc->cpu_guest, g_itv),
                                       (scc->cpu_nice - scc->cpu_guest_nice) < (scp->cpu_nice - scp->cpu_guest_nice) ?
                                       0.0 :
                                       ll_sp_value(scp->cpu_nice - scp->cpu_guest_nice,
                                                   scc->cpu_nice - scc->cpu_guest_nice, g_itv),
                                       ll_sp_value(scp->cpu_sys, scc->cpu_sys, g_itv),
                                       ll_sp_value(scp->cpu_iowait, scc->cpu_iowait, g_itv),
                                       ll_sp_value(scp->cpu_steal, scc->cpu_steal, g_itv),
                                       ll_sp_value(scp->cpu_hardirq, scc->cpu_hardirq, g_itv),
                                       ll_sp_value(scp->cpu_softirq, scc->cpu_softirq, g_itv),
                                       ll_sp_value(scp->cpu_guest, scc->cpu_guest, g_itv),
                                       ll_sp_value(scp->cpu_guest_nice, scc->cpu_guest_nice, g_itv),
                                       scc->cpu_idle < scp->cpu_idle ?
                                       0.0 :
                                       ll_sp_value(scp->cpu_idle, scc->cpu_idle, g_itv));
...

As you see mpstat/sar reports user time without a guest time. The man page of sar specifically states that reported values:

              %usr
                     Percentage of CPU utilization that occurred while executing at the user level (application). Note that  this  field
                     does NOT include time spent running virtual processors.
...
              %guest
                     Percentage of time spent by the CPU or CPUs to run a virtual processor.

htop

ProcessList.c

void ProcessList_scan(ProcessList* this) {
...
file = fopen(PROCSTATFILE, "r");
...
fgets(buffer, 255, file);
...
   sscanf(buffer, "cpu  %16llu %16llu %16llu %16llu %16llu %16llu %16llu %16llu %16llu %16llu", &usertime, &nicetime, &systemtime, &idletime, &ioWait, &irq, &softIrq, &steal, &guest, &guestnice);
...
// Guest time is already accounted in usertime
      usertime = usertime - guest;
      nicetime = nicetime - guestnice;
...
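
The mpstat and htop snippets above share one idea: remove guest time from user and nice before computing shares. A minimal Python sketch of that approach, under the assumption that two raw /proc/stat snapshots are available as dictionaries (the helper names and sample numbers are hypothetical):

```python
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest", "guest_nice")


def adjusted(counters):
    """Return a copy with guest time removed from user and nice."""
    out = dict(counters)
    # max() guards against a transient negative difference.
    out["user"] = max(counters["user"] - counters["guest"], 0)
    out["nice"] = max(counters["nice"] - counters["guest_nice"], 0)
    return out


def utilization(prev, curr):
    """Per-state percentages between two raw /proc/stat snapshots."""
    p, c = adjusted(prev), adjusted(curr)
    deltas = {f: max(c[f] - p[f], 0) for f in FIELDS}
    total = sum(deltas.values())
    if total == 0:
        return {f: 0.0 for f in FIELDS}
    return {f: 100.0 * d / total for f, d in deltas.items()}


# Hypothetical snapshots: 1000 ticks elapsed, 500 of them running guests.
prev = dict.fromkeys(FIELDS, 0)
curr = dict(prev, user=510, system=60, idle=430, guest=500)
pct = utilization(prev, curr)
print(pct["user"], pct["guest"], pct["idle"])  # prints: 1.0 50.0 43.0
```

With this adjustment the ten states sum to 100%, so they can be stacked in one graph without overlap.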

top

top.c
This was discussed previously in this thread but you neglected to do the math. top doesn't read guest/guest_nice fields so when it computes a percentage the formula is like this (e.g., idle_pct):

idle_pct  = idle / (user + sys + idle + nice + iowait + intr + softintr + steal)

now what you do in zabbix-agent:

idle_pct  = idle / (user + sys + idle + nice + iowait + intr + softintr + steal + guest + guest_nice)

It is impossible to get the same results as top unless you first subtract guest/guest_nice from user/nice:

A / B != A / (B + C) when C != 0
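
To make the inequality concrete, here is a small Python sketch that computes the idle share from the same raw deltas with the two denominators discussed above. The numbers are hypothetical, and "user" is assumed to already contain the guest ticks, as in /proc/stat.

```python
def idle_pct(deltas, fields):
    """Idle share of the total taken over the given fields."""
    total = sum(deltas[f] for f in fields)
    return 100.0 * deltas["idle"] / total


TOP_FIELDS = ("user", "nice", "system", "idle",
              "iowait", "irq", "softirq", "steal")
ALL_FIELDS = TOP_FIELDS + ("guest", "guest_nice")

# Hypothetical raw /proc/stat deltas; "user" already contains the guest ticks.
deltas = {"user": 510, "nice": 0, "system": 60, "idle": 430, "iowait": 0,
          "irq": 0, "softirq": 0, "steal": 0, "guest": 500, "guest_nice": 0}

print(idle_pct(deltas, TOP_FIELDS))   # eight-field denominator, as in top
print(idle_pct(deltas, ALL_FIELDS))   # ten-field denominator, as in the 3.0 agent
```

The first call gives 43.0, while the second gives about 28.7, because the 500 guest ticks enter the denominator twice.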

Kernel code

You have to delve into kernel innards if you want to get the truth as the kernel is the ultimate authority.
(when you read /proc/stat you are reading cpustat[] entries)
kernel/sched/cputime.c

/*
 * Account guest cpu time to a process.
 * @p: the process that the cpu time gets accounted to
 * @cputime: the cpu time spent in virtual machine since the last update
 */
void account_guest_time(struct task_struct *p, u64 cputime)
{
	u64 *cpustat = kcpustat_this_cpu->cpustat;

	/* Add guest time to process. */
	p->utime += cputime;
	account_group_user_time(p, cputime);
	p->gtime += cputime;

	/* Add guest time to cpustat. */
	if (task_nice(p) > 0) {
		cpustat[CPUTIME_NICE] += cputime;
		cpustat[CPUTIME_GUEST_NICE] += cputime;
	} else {
		cpustat[CPUTIME_USER] += cputime;
		cpustat[CPUTIME_GUEST] += cputime;
	}
}

There was an attempt to clarify this (see the archived lkml.org thread and the other messages in it), but it didn't land in the kernel for god-knows-why reasons.
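
The effect of account_guest_time can be mimicked in a few lines of Python. This is only a simulation of the accounting shown above, with hypothetical tick counts, and account_user_time is a simplified stand-in for the kernel's ordinary user-time path.

```python
from collections import Counter


def account_user_time(cpustat, ticks, nice=False):
    """Plain user-mode time lands in one counter only."""
    cpustat["nice" if nice else "user"] += ticks


def account_guest_time(cpustat, ticks, nice=False):
    """Mirror of the kernel logic above: guest time lands in two counters."""
    if nice:
        cpustat["nice"] += ticks
        cpustat["guest_nice"] += ticks
    else:
        cpustat["user"] += ticks
        cpustat["guest"] += ticks


cpustat = Counter()
account_user_time(cpustat, 10)
account_guest_time(cpustat, 500)
cpustat["system"] += 60
cpustat["idle"] += 430

elapsed = 10 + 500 + 60 + 430           # ticks that really passed
naive_total = sum(cpustat.values())     # sum over every column
print(elapsed, naive_total, naive_total - elapsed)  # prints: 1000 1500 500
```

Summing every column overshoots the elapsed time by exactly the guest ticks, which is the surplus that inflates the agent's denominator.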

Comment by Glebs Ivanovskis (Inactive) [ 2017 May 24 ]

Thank you for this wonderful research! That's really a pity that documentation wasn't updated in 9 years.

It's a bit risky to take action based on source code because it's impossible to follow all the development and stay up-to-date with latest changes in all platforms Zabbix needs to support. We don't have enough time to sort out all problems in our own code.

I'm not a decision maker, but I passed your information further. That's all I can do, I'm afraid. Stay optimistic!

Comment by Sergei Turchanov [ 2017 May 25 ]

It's a bit risky to take action based on source code ...

Anyway, you won't deny the fact that the idle_pct values reported by zabbix-agent and by mpstat/sar are different? So there is a bug. You may propose another explanation for that.

As for passing the information further, you have to decide whether to retain compatibility with older clients (and thus report system.cpu.util[,user,] with guest time included) or be more like mpstat/sar (which report user time without guest time). I would prefer the mpstat/sar way.

Comment by Glebs Ivanovskis (Inactive) [ 2017 May 25 ]

I do not deny that values are different. I do not even deny that there could be a bug in Zabbix.

I'm just saying that if you spent some of your energy pushing kernel documentation maintainers it would be beneficial for everyone developing monitoring tools. And would also make us more willing to change calculations in Zabbix agent.

Comment by Austin Cormier [ 2017 Aug 28 ]

The following minimal patch seems to solve the issue of double counting the guest time (using Sergei's advice above). This does not provide the same "user" statistic as top, but gives you a sane stacked graph for CPU utilization when you include the "guest". The idle stat is fixed with this as well.

--- zabbix-3.2.7.orig/src/zabbix_agent/cpustat.c
+++ zabbix-3.2.7/src/zabbix_agent/cpustat.c
@@ -395,6 +395,9 @@ static void	update_cpustats(ZBX_CPUS_STA
 				&counter[ZBX_CPU_STATE_SOFTIRQ], &counter[ZBX_CPU_STATE_STEAL],
 				&counter[ZBX_CPU_STATE_GCPU], &counter[ZBX_CPU_STATE_GNICE]);

+		counter[ZBX_CPU_STATE_USER] -= counter[ZBX_CPU_STATE_GCPU];
+		counter[ZBX_CPU_STATE_NICE] -= counter[ZBX_CPU_STATE_GNICE];
+
 		update_cpu_counters(&pcpus->cpu[idx], counter);
 		cpu_status[idx] = SYSINFO_RET_OK;
 	}

I have attached a CPU utilization graph that shows the result of the change (applied at 9:15 PM).

Comment by Austin Cormier [ 2017 Oct 27 ]

To be honest, before visiting this thread I had never even heard of a separate "guest" statistic since top is the primary tool I've used. I'm sure there are good use cases for knowing the CPU used by the KVM processes as opposed to the hypervisor processes, but I think this is a less-common use case than just knowing the overall utilization.

My vote would be #1, but keeping the separate guest/gnice items. I think this would provide the path of least surprise, remain compatible with Zabbix 2.0, and still provide an avenue to determine non-guest CPU.

Comment by Valdis Kauķis (Inactive) [ 2017 Oct 27 ]

I also recommend the first, "top" choice, as it is conservative and does not break existing graphs and templates. The sum of the first 8 values stays 100%. We now have two extra states, and even more might be introduced in the future; if the 100% formula keeps changing, should everyone change their graphs and templates again? Whoever is interested in guest time can create appropriate calculated items and graphs without breaking compatibility with previous Zabbix agents.

Since r48979, Zabbix 2.5.0 matches neither of the established options, broken in ZBXNEXT-2325.

Comment by Rostislav Palivoda (Inactive) [ 2017 Nov 02 ]

Design day decision: mpstat. All 10 statuses together will be 100%

Comment by Valdis Kauķis (Inactive) [ 2017 Nov 06 ]

Fixed in svn://svn.zabbix.com/branches/dev/ZBX-10710 r74292

Comment by Vladislavs Sokurenko [ 2017 Nov 07 ]

Successfully tested

Comment by Valdis Kauķis (Inactive) [ 2017 Nov 10 ]

Fixed in:

  • pre-3.0.14rc1 r74451
  • pre-3.4.5rc1 r74454
  • pre-4.0.0alpha1 (trunk) r74456