[ZBX-10710] system.cpu.util show incorrect utilization Created: 2016 Apr 26 Updated: 2024 Apr 10 Resolved: 2017 Nov 23 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Agent (G) |
Affects Version/s: | 3.0.2 |
Fix Version/s: | 3.0.14rc1, 3.4.5rc1, 4.0.0alpha1, 4.0 (plan) |
Type: | Problem report | Priority: | Critical |
Reporter: | Dmitry Zykov | Assignee: | Valdis Kauķis (Inactive) |
Resolution: | Fixed | Votes: | 1 |
Labels: | None | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified | ||
Environment: |
zabbix_agentd (daemon) (Zabbix) 3.0.2 CentOS Linux release 7.2.1511 (Core) Linux 3.10.0-327.13.1.el7.x86_64 #1 SMP Thu Mar 31 16:04:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux Intel(R) Xeon(R) CPU D-1540 @ 2.00GHz |
Attachments: |
||||||||||||
Issue Links: |
|
||||||||||||
Team: | |||||||||||||
Sprint: | Sprint 19, Sprint 20, Sprint 21 | ||||||||||||
Story Points: | 6 |
Description |
The Zabbix agent reports CPU utilization that is about half of what the top command shows.
On the agent server:
[root@xxx]# top|grep Cpu
%Cpu(s): 84.6 us, 8.3 sy, 0.0 ni, 5.5 id, 0.0 wa, 0.0 hi, 1.6 si, 0.0 st
%Cpu(s): 82.8 us, 8.7 sy, 0.0 ni, 6.9 id, 0.0 wa, 0.0 hi, 1.6 si, 0.0 st
%Cpu(s): 84.9 us, 9.4 sy, 0.0 ni, 4.0 id, 0.0 wa, 0.0 hi, 1.7 si, 0.0 st
%Cpu(s): 82.7 us, 9.1 sy, 0.0 ni, 6.5 id, 0.0 wa, 0.0 hi, 1.7 si, 0.0 st
%Cpu(s): 84.1 us, 8.1 sy, 0.0 ni, 6.0 id, 0.1 wa, 0.0 hi, 1.7 si, 0.0 st
%Cpu(s): 84.9 us, 8.1 sy, 0.0 ni, 5.6 id, 0.0 wa, 0.0 hi, 1.4 si, 0.0 st
%Cpu(s): 84.3 us, 7.9 sy, 0.0 ni, 6.2 id, 0.0 wa, 0.0 hi, 1.6 si, 0.0 st
At the same time, on the Zabbix server:
[root@yyy bin]# zabbix_get -sxxx -p10050 -k system.cpu.util[]
49.139935
[root@yyy bin]# zabbix_get -sxxx -p10050 -k system.cpu.util[,user]
49.171261
[root@yyy bin]# zabbix_get -sxxx -p10050 -k system.cpu.util[,system]
4.926904
[root@yyy bin]# zabbix_get -sxxx -p10050 -k system.cpu.util[]
49.079076
[root@yyy bin]# zabbix_get -sxxx -p10050 -k system.cpu.util[,user]
49.082120
[root@yyy bin]# zabbix_get -sxxx -p10050 -k system.cpu.util[,system]
4.890176
The same incorrect values appear for another server:
[root@xxx2 ~]# top|grep Cpu
%Cpu(s): 96.6 us, 2.7 sy, 0.0 ni, 0.1 id, 0.0 wa, 0.0 hi, 0.6 si, 0.0 st
%Cpu(s): 96.5 us, 2.8 sy, 0.0 ni, 0.2 id, 0.0 wa, 0.0 hi, 0.5 si, 0.0 st
%Cpu(s): 96.3 us, 3.0 sy, 0.0 ni, 0.1 id, 0.0 wa, 0.0 hi, 0.6 si, 0.0 st
%Cpu(s): 94.3 us, 4.6 sy, 0.0 ni, 0.3 id, 0.0 wa, 0.0 hi, 0.7 si, 0.0 st
%Cpu(s): 96.2 us, 2.7 sy, 0.0 ni, 0.2 id, 0.0 wa, 0.0 hi, 1.0 si, 0.0 st
%Cpu(s): 96.2 us, 2.7 sy, 0.0 ni, 0.4 id, 0.0 wa, 0.0 hi, 0.7 si, 0.0 st
%Cpu(s): 95.2 us, 2.9 sy, 0.0 ni, 1.1 id, 0.0 wa, 0.0 hi, 0.7 si, 0.0 st
%Cpu(s): 95.9 us, 2.7 sy, 0.0 ni, 0.7 id, 0.0 wa, 0.0 hi, 0.6 si, 0.0 st
%Cpu(s): 96.4 us, 2.6 sy, 0.0 ni, 0.5 id, 0.0 wa, 0.0 hi, 0.5 si, 0.0 st
[root@ bin]# zabbix_get -sxxx2 -p10050 -k system.cpu.util[]
49.773875
[root@yyy bin]# zabbix_get -sxxx2 -p10050 -k system.cpu.util[,user]
49.734167
[root@yyy bin]# zabbix_get -sxxx2 -p10050 -k system.cpu.util[,system]
1.461466
[root@yyy bin]# zabbix_get -sxxx2 -p10050 -k system.cpu.util[]
49.714295
[root@yyy bin]# zabbix_get -sxxx2 -p10050 -k system.cpu.util[,user]
49.711913
[root@yyy bin]# zabbix_get -sxxx2 -p10050 -k system.cpu.util[,system]
1.471862
Screenshots of cpu.util from these 2 servers are attached. |
Comments |
Comment by Aleksandrs Saveljevs [ 2016 Apr 27 ] |
Did the issue start after an upgrade? Did it work correctly with a previous version of Zabbix? |
Comment by Dmitry Zykov [ 2016 Apr 27 ] |
This is a new server with the newest agent. I tried older versions of the agent on this server; the bug appeared starting with 3.0.0.
zabbix agent 3.0.1
[root@xxx /]# zabbix_agentd -V
zabbix_agentd (daemon) (Zabbix) 3.0.1
Revision 58734 26 February 2016, compilation time: Feb 28 2016 02:15:42
...
[root@xxx /]# top|grep Cpu
%Cpu(s): 77.6 us, 7.0 sy, 0.0 ni, 14.0 id, 0.0 wa, 0.0 hi, 1.3 si, 0.0 st
%Cpu(s): 79.1 us, 8.1 sy, 0.0 ni, 11.2 id, 0.1 wa, 0.0 hi, 1.4 si, 0.0 st
%Cpu(s): 78.2 us, 7.9 sy, 0.0 ni, 12.5 id, 0.1 wa, 0.0 hi, 1.3 si, 0.0 st
%Cpu(s): 79.1 us, 8.1 sy, 0.0 ni, 11.3 id, 0.1 wa, 0.0 hi, 1.4 si, 0.0 st
%Cpu(s): 76.1 us, 8.2 sy, 0.0 ni, 14.4 id, 0.0 wa, 0.0 hi, 1.2 si, 0.0 st
Zabbix server at this time:
[root@yyy bin]# zabbix_get -sxxx -p10050 -k system.cpu.util[]
45.330975
[root@yyy bin]# zabbix_get -sxxx -p10050 -k system.cpu.util[,user]
45.231920
[root@yyy bin]# zabbix_get -sxxx -p10050 -k system.cpu.util[,system]
5.140191
zabbix agent 3.0.0
[root@xxx /]# zabbix_agentd -V
zabbix_agentd (daemon) (Zabbix) 3.0.0
Revision 58460 15 February 2016, compilation time: Feb 20 2016 04:32:59
[root@xxx /]# top|grep Cpu
%Cpu(s): 24.0 us, 5.3 sy, 0.0 ni, 69.9 id, 0.3 wa, 0.0 hi, 0.6 si, 0.0 st
%Cpu(s): 83.0 us, 8.4 sy, 0.0 ni, 6.8 id, 0.2 wa, 0.0 hi, 1.6 si, 0.0 st
%Cpu(s): 81.9 us, 9.2 sy, 0.0 ni, 6.8 id, 0.4 wa, 0.0 hi, 1.7 si, 0.0 st
%Cpu(s): 81.2 us, 8.0 sy, 0.0 ni, 8.8 id, 0.4 wa, 0.0 hi, 1.6 si, 0.0 st
%Cpu(s): 80.3 us, 8.3 sy, 0.0 ni, 9.6 id, 0.3 wa, 0.0 hi, 1.6 si, 0.0 st
%Cpu(s): 74.5 us, 8.5 sy, 0.0 ni, 14.8 id, 0.5 wa, 0.0 hi, 1.7 si, 0.0 st
%Cpu(s): 78.8 us, 9.7 sy, 0.0 ni, 9.4 id, 0.3 wa, 0.0 hi, 1.8 si, 0.0 st
Zabbix server at this time:
[root@yyy bin]# zabbix_get -sxxx. -p10050 -k system.cpu.util[]
45.510419
[root@yyy bin]# zabbix_get -sxxx. -p10050 -k system.cpu.util[,user]
45.407713
[root@yyy bin]# zabbix_get -sxxx. -p10050 -k system.cpu.util[,system]
4.840851
zabbix agent 2.4.7
[root@xxx /]# zabbix_agentd -V
Zabbix Agent (daemon) v2.4.7 (revision 56694) (12 November 2015)
Compilation time: Nov 13 2015 10:42:17
%Cpu(s): 79.4 us, 8.6 sy, 0.0 ni, 10.0 id, 0.0 wa, 0.0 hi, 2.0 si, 0.0 st
%Cpu(s): 79.7 us, 8.6 sy, 0.0 ni, 9.6 id, 0.1 wa, 0.0 hi, 2.0 si, 0.0 st
%Cpu(s): 80.4 us, 8.1 sy, 0.0 ni, 9.4 id, 0.1 wa, 0.0 hi, 2.0 si, 0.0 st
%Cpu(s): 75.5 us, 9.0 sy, 0.0 ni, 13.4 id, 0.1 wa, 0.0 hi, 2.0 si, 0.0 st
%Cpu(s): 71.3 us, 8.0 sy, 0.0 ni, 18.8 id, 0.1 wa, 0.0 hi, 1.7 si, 0.0 st
%Cpu(s): 76.6 us, 8.4 sy, 0.0 ni, 13.0 id, 0.1 wa, 0.0 hi, 2.0 si, 0.0 st
Zabbix server at this time:
[root@yyy bin]# zabbix_get -sxxx. -p10050 -k system.cpu.util[]
67.133790
[root@yyy bin]# zabbix_get -sxxx. -p10050 -k system.cpu.util[,user]
67.072900
[root@yyy bin]# zabbix_get -sxxx. -p10050 -k system.cpu.util[,system]
8.109336
[root@yyy bin]# zabbix_get -sxxx. -p10050 -k system.cpu.util[]
67.281369
[root@yyy bin]# zabbix_get -sxxx. -p10050 -k system.cpu.util[,user]
67.043161
[root@yyy bin]# zabbix_get -sxxx. -p10050 -k system.cpu.util[,system] |
Comment by Aleksandrs Saveljevs [ 2016 Apr 27 ] |
Thank you for the information! Do you use any virtualization technologies and are CPUs allocated dynamically? |
Comment by Dmitry Zykov [ 2016 Apr 27 ] |
Yes, the main role of these servers is KVM virtualization, where CPUs are allocated dynamically. |
Comment by Aleksandrs Saveljevs [ 2016 Apr 27 ] |
Could you please try to monitor "system.cpu.util[,guest]" and "system.cpu.util[,guest_nice]" items (see https://www.zabbix.com/documentation/3.0/manual/config/items/itemtypes/zabbix_agent )? If you include these items, do the CPU items add up to 100% then? |
Comment by Aleksandrs Saveljevs [ 2016 Apr 27 ] |
Somewhat related issue: ZBX-9786. |
Comment by Dmitry Zykov [ 2016 Apr 28 ] |
zabbix_agentd (daemon) (Zabbix) 3.0.2
Revision 59540 20 April 2016, compilation time: Apr 20 2016 14:42:06
# top|grep Cpu
%Cpu(s): 95.1 us, 3.6 sy, 0.0 ni, 0.6 id, 0.0 wa, 0.0 hi, 0.7 si, 0.0 st
%Cpu(s): 95.7 us, 3.3 sy, 0.0 ni, 0.4 id, 0.0 wa, 0.0 hi, 0.6 si, 0.0 st
%Cpu(s): 96.0 us, 3.1 sy, 0.0 ni, 0.2 id, 0.0 wa, 0.0 hi, 0.7 si, 0.0 st
%Cpu(s): 96.1 us, 2.9 sy, 0.0 ni, 0.4 id, 0.0 wa, 0.0 hi, 0.6 si, 0.0 st
%Cpu(s): 94.0 us, 3.8 sy, 0.0 ni, 1.5 id, 0.0 wa, 0.0 hi, 0.6 si, 0.0 st
%Cpu(s): 95.3 us, 3.2 sy, 0.0 ni, 0.8 id, 0.1 wa, 0.0 hi, 0.7 si, 0.0 st
%Cpu(s): 92.9 us, 4.7 sy, 0.0 ni, 1.3 id, 0.0 wa, 0.0 hi, 1.1 si, 0.0 st
%Cpu(s): 92.0 us, 5.9 sy, 0.0 ni, 1.0 id, 0.0 wa, 0.0 hi, 1.0 si, 0.0 st
%Cpu(s): 95.4 us, 3.5 sy, 0.0 ni, 0.2 id, 0.0 wa, 0.0 hi, 1.0 si, 0.0 st
while true; do echo -n 'total: ' && zabbix_get -sxxx -p10050 -k system.cpu.util[]; echo -n 'user: ' && zabbix_get -sxxx -p10050 -k system.cpu.util[,user]; echo -n 'system: ' && zabbix_get -sxxx -p10050 -k system.cpu.util[,system]; echo -n 'guest: ' && zabbix_get -sxxx -p10050 -k system.cpu.util[,guest]; echo -n 'guest_nice: ' && zabbix_get -sxxx -p10050 -k system.cpu.util[,guest_nice]; echo '------------'; sleep 1; done;
total: 51.249574
user: 51.249574
system: 2.003069
guest: 45.486419
guest_nice: 0.000000
------------
total: 51.242526
user: 51.242526
system: 1.999636
guest: 45.477230
guest_nice: 0.000000
------------
total: 51.212707
user: 51.212707
system: 2.002614
guest: 45.485026
guest_nice: 0.000000
------------ |
Comment by Aleksandrs Saveljevs [ 2016 Apr 28 ] |
Great! So it seems that everything is correct - Zabbix counts "guest" time separately, while "top" seems to add it to "user" time. A fragment from "man proc":
/proc/stat
kernel/system statistics. Varies with architecture. Common entries include:
cpu 3357 0 4313 1362393
The amount of time, measured in units of USER_HZ (1/100ths of a second on most architectures, use sysconf(_SC_CLK_TCK) to obtain the right value), that the system spent in various states:
user (1) Time spent in user mode.
nice (2) Time spent in user mode with low priority (nice).
system (3) Time spent in system mode.
idle (4) Time spent in the idle task. This value should be USER_HZ times the second entry in the /proc/uptime pseudo-file.
iowait (since Linux 2.5.41) (5) Time waiting for I/O to complete.
irq (since Linux 2.6.0-test4) (6) Time servicing interrupts.
softirq (since Linux 2.6.0-test4) (7) Time servicing softirqs.
steal (since Linux 2.6.11) (8) Stolen time, which is the time spent in other operating systems when running in a virtualized environment.
guest (since Linux 2.6.24) (9) Time spent running a virtual CPU for guest operating systems under the control of the Linux kernel.
guest_nice (since Linux 2.6.33) (10) Time spent running a niced guest (virtual CPU for guest operating systems under the control of the Linux kernel). |
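As an illustration of the field layout quoted above, here is a minimal C sketch that reads the ten counters from an aggregate "cpu" line. The names parse_cpu_line and ST_* are hypothetical and are not taken from Zabbix or procps. The man page fragment does not say whether any field overlaps another, so the sketch only parses and makes no claim about how the fields add up.

```c
#include <assert.h>
#include <stdio.h>

/* Field order of the aggregate "cpu" line, as listed in the man page fragment. */
enum {
    ST_USER, ST_NICE, ST_SYSTEM, ST_IDLE, ST_IOWAIT,
    ST_IRQ, ST_SOFTIRQ, ST_STEAL, ST_GUEST, ST_GUEST_NICE,
    ST_COUNT
};

/*
 * Hypothetical helper: parses one aggregate "cpu ..." line from /proc/stat.
 * Returns the number of counters read (0..10). Fields that are absent
 * (older kernels) are left at zero. Per-CPU lines ("cpu0 ...") are not
 * handled by this sketch.
 */
int parse_cpu_line(const char *line, unsigned long long out[ST_COUNT])
{
    int i, n;

    assert(line != NULL && out != NULL);

    for (i = 0; i < ST_COUNT; i++)
        out[i] = 0;

    n = sscanf(line, "cpu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
               &out[ST_USER], &out[ST_NICE], &out[ST_SYSTEM], &out[ST_IDLE],
               &out[ST_IOWAIT], &out[ST_IRQ], &out[ST_SOFTIRQ], &out[ST_STEAL],
               &out[ST_GUEST], &out[ST_GUEST_NICE]);

    /* sscanf returns EOF on empty input; report that as "nothing read". */
    return n < 0 ? 0 : n;
}
```

A caller would read the first line of /proc/stat with fgets() and pass it to parse_cpu_line(); a return value below 10 indicates a kernel that does not expose guest or guest_nice.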
Comment by Dmitry Zykov [ 2016 Apr 28 ] |
I added system.cpu.util[,guest] to the graph, and now it seems OK. A screenshot is attached. But the agent's CPU utilisation type "total (default)" is still wrong: it does not include "guest" time. |
Comment by Dmitry Zykov [ 2016 Apr 28 ] |
And the system time is still about half of what the top output shows. |
Comment by Aleksandrs Saveljevs [ 2016 Apr 28 ] |
Note that according to https://www.zabbix.com/documentation/3.0/manual/config/items/itemtypes/zabbix_agent the default is not "total" (there is no such value), but "user". |
Comment by Aleksandrs Saveljevs [ 2016 Apr 28 ] |
The last part seems to be wrong - if "top" shows the same results as Zabbix 2.4, then "top" simply ignores "guest" time rather than adding it to "user" time. Taking a brief look at the "top" source code at http://procps.sourceforge.net/index.html seems to confirm this. The program only reads 8 values from /proc/stat:
num = sscanf(buf, "cpu %Lu %Lu %Lu %Lu %Lu %Lu %Lu %Lu",
&cpus[Cpu_tot].u,
&cpus[Cpu_tot].n,
&cpus[Cpu_tot].s,
&cpus[Cpu_tot].i,
&cpus[Cpu_tot].w,
&cpus[Cpu_tot].x,
&cpus[Cpu_tot].y,
&cpus[Cpu_tot].z
);
Zabbix, since version 3.0, reads 10 values:
sscanf(line, "%*s " ZBX_FS_UI64 " " ZBX_FS_UI64 " " ZBX_FS_UI64 " " ZBX_FS_UI64 " " ZBX_FS_UI64 " " ZBX_FS_UI64 " " ZBX_FS_UI64 " " ZBX_FS_UI64 " " ZBX_FS_UI64 " " ZBX_FS_UI64,
&counter[ZBX_CPU_STATE_USER],
&counter[ZBX_CPU_STATE_NICE],
&counter[ZBX_CPU_STATE_SYSTEM],
&counter[ZBX_CPU_STATE_IDLE],
&counter[ZBX_CPU_STATE_IOWAIT],
&counter[ZBX_CPU_STATE_INTERRUPT],
&counter[ZBX_CPU_STATE_SOFTIRQ],
&counter[ZBX_CPU_STATE_STEAL],
&counter[ZBX_CPU_STATE_GCPU],
&counter[ZBX_CPU_STATE_GNICE]);
So it currently seems like there is nothing to fix on the Zabbix side. |
Comment by Dmitry Zykov [ 2016 Apr 28 ] |
Thank you for the help! I'm closing the issue. |
Comment by Sergei Turchanov [ 2017 May 24 ] |
Your interpretation of man(5) of /proc/stat is incorrect : It is done for compatibility with legacy software which reads all but the 'guest' fields. So when Zabbix computes the percentage of 'user', 'sys', 'guest', 'idle', etc., you account guest time (and guest nice) TWICE.
First of all, a PROOF:
user, guest, idle queried by zabbix agent
$ while sleep 30; do echo `date` " user: " `zabbix_get -s vserver6 -k 'system.cpu.util[,user,]'`", guest: " `zabbix_get -s vserver6 -k 'system.cpu.util[,guest,]'` ", idle: " `zabbix_get -s vserver6 -k 'system.cpu.util[,idle,]'`; done
Wed May 24 12:01:25 +10 2017 user: 34.077793, guest: 33.849551 , idle: 26.761020
Wed May 24 12:01:55 +10 2017 user: 32.455763, guest: 32.267145 , idle: 29.687551
Wed May 24 12:02:25 +10 2017 user: 32.534687, guest: 32.313557 , idle: 29.268319
Wed May 24 12:02:55 +10 2017 user: 33.791551, guest: 33.507177 , idle: 26.607283
Wed May 24 12:03:25 +10 2017 user: 33.584713, guest: 33.310416 , idle: 27.115921
Wed May 24 12:03:55 +10 2017 user: 33.054221, guest: 32.804419 , idle: 28.190625
Wed May 24 12:04:25 +10 2017 user: 34.233825, guest: 33.984393 , idle: 25.854251
Wed May 24 12:04:55 +10 2017 user: 33.676474, guest: 33.417607 , idle: 27.031996
user, guest, idle queried by mpstat
NOTE: mpstat subtracts guest time from the user time read from /proc/stat (same for guest nice), so it reports real user time.
$ mpstat 30
Linux 3.10.0-229.14.1.el7.x86_64 (vserver6.akod.loc) 05/24/2017 _x86_64_ (32 CPU)
12:00:55 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
12:01:25 PM all 0.31 0.03 6.29 0.75 0.00 1.32 0.00 49.12 0.00 42.18
12:01:55 PM all 0.25 0.03 6.34 0.53 0.00 1.30 0.00 46.52 0.00 45.03
12:02:25 PM all 0.41 0.03 6.96 0.85 0.00 1.43 0.00 49.45 0.00 40.86
12:02:55 PM all 0.45 0.03 6.94 0.60 0.00 1.51 0.00 51.24 0.00 39.23
12:03:25 PM all 0.36 0.03 6.64 0.74 0.00 1.44 0.00 48.44 0.00 42.34
12:03:55 PM all 0.38 0.02 6.95 0.49 0.00 1.42 0.00 49.53 0.00 41.20
12:04:25 PM all 0.38 0.03 6.95 0.64 0.00 1.51 0.00 53.71 0.00 36.78
12:04:55 PM all 0.40 0.03 6.50 0.60 0.00 1.35 0.00 46.18 0.00 44.95
As you see
Explanation
When zabbix-agent computes the percentage of a requested metric (user, guest, etc.) in src/zabbix_agent/cpustat.c:get_cpustat, it divides the counter value for the metric by a total computed from all values read from /proc/stat. For example:
idle_pct = IDLE / (USER + SYS + IDLE + GUEST + ... )
= idle / ([user + guest] + sys + idle + guest + ...)
= idle / (user + sys + idle + 2 * guest + ...)
UPPERCASE - values read from /proc/stat (... same for the guest nice btw)
Epilogue
Core developers have to decide whether to retain compatibility with older clients (and thus report system.cpu.util[,user,] with guest time included) or be more like mpstat (which reports user time without guest time). I would prefer the latter. |
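To make the arithmetic in the explanation above concrete, the following C sketch computes the idle percentage both ways. It assumes, as argued in this comment, that the raw user and nice counters already include guest and guest_nice. The function names and the F_* index constants are illustrative assumptions, not Zabbix code.

```c
#include <assert.h>

/* Assumed field order, matching /proc/stat:
 * user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice */
enum {
    F_USER, F_NICE, F_SYSTEM, F_IDLE, F_IOWAIT,
    F_IRQ, F_SOFTIRQ, F_STEAL, F_GUEST, F_GUEST_NICE,
    F_COUNT
};

/* Plain sum of all ten raw fields. */
static double sum_fields(const double raw[F_COUNT])
{
    double total = 0.0;
    int i;

    for (i = 0; i < F_COUNT; i++)
        total += raw[i];
    return total;
}

/* Idle share when all ten raw fields are summed, so guest time is counted
 * once inside user and once more on its own. */
double idle_pct_double_counted(const double raw[F_COUNT])
{
    double total = sum_fields(raw);

    assert(total > 0.0);
    return 100.0 * raw[F_IDLE] / total;
}

/* Idle share when guest and guest_nice are removed from the total once. */
double idle_pct_corrected(const double raw[F_COUNT])
{
    double total = sum_fields(raw) - raw[F_GUEST] - raw[F_GUEST_NICE];

    assert(total > 0.0);
    return 100.0 * raw[F_IDLE] / total;
}
```

Feeding in the first mpstat row above (raw user = 0.31 + 49.12, guest = 49.12, idle = 42.18) gives roughly 28.3% idle with the double-counted total and 42.18% with the corrected one. The first figure is in the same range as the 26.8% the agent reported at that timestamp; the sampling windows of the two tools differ, so an exact match is not expected.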
Comment by Glebs Ivanovskis (Inactive) [ 2017 May 24 ] |
Dear plumber, thank you for reviving this discussion. You claim that:
Can you point out the place where man 5 proc states the following information?
This is what I meant by references. I wouldn't really want to dive into kernel sources, I believe such information should be available somewhere in the documentation. I believe that the problem is not with our math skills, but with information we base our calculations on. |
Comment by Glebs Ivanovskis (Inactive) [ 2017 May 24 ] |
Reopening. Seems this issue needs some more investigation (as a minimum). |
Comment by Sergei Turchanov [ 2017 May 24 ] |
No, it is not stated in the man page; otherwise we wouldn't be having this conversation at all. But you can consult the sources of mpstat / sar (from sysstat), htop, etc.
mpstat/sar
Both use common code in pr_stats.c:
__print_funct_t print_cpu_stats(struct activity *a, int prev, int curr, unsigned long long g_itv)
{
..
printf("\n%-11s CPU %%usr %%nice %%sys %%iowait %%steal %%irq %%soft"
" %%guest %%gnice %%idle\n", timestamp[!curr]);
...
/*
 * If the CPU is offline then it is omited from /proc/stat:
 * All the fields couldn't have been read and the sum of them is zero.
 * (Remember that guest/guest_nice times are already included in
 * user/nice modes.)
 */
...
printf(" %6.2f %6.2f %6.2f %6.2f %6.2f %6.2f"
" %6.2f %6.2f %6.2f %6.2f\n",
(scc->cpu_user - scc->cpu_guest) < (scp->cpu_user - scp->cpu_guest) ?
0.0 :
ll_sp_value(scp->cpu_user - scp->cpu_guest, scc->cpu_user - scc->cpu_guest, g_itv),
(scc->cpu_nice - scc->cpu_guest_nice) < (scp->cpu_nice - scp->cpu_guest_nice) ?
0.0 :
ll_sp_value(scp->cpu_nice - scp->cpu_guest_nice, scc->cpu_nice - scc->cpu_guest_nice, g_itv),
ll_sp_value(scp->cpu_sys, scc->cpu_sys, g_itv),
ll_sp_value(scp->cpu_iowait, scc->cpu_iowait, g_itv),
ll_sp_value(scp->cpu_steal, scc->cpu_steal, g_itv),
ll_sp_value(scp->cpu_hardirq, scc->cpu_hardirq, g_itv),
ll_sp_value(scp->cpu_softirq, scc->cpu_softirq, g_itv),
ll_sp_value(scp->cpu_guest, scc->cpu_guest, g_itv),
ll_sp_value(scp->cpu_guest_nice, scc->cpu_guest_nice, g_itv),
scc->cpu_idle < scp->cpu_idle ?
0.0 :
ll_sp_value(scp->cpu_idle, scc->cpu_idle, g_itv));
...
As you can see, mpstat/sar report user time without guest time. The man page of sar specifically states this for the reported values:
%usr Percentage of CPU utilization that occurred while executing at the user level (application). Note that this field does NOT include time spent running virtual processors.
...
%guest Percentage of time spent by the CPU or CPUs to run a virtual processor.
htop
void ProcessList_scan(ProcessList* this) {
...
file = fopen(PROCSTATFILE, "r");
...
fgets(buffer, 255, file);
...
sscanf(buffer, "cpu %16llu %16llu %16llu %16llu %16llu %16llu %16llu %16llu %16llu %16llu", &usertime, &nicetime, &systemtime, &idletime, &ioWait, &irq, &softIrq, &steal, &guest, &guestnice);
...
// Guest time is already accounted in usertime
usertime = usertime - guest;
nicetime = nicetime - guestnice;
...
top
top.c:
idle_pct = idle / (user + sys + idle + nice + iowait + intr + softintr + steal)
Now, what you do in zabbix-agent:
idle_pct = idle / (user + sys + idle + nice + iowait + intr + softintr + steal + guest + guest_nice)
It is impossible to get the same results as top unless you first subtract guest/guest_nice from user/nice:
A / B != A / (B + C) when C != 0
Kernel code
You have to delve into kernel innards if you want to get the truth, as the kernel is the ultimate authority.
/*
 * Account guest cpu time to a process.
 * @p: the process that the cpu time gets accounted to
 * @cputime: the cpu time spent in virtual machine since the last update
 */
void account_guest_time(struct task_struct *p, u64 cputime)
{
u64 *cpustat = kcpustat_this_cpu->cpustat;
/* Add guest time to process. */
p->utime += cputime;
account_group_user_time(p, cputime);
p->gtime += cputime;
/* Add guest time to cpustat. */
if (task_nice(p) > 0) {
cpustat[CPUTIME_NICE] += cputime;
cpustat[CPUTIME_GUEST_NICE] += cputime;
} else {
cpustat[CPUTIME_USER] += cputime;
cpustat[CPUTIME_GUEST] += cputime;
}
}
There was an attempt to clarify this in an lkml.org archived thread (see other messages in that thread), but it didn't land in the kernel for god-knows-why reasons. |
Comment by Glebs Ivanovskis (Inactive) [ 2017 May 24 ] |
Thank you for this wonderful research! That's really a pity that documentation wasn't updated in 9 years. It's a bit risky to take action based on source code because it's impossible to follow all the development and stay up-to-date with latest changes in all platforms Zabbix needs to support. We don't have enough time to sort out all problems in our own code. I'm not a decision maker, but I passed your information further. That's all I can do, I'm afraid. Stay optimistic! |
Comment by Sergei Turchanov [ 2017 May 25 ] |
Anyway, you won't deny the fact that the idle_pct values reported by zabbix-agent and by mpstat/sar are different? So there is a bug. You may propose another explanation for that. As for passing the information further, you have to decide whether to retain compatibility with older clients (and thus report system.cpu.util[,user,] with guest time included) or be more like mpstat/sar (which report user time without guest time). I would prefer the mpstat/sar way. |
Comment by Glebs Ivanovskis (Inactive) [ 2017 May 25 ] |
I do not deny that values are different. I do not even deny that there could be a bug in Zabbix. I'm just saying that if you spent some of your energy pushing kernel documentation maintainers it would be beneficial for everyone developing monitoring tools. And would also make us more willing to change calculations in Zabbix agent. |
Comment by Austin Cormier [ 2017 Aug 28 ] |
The following minimal patch seems to solve the issue of double counting the guest time (using Sergei's advice above). This does not provide the same "user" statistic as top, but gives you a sane stacked graph for CPU utilization when you include the "guest". The idle stat is fixed with this as well.
--- zabbix-3.2.7.orig/src/zabbix_agent/cpustat.c
+++ zabbix-3.2.7/src/zabbix_agent/cpustat.c
@@ -395,6 +395,9 @@ static void update_cpustats(ZBX_CPUS_STA
 &counter[ZBX_CPU_STATE_SOFTIRQ], &counter[ZBX_CPU_STATE_STEAL],
 &counter[ZBX_CPU_STATE_GCPU], &counter[ZBX_CPU_STATE_GNICE]);
 
+ counter[ZBX_CPU_STATE_USER] -= counter[ZBX_CPU_STATE_GCPU];
+ counter[ZBX_CPU_STATE_NICE] -= counter[ZBX_CPU_STATE_GNICE];
+
 update_cpu_counters(&pcpus->cpu[idx], counter);
 cpu_status[idx] = SYSINFO_RET_OK;
 }
Attached is the CPU utilization graph which shows the result of the change (the change was applied at 9:15pm). |
Comment by Austin Cormier [ 2017 Oct 27 ] |
To be honest, before visiting this thread I had never even heard of a separate "guest" statistic, since top is the primary tool I've used. I'm sure there are good use cases for knowing the CPU used by the KVM processes as opposed to the hypervisor processes, but I think this is a less common use case than just knowing the overall utilization. My vote would be #1, but keeping the separate guest/gnice items. I think this would provide the path of least surprise, remain compatible with Zabbix 2.0, and still provide an avenue to determine non-guest CPU. |
Comment by Valdis Kauķis (Inactive) [ 2017 Oct 27 ] |
I also recommend the first, "top" choice, as it is conservative and does not break existing graphs and templates. The sum of the first 8 values stays 100%. We now have two extra states, and even more might be introduced in the future. If the 100% formula keeps changing, should everyone change their graphs and templates again? Whoever is interested in guest time can create appropriate calculated items and graphs without breaking compatibility with previous Zabbix agents. Since r48979, Zabbix 2.5.0 matches neither of the established options, broken in |
Comment by Rostislav Palivoda (Inactive) [ 2017 Nov 02 ] |
Design day decision: mpstat. All 10 statuses together will be 100% |
Comment by Valdis Kauķis (Inactive) [ 2017 Nov 06 ] |
Fixed in svn://svn.zabbix.com/branches/dev/ZBX-10710 r74292 |
Comment by Vladislavs Sokurenko [ 2017 Nov 07 ] |
Successfully tested |
Comment by Valdis Kauķis (Inactive) [ 2017 Nov 10 ] |
Fixed in:
|