[ZBX-11902] CPU monitoring issues in AIX 7.1 Created: 2017 Mar 13 Updated: 2018 Oct 09 Resolved: 2017 Oct 30 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Agent (G) |
Affects Version/s: | None |
Fix Version/s: | 3.0.13rc1, 3.2.10rc1, 3.4.4rc1, 4.0.0alpha1, 4.0 (plan) |
Type: | Problem report | Priority: | Blocker |
Reporter: | Kim Jongkwon | Assignee: | Viktors Tjarve |
Resolution: | Fixed | Votes: | 0 |
Labels: | None |
Remaining Estimate: | Not Specified |
Time Spent: | Not Specified |
Original Estimate: | Not Specified |
Environment: |
AIX 7.1 (7100-04) |
Issue Links: |
Team: | Team A |
Sprint: | Sprint 4, Sprint 5, Sprint 6, Sprint 7, Sprint 8, Sprint 9, Sprint 10, Sprint 11, Sprint 12, Sprint 13, Sprint 14, Sprint 15, Sprint 16, Sprint 17, Sprint 18, Sprint 19, Sprint 20 |
Story Points: | 4 |
Description |
mpstat and sar commands to check the CPU utilisation, but the results do not match.

1. Discovery of CPU cores
[{"data":[
{"{#CPU.NUMBER}":0,"{#CPU.STATUS}":"online"},
{"{#CPU.NUMBER}":1,"{#CPU.STATUS}":"online"},
{"{#CPU.NUMBER}":2,"{#CPU.STATUS}":"online"},
{"{#CPU.NUMBER}":3,"{#CPU.STATUS}":"online"},
{"{#CPU.NUMBER}":4,"{#CPU.STATUS}":"online"},
{"{#CPU.NUMBER}":5,"{#CPU.STATUS}":"offline"},
{"{#CPU.NUMBER}":6,"{#CPU.STATUS}":"offline"},
{"{#CPU.NUMBER}":7,"{#CPU.STATUS}":"offline"}
]}]

2. Discovered CPU items by LLD
cpu(0~3) is fine.
cpu(4) returns a strange value: cpu.util idle reports only '0%'.
system.cpu.util[4,idle,avg1]
cpu(5~7) is not supported: 'Cannot obtain CPU information.'
system.cpu.util[5,idle,avg1]
system.cpu.util[6,idle,avg1]
system.cpu.util[7,idle,avg1]
|
Comments |
Comment by Andris Mednis [ 2017 Jul 14 ] |
I tried to investigate in version 3.0.10rc1 why system.cpu.discovery[] discovers one more CPU than expected.

$ mpstat
System configuration: lcpu=4 ent=0.2 mode=Uncapped
cpu min maj mpc int cs ics rq mig lpa sysc us sy wa id pc %ec lcs
0 40399094 90876 420 85977844 37275265 664954 1 67467 100 130450383 34 49 0 17 0.00 0.5 66159248
1 2502320 6055 143 8423769 312954 74135 0 58083 100 3079389 18 27 0 54 0.00 0.1 8613174
2 20462156 39412 140 9474179 274846 67835 0 53635 100 21821343 68 23 0 9 0.00 0.2 9699004
3 604136 2211 141 3680381 102788 40736 0 40328 100 550318 14 26 0 60 0.00 0.0 3729658
U - - - - - - - - - - - - 0 99 0.20 99.2 -
ALL 63967706 138554 844 107556173 37965853 847660 1 219513 100 155901433 0 0 0 99 0.00 0.8 88201084

Zabbix correctly reports number of CPUs:
$ /usr/local/bin/zabbix_get -s localhost -k system.cpu.num
4

However LLD discovers 5 CPUs in state "online" (reformatted for readability):
$ /usr/local/bin/zabbix_get -s localhost -k system.cpu.discovery
{"data":[
{"{#CPU.NUMBER}":0,"{#CPU.STATUS}":"online"},
{"{#CPU.NUMBER}":1,"{#CPU.STATUS}":"online"},
{"{#CPU.NUMBER}":2,"{#CPU.STATUS}":"online"},
{"{#CPU.NUMBER}":3,"{#CPU.STATUS}":"online"},
{"{#CPU.NUMBER}":4,"{#CPU.STATUS}":"online"},
{"{#CPU.NUMBER}":5,"{#CPU.STATUS}":"offline"},
{"{#CPU.NUMBER}":6,"{#CPU.STATUS}":"offline"},
{"{#CPU.NUMBER}":7,"{#CPU.STATUS}":"offline"},
{"{#CPU.NUMBER}":8,"{#CPU.STATUS}":"offline"},
{"{#CPU.NUMBER}":9,"{#CPU.STATUS}":"offline"},
{"{#CPU.NUMBER}":10,"{#CPU.STATUS}":"offline"},
{"{#CPU.NUMBER}":11,"{#CPU.STATUS}":"offline"},
{"{#CPU.NUMBER}":12,"{#CPU.STATUS}":"offline"},
{"{#CPU.NUMBER}":13,"{#CPU.STATUS}":"offline"},
{"{#CPU.NUMBER}":14,"{#CPU.STATUS}":"offline"},
{"{#CPU.NUMBER}":15,"{#CPU.STATUS}":"offline"}
]}

Running script
/usr/local/bin/zabbix_get -s localhost -k system.cpu.util[0,idle,avg5]
/usr/local/bin/zabbix_get -s localhost -k system.cpu.util[1,idle,avg5]
/usr/local/bin/zabbix_get -s localhost -k system.cpu.util[2,idle,avg5]
/usr/local/bin/zabbix_get -s localhost -k system.cpu.util[3,idle,avg5]
/usr/local/bin/zabbix_get -s localhost -k system.cpu.util[4,idle,avg5]
/usr/local/bin/zabbix_get -s localhost -k system.cpu.util[5,idle,avg5]
/usr/local/bin/zabbix_get -s localhost -k system.cpu.util[6,idle,avg5]
/usr/local/bin/zabbix_get -s localhost -k system.cpu.util[7,idle,avg5]
/usr/local/bin/zabbix_get -s localhost -k system.cpu.util[8,idle,avg5]
/usr/local/bin/zabbix_get -s localhost -k system.cpu.util[9,idle,avg5]
/usr/local/bin/zabbix_get -s localhost -k system.cpu.util[10,idle,avg5]
/usr/local/bin/zabbix_get -s localhost -k system.cpu.util[11,idle,avg5]
/usr/local/bin/zabbix_get -s localhost -k system.cpu.util[12,idle,avg5]
/usr/local/bin/zabbix_get -s localhost -k system.cpu.util[13,idle,avg5]
/usr/local/bin/zabbix_get -s localhost -k system.cpu.util[14,idle,avg5]
/usr/local/bin/zabbix_get -s localhost -k system.cpu.util[15,idle,avg5]
produces
98.371562
99.763491
99.833444
100.000000
0.000000 <---- Kim already reported about this strange value.
ZBX_NOTSUPPORTED: Cannot obtain CPU information.
ZBX_NOTSUPPORTED: Cannot obtain CPU information.
ZBX_NOTSUPPORTED: Cannot obtain CPU information.
ZBX_NOTSUPPORTED: Cannot obtain CPU information.
ZBX_NOTSUPPORTED: Cannot obtain CPU information.
ZBX_NOTSUPPORTED: Cannot obtain CPU information.
ZBX_NOTSUPPORTED: Cannot obtain CPU information.
ZBX_NOTSUPPORTED: Cannot obtain CPU information.
ZBX_NOTSUPPORTED: Cannot obtain CPU information.
ZBX_NOTSUPPORTED: Cannot obtain CPU information.
ZBX_NOTSUPPORTED: Cannot obtain CPU information.

perfstat_cpu() if called as perfstat_cpu(NULL, NULL, sizeof(ps_cpu), 0) returns 4 as number of available CPU statistics - that is correct.
The wrong status of "cpu4" is set in update_cpustats() as shown with additional debug:
7536656:20170714:091126.519 in update_cpustats(): idx=1 ps_id.name=cpu0
7536656:20170714:091126.519 in update_cpustats(): idx=2 ps_id.name=cpu1
7536656:20170714:091126.519 in update_cpustats(): idx=3 ps_id.name=cpu2
7536656:20170714:091126.519 in update_cpustats(): idx=4 ps_id.name=cpu3
7536656:20170714:091126.520 in update_cpustats(): idx=5 ps_id.name=cpu4 <--- So perfstat_cpu() returned data for 'cpu4', which was expected to be "offline"
7536656:20170714:091126.520 in update_cpustats(): idx=6 ps_id.name=cpu5
7536656:20170714:091126.520 in update_cpustats(): idx=6 ps_id.name=cpu5 perfstat_cpu() returned -1, will set this CPU status to OFFLINE
7536656:20170714:091126.520 in update_cpustats(): idx=7 ps_id.name=cpu6
7536656:20170714:091126.520 in update_cpustats(): idx=7 ps_id.name=cpu6 perfstat_cpu() returned -1, will set this CPU status to OFFLINE
.... |
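The mismatch described in this comment (system.cpu.num returns 4 while LLD lists 5 CPUs as "online") can be checked mechanically. The following is only a rough sketch: the function names and the SAMPLE string are made up for illustration, and it assumes that online CPUs are expected to be numbered 0 to cpu_num - 1.

```python
import json

# Illustrative payload modelled on the discovery output quoted in this issue.
SAMPLE = json.dumps({"data": [
    {"{#CPU.NUMBER}": n, "{#CPU.STATUS}": "online" if n <= 4 else "offline"}
    for n in range(8)
]})


def online_cpus(discovery_json):
    """Return the CPU numbers that an LLD payload reports as online."""
    payload = json.loads(discovery_json)
    # The issue shows the payload both as a bare object and wrapped in a list;
    # accept either form.
    if isinstance(payload, list):
        payload = payload[0]
    return [row["{#CPU.NUMBER}"] for row in payload["data"]
            if row["{#CPU.STATUS}"] == "online"]


def surplus_online(discovery_json, cpu_num):
    """CPU numbers reported online that lie beyond the system.cpu.num count.

    Assumption: online CPUs are expected to be numbered 0..cpu_num-1.
    """
    return [n for n in online_cpus(discovery_json) if n >= cpu_num]
```

Called as surplus_online(SAMPLE, 4), this would single out CPU 4, the entry reported with the strange idle value.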
Comment by Andris Mednis [ 2017 Jul 14 ] |
To fix LLD I propose to change update_cpu_counters() to use only 2 calls of perfstat_cpu(): at first get number of CPU statistics, then get all of them in the second call. |
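The two-call idea can be illustrated with a small simulation. This is not the agent's C code and not the real perfstat API: fake_perfstat_cpu is a hypothetical stand-in that only mimics the "ask for the count, then fetch everything" shape, and cpu_statuses is an illustrative helper.

```python
def fake_perfstat_cpu(first_name, capacity,
                      _names=("cpu0", "cpu1", "cpu2", "cpu3")):
    """Hypothetical stand-in for the AIX call, NOT the real API.

    With capacity 0 it only reports how many CPU entries exist;
    otherwise it returns up to `capacity` entries. `first_name` is
    accepted only to mirror the call shape and is ignored here.
    """
    if capacity == 0:
        return len(_names), []
    entries = [{"name": n} for n in _names[:capacity]]
    return len(entries), entries


def cpu_statuses(query, max_cpus):
    """Derive per-CPU status from exactly two queries."""
    count, _ = query(None, 0)      # call 1: number of available CPU statistics
    _, entries = query("", count)  # call 2: fetch all of them at once
    online = {int(e["name"][len("cpu"):]) for e in entries}
    # Any slot that was not returned by the second call is treated as offline.
    return ["online" if i in online else "offline" for i in range(max_cpus)]
```

With the stand-in above, cpu_statuses(fake_perfstat_cpu, 8) marks slots 0 to 3 as online and the rest as offline, so a per-name probe of "cpu4" never takes place.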
Comment by Andrea Biscuola (Inactive) [ 2017 Oct 19 ] |
Hi all,

It seems to me that the Zabbix agent doesn't properly handle the AIX hypervisor's ability to automatically scale the hardware assigned to a virtual machine. Also, when the Zabbix agent is monitoring an AIX LPAR or WPAR, it should report details only of the hardware assigned to the VM and NOT of the whole machine. IMHO, we should check the documentation of the perfstat APIs we are using (because we are probably doing it wrong) and re-implement the CPU checks for AIX. |
Comment by Andrea Biscuola (Inactive) [ 2017 Oct 19 ] |
After more investigation: the perfstat APIs have a dedicated series of calls for gathering statistics from the current WPAR (which is what we want). In my opinion, the Zabbix agent should be modified for:
1 - Identifying whether we are inside a WPAR or not
Currently we are using calls that are meant for the global environment (and not for monitoring a single VM). |
Comment by Andris Zeila [ 2017 Oct 19 ] |
I agree about monitoring only the hardware assigned to the current VM. However, I'm not sure how to deal with possible hardware changes. Zabbix keeps CPU load statistics for the last 15 minutes. If the CPUs start changing, the cached data would have to be invalidated. If the CPUs change often, we cannot provide such statistics. |
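The concern about cached data can be made concrete with a small model. This is only a sketch and not the agent's actual collector: the class name is made up, and one sample per second (900 samples for 15 minutes) is an assumption.

```python
from collections import deque


class CpuIdleHistory:
    """Rolling window of idle samples, dropped when the CPU count changes."""

    def __init__(self, max_samples=900):  # 15 min at 1 sample/s (assumed)
        self.samples = deque(maxlen=max_samples)
        self.cpu_count = None

    def add(self, cpu_count, idle_percent):
        # A change in the number of CPUs makes older samples incomparable,
        # so the cached history is invalidated before the new sample is stored.
        if self.cpu_count is not None and cpu_count != self.cpu_count:
            self.samples.clear()
        self.cpu_count = cpu_count
        self.samples.append(idle_percent)

    def average(self):
        """Mean of the cached samples, or None if nothing is cached."""
        if not self.samples:
            return None
        return sum(self.samples) / len(self.samples)
```

If the CPU count changed every few seconds, the window would be cleared before it could fill, which is the situation where a 15-minute average could not be provided.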
Comment by Andrea Biscuola (Inactive) [ 2017 Oct 23 ] |
Partial fix in svn://svn.zabbix.com/branches/dev/ZBX-11902
The zabbix agent was previously counting the number of CPUs
Also, the fix does not address the changes required for monitoring |
Comment by Andris Mednis [ 2017 Oct 24 ] |
Thanks, Andrea ! Successfully tested. |
Comment by Andris Mednis [ 2017 Oct 24 ] |
I think the system.cpu.num[] change (earlier it returned the number of physical CPUs, now it returns the number of logical CPUs) should be documented in "What's new" and "Upgrade notes". |
Comment by Andrea Biscuola (Inactive) [ 2017 Oct 24 ] |
Released in:
|
Comment by Andrea Biscuola (Inactive) [ 2017 Oct 24 ] |
As Andris pointed out, the system.cpu.num value is now related to the logical |
Comment by Martins Valkovskis [ 2017 Oct 25 ] |
(1) [D] Updated documentation: RESOLVED abs Documentation looks good. CLOSED |
Comment by Alexander Vladishev [ 2017 Oct 30 ] |
Sub-issue (1) still open. abs, please have a look. |