[ZBX-11902] CPU monitoring issues in AIX 7.1 Created: 2017 Mar 13  Updated: 2024 Apr 10  Resolved: 2017 Oct 30

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Agent (G)
Affects Version/s: None
Fix Version/s: 3.0.13rc1, 3.2.10rc1, 3.4.4rc1, 4.0.0alpha1, 4.0 (plan)

Type: Problem report Priority: Blocker
Reporter: Kim Jongkwon Assignee: Viktors Tjarve
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

AIX 7.1 (7100-04)
Zabbix Agent 3.0.5


Issue Links:
Causes
causes ZBX-13645 Agent does not compile and run on AIX... Closed
Duplicate
Sub-task
Team: Team A
Sprint: Sprint 4, Sprint 5, Sprint 6, Sprint 7, Sprint 8, Sprint 9, Sprint 10, Sprint 11, Sprint 12, Sprint 13, Sprint 14, Sprint 15, Sprint 16, Sprint 17, Sprint 18, Sprint 19, Sprint 20
Story Points: 4

 Description   

I used the mpstat and sar commands to check CPU utilisation, but their results do not match what Zabbix reports.
Zabbix also returned 8 cores on a 4-core CPU.

1. Discovery of CPU cores

  • system.cpu.discovery
[{"data":[
{"{#CPU.NUMBER}":0,"{#CPU.STATUS}":"online"},
{"{#CPU.NUMBER}":1,"{#CPU.STATUS}":"online"},
{"{#CPU.NUMBER}":2,"{#CPU.STATUS}":"online"},
{"{#CPU.NUMBER}":3,"{#CPU.STATUS}":"online"},
{"{#CPU.NUMBER}":4,"{#CPU.STATUS}":"online"},
{"{#CPU.NUMBER}":5,"{#CPU.STATUS}":"offline"},
{"{#CPU.NUMBER}":6,"{#CPU.STATUS}":"offline"},
{"{#CPU.NUMBER}":7,"{#CPU.STATUS}":"offline"}
]}]

2. CPU items discovered by LLD

  • system.cpu.util[<core>,idle,avg1]

cpu(0~3) are fine.

cpu(4) returns a strange value: cpu.util idle is always '0%'.

system.cpu.util[4,idle,avg1]

cpu(5~7) are not supported: 'Cannot obtain CPU information.'

system.cpu.util[5,idle,avg1]
system.cpu.util[6,idle,avg1]
system.cpu.util[7,idle,avg1]

FYI. There is no problem in the AIX 6.1 environment.
Sorry, I'm not sure about this issue on AIX 6.1.



 Comments   
Comment by Andris Mednis [ 2017 Jul 14 ]

I tried to investigate in version 3.0.10rc1 why system.cpu.discovery[] discovers one more CPU than expected.
On a system with 4 CPUs:

$ mpstat

System configuration: lcpu=4 ent=0.2 mode=Uncapped 

cpu      min    maj mpc       int       cs    ics rq    mig lpa      sysc us sy wa id   pc  %ec      lcs
  0 40399094  90876 420  85977844 37275265 664954  1  67467 100 130450383 34 49  0 17 0.00  0.5 66159248
  1  2502320   6055 143   8423769   312954  74135  0  58083 100   3079389 18 27  0 54 0.00  0.1  8613174
  2 20462156  39412 140   9474179   274846  67835  0  53635 100  21821343 68 23  0  9 0.00  0.2  9699004
  3   604136   2211 141   3680381   102788  40736  0  40328 100    550318 14 26  0 60 0.00  0.0  3729658
  U        -      -   -         -        -      -  -      -   -         -  -  -  0 99 0.20 99.2        -
ALL 63967706 138554 844 107556173 37965853 847660  1 219513 100 155901433  0  0  0 99 0.00  0.8 88201084

Zabbix correctly reports number of CPUs:

$ /usr/local/bin/zabbix_get -s localhost -k system.cpu.num                                  
4

However LLD discovers 5 CPUs in state "online" (reformatted for readability):

$ /usr/local/bin/zabbix_get -s localhost -k system.cpu.discovery
{"data":[
   {"{#CPU.NUMBER}":0,"{#CPU.STATUS}":"online"},
   {"{#CPU.NUMBER}":1,"{#CPU.STATUS}":"online"},
   {"{#CPU.NUMBER}":2,"{#CPU.STATUS}":"online"},
   {"{#CPU.NUMBER}":3,"{#CPU.STATUS}":"online"},
   {"{#CPU.NUMBER}":4,"{#CPU.STATUS}":"online"},
   {"{#CPU.NUMBER}":5,"{#CPU.STATUS}":"offline"},
   {"{#CPU.NUMBER}":6,"{#CPU.STATUS}":"offline"},
   {"{#CPU.NUMBER}":7,"{#CPU.STATUS}":"offline"},
   {"{#CPU.NUMBER}":8,"{#CPU.STATUS}":"offline"},
   {"{#CPU.NUMBER}":9,"{#CPU.STATUS}":"offline"},
   {"{#CPU.NUMBER}":10,"{#CPU.STATUS}":"offline"},
   {"{#CPU.NUMBER}":11,"{#CPU.STATUS}":"offline"},
   {"{#CPU.NUMBER}":12,"{#CPU.STATUS}":"offline"},
   {"{#CPU.NUMBER}":13,"{#CPU.STATUS}":"offline"},
   {"{#CPU.NUMBER}":14,"{#CPU.STATUS}":"offline"},
   {"{#CPU.NUMBER}":15,"{#CPU.STATUS}":"offline"}
]}
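
The mismatch above can be checked mechanically. The following is a minimal Python sketch, not part of the agent: it parses the discovery output shown above (offline entries 6 to 15 are trimmed for brevity) and compares the number of CPUs reported as "online" with the system.cpu.num value. The function name online_cpus and the embedded sample string are illustrative assumptions.

```python
import json

# LLD output copied from the zabbix_get call above, trimmed after CPU 5.
DISCOVERY = '''{"data":[
   {"{#CPU.NUMBER}":0,"{#CPU.STATUS}":"online"},
   {"{#CPU.NUMBER}":1,"{#CPU.STATUS}":"online"},
   {"{#CPU.NUMBER}":2,"{#CPU.STATUS}":"online"},
   {"{#CPU.NUMBER}":3,"{#CPU.STATUS}":"online"},
   {"{#CPU.NUMBER}":4,"{#CPU.STATUS}":"online"},
   {"{#CPU.NUMBER}":5,"{#CPU.STATUS}":"offline"}
]}'''


def online_cpus(discovery_json):
    """Return the CPU numbers that LLD reports as online."""
    rows = json.loads(discovery_json)["data"]
    return [row["{#CPU.NUMBER}"] for row in rows if row["{#CPU.STATUS}"] == "online"]


cpu_num = 4  # value returned by system.cpu.num above
online = online_cpus(DISCOVERY)
extra = len(online) - cpu_num  # CPUs that LLD marks online beyond system.cpu.num
print(f"online in LLD: {online}, system.cpu.num: {cpu_num}, extra: {extra}")
```

With the data from this ticket, the sketch reports one extra online CPU (cpu4).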

Running script

/usr/local/bin/zabbix_get -s localhost -k system.cpu.util[0,idle,avg5]
/usr/local/bin/zabbix_get -s localhost -k system.cpu.util[1,idle,avg5]
/usr/local/bin/zabbix_get -s localhost -k system.cpu.util[2,idle,avg5]
/usr/local/bin/zabbix_get -s localhost -k system.cpu.util[3,idle,avg5]
/usr/local/bin/zabbix_get -s localhost -k system.cpu.util[4,idle,avg5]
/usr/local/bin/zabbix_get -s localhost -k system.cpu.util[5,idle,avg5]
/usr/local/bin/zabbix_get -s localhost -k system.cpu.util[6,idle,avg5]
/usr/local/bin/zabbix_get -s localhost -k system.cpu.util[7,idle,avg5]
/usr/local/bin/zabbix_get -s localhost -k system.cpu.util[8,idle,avg5]
/usr/local/bin/zabbix_get -s localhost -k system.cpu.util[9,idle,avg5]
/usr/local/bin/zabbix_get -s localhost -k system.cpu.util[10,idle,avg5]
/usr/local/bin/zabbix_get -s localhost -k system.cpu.util[11,idle,avg5]
/usr/local/bin/zabbix_get -s localhost -k system.cpu.util[12,idle,avg5]
/usr/local/bin/zabbix_get -s localhost -k system.cpu.util[13,idle,avg5]
/usr/local/bin/zabbix_get -s localhost -k system.cpu.util[14,idle,avg5]
/usr/local/bin/zabbix_get -s localhost -k system.cpu.util[15,idle,avg5]

produces

98.371562
99.763491
99.833444
100.000000
0.000000   <---- Kim already reported this strange value.
ZBX_NOTSUPPORTED: Cannot obtain CPU information.
ZBX_NOTSUPPORTED: Cannot obtain CPU information.
ZBX_NOTSUPPORTED: Cannot obtain CPU information.
ZBX_NOTSUPPORTED: Cannot obtain CPU information.
ZBX_NOTSUPPORTED: Cannot obtain CPU information.
ZBX_NOTSUPPORTED: Cannot obtain CPU information.
ZBX_NOTSUPPORTED: Cannot obtain CPU information.
ZBX_NOTSUPPORTED: Cannot obtain CPU information.
ZBX_NOTSUPPORTED: Cannot obtain CPU information.
ZBX_NOTSUPPORTED: Cannot obtain CPU information.
ZBX_NOTSUPPORTED: Cannot obtain CPU information.

When called as perfstat_cpu(NULL, NULL, sizeof(ps_cpu), 0), perfstat_cpu() returns 4 as the number of available CPU statistics, which is correct.

The wrong status of "cpu4" is set in update_cpustats(), as shown by additional debug logging:

7536656:20170714:091126.519 in update_cpustats(): idx=1 ps_id.name=cpu0
7536656:20170714:091126.519 in update_cpustats(): idx=2 ps_id.name=cpu1
7536656:20170714:091126.519 in update_cpustats(): idx=3 ps_id.name=cpu2
7536656:20170714:091126.519 in update_cpustats(): idx=4 ps_id.name=cpu3
7536656:20170714:091126.520 in update_cpustats(): idx=5 ps_id.name=cpu4   <--- So perfstat_cpu() returned data for "cpu4", which was expected to be "offline"
7536656:20170714:091126.520 in update_cpustats(): idx=6 ps_id.name=cpu5
7536656:20170714:091126.520 in update_cpustats(): idx=6 ps_id.name=cpu5 perfstat_cpu() returned -1, will set this CPU status to OFFLINE
7536656:20170714:091126.520 in update_cpustats(): idx=7 ps_id.name=cpu6
7536656:20170714:091126.520 in update_cpustats(): idx=7 ps_id.name=cpu6 perfstat_cpu() returned -1, will set this CPU status to OFFLINE
....

Comment by Andris Mednis [ 2017 Jul 14 ]

To fix LLD, I propose changing update_cpu_counters() to use only 2 calls of perfstat_cpu(): the first gets the number of CPU statistics, and the second retrieves all of them.
Once LLD is fixed, the CPU load calculation should be investigated.
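
The two-call pattern proposed above can be illustrated with a small simulation. This Python sketch does not call the real perfstat library and is not the agent's code: fake_perfstat_cpu is a hypothetical stand-in that mimics only the behaviour described in this ticket (asked for zero entries, it returns the number of available CPU statistics; otherwise it returns that many entries). The names AVAILABLE and discover are also assumptions.

```python
# Hypothetical stand-in for the AIX perfstat_cpu() call; NOT the real API.
AVAILABLE = ["cpu0", "cpu1", "cpu2", "cpu3"]


def fake_perfstat_cpu(first_name, want):
    """With want == 0, return the number of available CPU statistics.

    Otherwise return up to `want` CPU names starting at first_name
    ("" means the first CPU).
    """
    if want == 0:
        return len(AVAILABLE)
    start = 0 if first_name == "" else AVAILABLE.index(first_name)
    return AVAILABLE[start:start + want]


def discover(max_cpus):
    """Two calls only: get the count, then fetch all statistics at once."""
    count = fake_perfstat_cpu("", 0)      # call 1: how many CPUs have statistics
    names = fake_perfstat_cpu("", count)  # call 2: fetch exactly that many
    seen = set(names)
    # Any slot not covered by the second call is reported as offline.
    return [(i, "online" if f"cpu{i}" in seen else "offline") for i in range(max_cpus)]


status = discover(8)
print(status)
```

Because the second call asks for exactly the count returned by the first, no slot beyond cpu3 is probed individually, so cpu4 can no longer be marked online by mistake.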

Comment by Andrea Biscuola (Inactive) [ 2017 Oct 19 ]

Hi all

It seems to me that the Zabbix agent doesn't properly handle the AIX hypervisor's ability to automatically scale the hardware assigned to a virtual machine. Also, when the Zabbix agent is monitoring an AIX LPAR or WPAR, it should report details only of the hardware assigned to the VM and NOT of the whole machine.
What andris reported may be related to automatic hardware scaling in action, where a CPU is being attached or detached but the operation has not yet completed (preparation phase).
As a note, the SMT (symmetric multi-threading) level can vary between POWER processor models (some have SMT-2, some SMT-4, some SMT-8 and so on), but for a WPAR or LPAR this should not matter, as each logical core assigned to it is shown as a CPU.

IMHO, we should check the documentation of the perfstat APIs we are using (because we are probably using them incorrectly) and re-implement the CPU checks for AIX.

Comment by Andrea Biscuola (Inactive) [ 2017 Oct 19 ]

After more investigation: the perfstat APIs have a dedicated series of calls for gathering statistics from the current WPAR (which is what we want). In my opinion, the Zabbix agent should be modified to:

1 - identify whether we are inside a WPAR
2 - use the right calls to retrieve the CPU statistics
3 - report them.

Currently we are using calls that are meant for the global environment (and not for monitoring a single VM).

Comment by Andris Zeila [ 2017 Oct 19 ]

I agree about monitoring only the hardware assigned to the current VM. However, I'm not sure how to deal with possible hardware changes. Zabbix keeps CPU load statistics for the last 15 minutes. If the CPUs start changing, the cached data would have to be invalidated. If the CPUs change often, we cannot provide such statistics.
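
The concern above can be made concrete with a minimal Python sketch. It is an illustration only, not the agent's collector: the class name CpuHistory, the one-sample-per-second interval, and the policy of clearing everything on a CPU count change are all assumptions.

```python
from collections import deque

# Assumed: one sample per second, kept for 15 minutes.
MAX_SAMPLES = 15 * 60


class CpuHistory:
    """Keeps recent per-CPU idle samples and drops them when the CPU set changes."""

    def __init__(self):
        self.cpu_count = None
        self.samples = deque(maxlen=MAX_SAMPLES)
        self.resets = 0

    def add(self, idle_per_cpu):
        # A different number of CPUs makes old samples incomparable: start over.
        if self.cpu_count is not None and len(idle_per_cpu) != self.cpu_count:
            self.samples.clear()
            self.resets += 1
        self.cpu_count = len(idle_per_cpu)
        self.samples.append(list(idle_per_cpu))

    def average(self, cpu):
        """Average idle value for one CPU over the retained samples, or None if empty."""
        if not self.samples:
            return None
        return sum(s[cpu] for s in self.samples) / len(self.samples)


history = CpuHistory()
history.add([98.0, 99.0, 100.0, 100.0])
history.add([96.0, 99.0, 100.0, 100.0])
history.add([90.0, 95.0, 100.0])  # a CPU was detached: history is reset
print(history.resets, len(history.samples), history.average(0))
```

Under this policy every change in the CPU count discards the accumulated history, so an LPAR whose CPUs change frequently would rarely hold enough samples for a 15-minute average.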

Comment by Andrea Biscuola (Inactive) [ 2017 Oct 23 ]

Partial fix in svn://svn.zabbix.com/branches/dev/ZBX-11902

The Zabbix agent was previously counting the number of CPUs
on a global basis (machine-wide), instead of monitoring just
the LPAR (virtual machine) where it was running. Fix the count
by using the right APIs for that.
WPAR partitions remain out of scope for now, as
those share a pool of resources with the parent LPAR and are
more like containers or jails.

Also, the fix does not address the changes required for monitoring
dynamic environments properly. That will be the subject of a
separate issue.

Comment by Andris Mednis [ 2017 Oct 24 ]

Thanks, Andrea! Successfully tested.

Comment by Andris Mednis [ 2017 Oct 24 ]

I think the system.cpu.num[] change (earlier it returned the number of physical CPUs, now it returns the number of logical CPUs) should be documented in "What's new" and "Upgrade notes".

Comment by Andrea Biscuola (Inactive) [ 2017 Oct 24 ]

Released in:

  • pre-3.0.13rc1 r73868
  • pre-3.2.10rc1 r73869
  • pre-3.4.4rc1 r73870
  • pre-4.0.0alpha1 (trunk) r73871

Comment by Andrea Biscuola (Inactive) [ 2017 Oct 24 ]

martins-v

As andris pointed out, the system.cpu.num value now reflects the logical
processors attached to an AIX LPAR rather than the physical ones.

Comment by Martins Valkovskis [ 2017 Oct 25 ]

(1) [D] Updated documentation:

RESOLVED

abs Documentation looks good. CLOSED

Comment by Alexander Vladishev [ 2017 Oct 30 ]

Sub-issue (1) still open. abs, please have a look.

Generated at Thu Apr 25 16:17:23 EEST 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.