[ZBX-9067] Solaris agent item system.cpu.util values returns 0.000000 after a few minutes Created: 2014 Nov 21  Updated: 2022 Jul 26  Resolved: 2015 Jun 15

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Agent (G)
Affects Version/s: 2.4.0, 2.4.1
Fix Version/s: None

Type: Incident report Priority: Blocker
Reporter: Ronny Pettersen Assignee: Unassigned
Resolution: Won't fix Votes: 1
Labels: item, solaris, system.cpu.util, zones
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

zabbix_agents_2.4.1.solaris10.sparc


Attachments: PNG File chart.php.png     Zip Archive zabbix_agentd.log.zip    
Issue Links:
Causes
Duplicate
is duplicated by ZBX-9080 system.cpu.util returns 0 on Solaris ... Closed

 Description   

Some system counters get set to 0.000000 some minutes after restart of agent on Solaris 10 (sparc).
The time this happens varies, usually within 30 minutes of agent restart.
These are the only items affected:

system.cpu.util[,user]
system.cpu.util[,system]
system.cpu.util[,iowait]
system.cpu.util[,idle]

This behavior is observed with the Zabbix Solaris 10 (SPARC) agent versions 2.4.0 and 2.4.1 (pre-compiled binaries downloaded from Zabbix).

See log with debug (attached) - first occurrence of the problem (13 minutes after agent start):
4918:20141121:102703.308 for key [system.cpu.util[,idle]] received value [0.000000]

zabbix@HostName /opt/zabbix $ bin/zabbix_get -s localhost -k "system.cpu.switches"
5723982175
zabbix@HostName /opt/zabbix $ bin/zabbix_get -s localhost -k "system.cpu.util[,user]"
0.000000
zabbix@HostName /opt/zabbix $ bin/zabbix_get -s localhost -k "system.cpu.util[,system]"
0.000000
zabbix@HostName /opt/zabbix $ bin/zabbix_get -s localhost -k "system.cpu.util[,iowait]"
0.000000
zabbix@HostName /opt/zabbix $ bin/zabbix_get -s localhost -k "system.cpu.util[,idle]"
0.000000

<restart agent>

zabbix@HostName  /opt/zabbix $ bin/zabbix_get -s localhost -k "system.cpu.util[,user]"
2.330420
zabbix@HostName /opt/zabbix $ bin/zabbix_get -s localhost -k "system.cpu.util[,idle]"
94.382490


 Comments   
Comment by Aleksandrs Saveljevs [ 2014 Nov 26 ]

Are Solaris zones used? Could this be the same as ZBX-9080?

Comment by Aleksandrs Saveljevs [ 2014 Dec 11 ]

There has been no answer from the reporter, but we shall assume that the issue is the same as in ZBX-9080 and Solaris zones with dynamic CPU allocation are involved.

Let me describe how CPU monitoring currently works on Solaris. When the agent starts up, it determines the CPU count on the system. It then allocates an area in shared memory that keeps counters for each CPU. The size of this memory does not change and the number of CPUs monitored is fixed. Moreover, the exact set of CPUs that are monitored is fixed, too. So if at agent startup the allocated CPUs were 1, 4, 5, and later changed to 1, 5, 8, then CPU 8 will not be monitored. Since ZBX-6576, when CPU 8 is encountered, the agent will log a line like the following and be satisfied with it:

  4916:20141121:101735.211 1 new processor(s) added. Restart Zabbix agentd to enable collecting new data.

This line is also present in the attached log. Thus, the behavior is more or less expected - the agent should be restarted to pick up the new CPUs.
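The fixed CPU set described above can be illustrated with a small toy model. This is only a sketch and not the agent's code from cpustat.c: the class name FixedCpuCollector, the snapshot layout and the "no progress means 0.0" rule are assumptions made for illustration. It shows that a CPU added after startup is noticed but never measured, and that the reported utilization drops to zero once none of the originally monitored CPUs are present.

```python
class FixedCpuCollector:
    """Toy model (hypothetical) of a collector whose CPU set is frozen at startup."""

    def __init__(self, cpu_ids):
        # One slot per CPU seen at startup; the set of slots never changes.
        self.slots = {cpu_id: None for cpu_id in cpu_ids}

    def update(self, snapshot):
        """snapshot maps cpu_id -> (busy_ticks, total_ticks), both cumulative.

        Returns (utilization in percent, list of CPU ids that are not monitored).
        """
        new_cpus = sorted(set(snapshot) - set(self.slots))
        busy_delta = total_delta = 0
        for cpu_id, prev in self.slots.items():
            cur = snapshot.get(cpu_id)
            if cur is None:
                continue  # CPU is gone from this zone: its slot stops advancing
            if prev is not None:
                busy_delta += cur[0] - prev[0]
                total_delta += cur[1] - prev[1]
            self.slots[cpu_id] = cur
        # Assumption: with no counter progress the item reports 0.0.
        util = 100.0 * busy_delta / total_delta if total_delta else 0.0
        return util, new_cpus


collector = FixedCpuCollector([1, 4, 5])
collector.update({1: (0, 0), 4: (0, 0), 5: (0, 0)})
print(collector.update({1: (50, 100), 4: (50, 100), 5: (50, 100)}))
# CPU 4 is replaced by CPU 8: the load on CPU 8 is invisible to the collector.
print(collector.update({1: (100, 200), 5: (100, 200), 8: (90, 100)}))
# All startup CPUs are gone: nothing advances any more.
print(collector.update({8: (180, 200), 9: (90, 100)}))
```

In this model, a restart corresponds to constructing a new FixedCpuCollector with the current CPU ids, which matches the reporter's observation that values recover after restarting the agent.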

One solution to the above problem would be to replace CPU 4 with CPU 8 in the example above, and monitor CPU 8 instead of CPU 4 (if we are in a non-global zone; if we are in a global zone, we have to keep the CPU set constant). This looks doable, but there is another problem, described below.

According to src/zabbix_agent/cpustat.c, we use "cpu_stat" kstat module (see refresh_kstat() function) to read CPU metrics. The problem seems to be that kstat metrics are global - even if read in a non-global zone, they still contain the same data as for the global zone. So even if we apply the change above, the item would still return numbers for the global zone.

The same seems to happen with "top" command. If we launch a CPU-intensive job in the global zone, then "top" in both global zone and non-global zone show CPU usage of 100%.

For Solaris 11, there seems to be "zonestat" utility that would show zone-specific numbers (http://docs.oracle.com/cd/E26502_01/html/E29030/zonestat-1.html). However, according to "zonestatd" manual (http://docs.oracle.com/cd/E26502_01/html/E29031/zonestatd-1m.html), the daemon on which "zonestat" relies, it "does not constitute a programming interface; it is classified as a private interface", which would indicate that there is no public API to get these zone-specific numbers.

For Solaris 10, there seems to be a "zonestat.pl" script (downloaded from http://www.unixarena.com/2013/05/solaris-local-zone-wise-memory-cpu.html), but it might not be precise either:

# Problems:
#  * By far the "most broken" part of this prototype is CPU%.
#    It is not difficult to create surprising results, e.g. on a CMT system,
#    set a CPU cap on a zone in a pool, and run a few CPU-bound processes:
#    the "Pset Used" column will not reach the CPU cap.

So the conclusion is that a proper fix might not be trivial.

wiper I have to agree with the above conclusion. I found that prstat -Z displays per-zone CPU usage, but it appears to simply sum the CPU values of the processes. I also investigated zonestatd a little, and from what I understood it likewise calculates zone CPU usage by summing process CPU usage.
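The "sum over processes" approach mentioned above can be sketched as follows. This is a minimal illustration only: the function name zone_cpu_usage and the (zone, cpu_percent) input format are hypothetical, and the sketch does not read real process data or reproduce how prstat or zonestatd are implemented.

```python
from collections import defaultdict


def zone_cpu_usage(processes):
    """Sum per-process CPU percentages by zone.

    processes is an iterable of (zone_name, cpu_percent) pairs; in a real
    collector these would have to be gathered by walking every process.
    """
    totals = defaultdict(float)
    for zone, cpu_percent in processes:
        totals[zone] += cpu_percent
    return dict(totals)


# Hypothetical sample: one process in the global zone, two in zone "web".
sample = [("global", 1.5), ("web", 10.0), ("web", 2.5)]
print(zone_cpu_usage(sample))
```

The cost of this approach grows with the number of processes on the system, since every process has to be visited on each collection cycle.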

Comment by Andris Zeila [ 2015 Jan 16 ]

So to clarify - we are going to fix the initial problem with CPUs moving between zones as described by Aleksandrs.

Regarding the statistics problem - there is not much we can currently do (iterating through processes to calculate zone CPU usage is not really an option for the CPU collector). But we must document that zone CPU statistics (utilization, load) are based on the statistics of the processor set assigned to the zone. If this processor set is shared with other zones, then the CPU statistics are the combined statistics for all zones using this processor set.

wiper Unfortunately, changing CPUs would mean we have to reset their statistics. And if that happens often, then trying to calculate average values (1-15 min) becomes somewhat useless. So there is no satisfying solution available currently.
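To illustrate why frequent resets undermine the longer averages, here is a minimal sketch. It assumes one utilization sample per second and a history that is cleared whenever the CPU set changes; the class RollingCpuHistory and the choice to return None when there is not enough history are assumptions for illustration, not the agent's actual behavior.

```python
from collections import deque


class RollingCpuHistory:
    """Hypothetical per-second utilization history covering up to 15 minutes."""

    def __init__(self, max_seconds=900):
        self.samples = deque(maxlen=max_seconds)

    def add(self, util):
        self.samples.append(util)

    def reset(self):
        # Called when the monitored CPU set changes: old samples are discarded.
        self.samples.clear()

    def average(self, seconds):
        # Assumption: without enough history no average is reported.
        if len(self.samples) < seconds:
            return None
        recent = list(self.samples)[-seconds:]
        return sum(recent) / seconds


history = RollingCpuHistory()
for _ in range(900):
    history.add(50.0)
print(history.average(900))  # full 15 minutes of history available

history.reset()              # CPUs moved between zones
for _ in range(300):
    history.add(50.0)
print(history.average(60))   # 1-minute average is available again
print(history.average(900))  # 15-minute average is not, 5 minutes after a reset
```

If the CPU set changes more often than every 15 minutes, the 15-minute average in this model never becomes available.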

Comment by Andris Zeila [ 2015 Jun 15 ]

The system.cpu.util key should work correctly on kernel zones introduced in Solaris 11.2.

As for non-kernel zones - the proc.cpu.util (to be added in Zabbix 3.0, see ZBXNEXT-494) can be used as a workaround to monitor cpu utilization. It will use more system resources though.
