  1. ZABBIX FEATURE REQUESTS
  2. ZBXNEXT-4030

CPU utilization in an environment with a variable number of CPUs


    • New Feature Request
    • Resolution: Unresolved
    • Minor
    • None
    • 3.2.7
    • Agent (G)

      Similar to ZBX-11857, we have an ARM board running Linaro Linux. The system is smart in that it activates CPUs depending on load.

      Unfortunately, in such an environment with a dynamic number of CPUs, measuring CPU usage is non-trivial. Consider a scenario where we start with an idle system with 1 online CPU and gradually increase CPU load to bring 2, 3, and 4 CPUs online and consume them fully. As can be seen in the attached screenshot (system-cpu-util.png), Zabbix reports 50% utilization with 1 CPU fully loaded, and then it reports 100% utilization with 2, 3, and 4 CPUs loaded to capacity.

      This is not particularly useful for performance testing, because it is not possible to distinguish between 2 and 4 fully loaded CPUs. However, this behavior is somewhat consistent with other Linux tools like top and vmstat, as will be shown in the comments later. It is also a problem without an obvious solution: if 2 CPUs are fully loaded on a 4-CPU system but only those 2 are online, then, depending on how we look at it and on whether the other 2 can actually be brought online, both 50% and 100% are acceptable answers.
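The ambiguity can be made concrete with a little arithmetic. The sketch below is only an illustration: the helper name `utilization` and the constant `TOTAL_CPUS` are hypothetical and are not part of Zabbix. It computes the same load relative to the online CPUs and relative to all CPUs present on the board.

```python
def utilization(busy_cpus: float, counted_cpus: int) -> float:
    """Percentage of CPU capacity in use, relative to counted_cpus."""
    if counted_cpus <= 0:
        raise ValueError("counted_cpus must be positive")
    return 100.0 * busy_cpus / counted_cpus


TOTAL_CPUS = 4  # CPUs present on the board (assumption taken from the scenario)

# 2 CPUs fully loaded while only those 2 are online.
online_view = utilization(2, 2)             # offline CPUs ignored
present_view = utilization(2, TOTAL_CPUS)   # offline CPUs counted as idle

print(online_view, present_view)  # 100.0 50.0
```

Under the first view, 2 and 4 fully loaded CPUs are indistinguishable; under the second, they are 50% and 100% respectively.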

      What is probably desired is a mode for system.cpu.util[] that would count offline CPUs as idle, or count the time spent offline as a separate state (e.g., system.cpu.util[,offline]). A solution idea is proposed below.

      Currently, Zabbix seems to read CPU statistics from /proc/stat on Linux:

      # grep cpu /proc/stat
      cpu  7539 0 236984 837919 3226 19 1297 0 0 0
      cpu0 3177 0 58211 610349 2960 19 1220 0 0 0
      cpu1 1130 0 57962 60488 153 0 14 0 0 0
      cpu2 1394 0 59148 70361 26 0 4 0 0 0
      cpu3 1838 0 61663 96721 87 0 59 0 0 0
      

      Here is a brief description of the format:

      /proc/stat
        kernel/system statistics.  Varies with architecture.  Common entries include:
      
        cpu  3357 0 4313 1362393
           The amount of time, measured in units of USER_HZ (1/100ths of a second on most architectures, use sysconf(_SC_CLK_TCK) to obtain the
           right value), that the system spent in various states:
      
           user   (1) Time spent in user mode.
      
           nice   (2) Time spent in user mode with low priority (nice).
      
           system (3) Time spent in system mode.
      
           idle   (4) Time spent in the idle task.  This value should be USER_HZ times the second entry in the /proc/uptime pseudo-file.
      
           ...
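
For illustration, the per-CPU counters described above can be read with a few lines of Python. This is a minimal sketch, not how the Zabbix agent does it: `parse_proc_stat` is a hypothetical helper, and `SAMPLE` is simply the output quoted earlier in this issue. On a live system the text would come from reading /proc/stat instead.

```python
from typing import Dict, List

# Sample taken from the "grep cpu /proc/stat" output quoted in this issue.
SAMPLE = """\
cpu  7539 0 236984 837919 3226 19 1297 0 0 0
cpu0 3177 0 58211 610349 2960 19 1220 0 0 0
cpu1 1130 0 57962 60488 153 0 14 0 0 0
cpu2 1394 0 59148 70361 26 0 4 0 0 0
cpu3 1838 0 61663 96721 87 0 59 0 0 0
"""


def parse_proc_stat(text: str) -> Dict[str, List[int]]:
    """Map each 'cpu*' line to its list of tick counters."""
    counters: Dict[str, List[int]] = {}
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("cpu"):
            counters[fields[0]] = [int(value) for value in fields[1:]]
    return counters


stats = parse_proc_stat(SAMPLE)
user, nice, system, idle = stats["cpu1"][:4]  # first four states, in order
print(user, nice, system, idle)  # 1130 0 57962 60488
```

In this sample the counters of cpu0 add up to far more ticks than those of cpu1, which is what one would expect if cpu1 had spent much of the time offline.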
      

      Now, suppose that cpu1 was online and idle during the last second. Then its fourth number (idle) will increase by USER_HZ (presumably 100). If, however, cpu1 was offline, none of its numbers will increase.

      If cpu1 switched between online and offline during the last second, then a reasonable conjecture is that the increases in its numbers will add up to less than USER_HZ. This is approximately how we can infer from these statistics that the CPU was offline, and estimate for how long: the shortfall is the time spent offline.
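
The conjecture above can be sketched as follows. This is only an illustration under stated assumptions: USER_HZ is taken to be 100, the sampling interval is assumed to be known exactly, and `shares_with_offline` is a hypothetical helper rather than anything in Zabbix. Only the first eight counters are summed, on the understanding that guest time is already included in user and nice time.

```python
from typing import Dict, Sequence

# Names of the first eight /proc/stat counters, in order.
STATES = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def shares_with_offline(prev: Sequence[int], curr: Sequence[int],
                        interval_s: float, user_hz: int = 100) -> Dict[str, float]:
    """Per-state percentages for one CPU, with unaccounted ticks as 'offline'."""
    expected = round(interval_s * user_hz)  # ticks a CPU online all the time would log
    deltas = [c - p for p, c in zip(prev[:len(STATES)], curr[:len(STATES)])]
    offline = max(0, expected - sum(deltas))  # shortfall is attributed to offline time
    total = sum(deltas) + offline
    if total == 0:
        return {name: 0.0 for name in STATES + ("offline",)}
    shares = {name: 100.0 * d / total for name, d in zip(STATES, deltas)}
    shares["offline"] = 100.0 * offline / total
    return shares


# cpu1 counters one second apart: only 40 idle ticks were logged.
before = [1130, 0, 57962, 60488, 153, 0, 14, 0, 0, 0]
after = [1130, 0, 57962, 60528, 153, 0, 14, 0, 0, 0]
result = shares_with_offline(before, after, interval_s=1.0)
print(result["idle"], result["offline"])  # 40.0 60.0
```

In practice the interval between two samples is never exactly one second, so a real implementation would have to measure it and tolerate a few ticks of jitter before reporting offline time.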

      It should also be noted separately that, if cpu1 was online for just a little while and only its "user" counter increased (e.g., by just 5 out of 100), then system.cpu.util[1,user] will probably report 100% for that period, which is misleading.
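
The arithmetic behind this example is shown below. It assumes, as the text above only presumes, that the percentage is computed relative to the ticks actually accounted for; `user_percent` is a hypothetical helper, not a Zabbix function.

```python
def user_percent(user_delta: int, denominator: int) -> float:
    """User time as a percentage of the chosen denominator (in ticks)."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return 100.0 * user_delta / denominator


USER_HZ = 100        # assumed ticks per second
user_delta = 5       # only "user" increased, by 5 ticks
accounted = 5        # sum of all counter increases for cpu1 in that second

relative_to_accounted = user_percent(user_delta, accounted)  # offline time ignored
relative_to_second = user_percent(user_delta, USER_HZ)       # offline time counted

print(relative_to_accounted, relative_to_second)  # 100.0 5.0
```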

        1. system-cpu-util.png
          50 kB
          Aleksandrs Saveljevs

            Unassigned Unassigned
            asaveljevs Aleksandrs Saveljevs