ZABBIX BUGS AND ISSUES / ZBX-24080

Zabbix Agent 2 Plugins Cpu and VFSDev fills up capacity


    • Type: Problem report
    • Resolution: Unresolved
    • Priority: Trivial
    • Fix Version/s: None
    • Affects Version/s: 6.4.11
    • Component/s: Agent2 plugin (N)
    • Labels: None
    • Environment: Debian 10 with Proxmox 6.4 and self-compiled kernel version 5.3.13

      Steps to reproduce:
      Currently not possible to reproduce deliberately. It occurs randomly on about 1% of 7000 hosts per week. It does not occur again on the same host once debug logging has been enabled, and I would like to avoid enabling debug logging on all 7000 hosts.
      So let me describe the environment:

      • all hosts are Debian-based Proxmox hosts of different versions
      • all run Zabbix Agent 2 version 6.4.11; server and proxies are also 6.4.11
      • only 2 Agent 2 Plugins are affected: Cpu and VFSDev
      • roughly 120 agent items per host, mostly built-in item keys with intervals between 5 and 15 minutes
        • for the Cpu plugin, the following item keys are used:
          • system.cpu.util
          • system.cpu.util[,iowait,avg1]
          • system.cpu.util[,idle,avg1]
          • system.cpu.num
        • ZabbixAsync Plugin with Item key system.cpu.load is still working, like all other item keys
      • normally all capacity counters are 0
        • I also ran a stress test, calling the affected item keys thousands of times from multiple threads during normal operation, but it worked well.
      • it seems there is a special scenario in which the agent doesn't free up the threads
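
A minimal sketch of such a stress test, for anyone who wants to repeat it. It is not the script that was actually used: the helper names `zabbix_get` and `stress`, the worker and round counts, and the rule "non-zero exit code or empty output means failure" are all assumptions. Only the item keys and the `zabbix_get` options are taken from this report.

```python
import subprocess
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Item keys of the Cpu plugin listed in this report.
KEYS = [
    "system.cpu.util",
    "system.cpu.util[,iowait,avg1]",
    "system.cpu.util[,idle,avg1]",
    "system.cpu.num",
]


def zabbix_get(host, key, psk_identity, psk_file, timeout=10):
    """Query one item key; return True if a value came back.

    Assumption: a failed query shows up as a non-zero exit code or empty output.
    """
    cmd = [
        "zabbix_get", "-s", host,
        "--tls-connect", "psk",
        "--tls-psk-identity", psk_identity,
        "--tls-psk-file", psk_file,
        "-k", key,
    ]
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0 and proc.stdout.strip() != ""


def stress(fetch, keys, rounds=1000, workers=16):
    """Call fetch(key) `rounds` times per key from `workers` threads.

    Returns a dict mapping each key to its number of failed calls.
    """
    jobs = [key for _ in range(rounds) for key in keys]
    failures = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the submitted jobs.
        for key, ok in zip(jobs, pool.map(fetch, jobs)):
            if not ok:
                failures[key] += 1
    return dict(failures)


if __name__ == "__main__":
    # "<ip>" is the same placeholder as in the commands of this report.
    def fetch(key):
        return zabbix_get("<ip>", key, "Agent_m16346",
                          "/etc/zabbix/zabbix_agent.psk")

    print(stress(fetch, KEYS))
```

An empty result means every call returned a value, which matches what was observed on healthy agents.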

      Checked also:

      • no memory leak: working and broken agents both need roughly 30 MB of RSS memory
      • threads are not cleaned up over time, even if there are no new requests to the agent (blocked via iptables for debugging)
      • open file count is far below any limit
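
The checks above can be repeated on a working and a broken agent with a small snapshot helper. This is only a sketch, assuming a Linux `/proc` filesystem. The function names are hypothetical. The `Threads` field counts OS threads, so it is only a rough proxy for the plugin tasks inside the agent.

```python
import os


def parse_proc_status(text):
    """Extract RSS (kB) and OS thread count from /proc/<pid>/status content."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            fields[name.strip()] = value.strip()
    return {
        "rss_kb": int(fields["VmRSS"].split()[0]),  # e.g. "30720 kB"
        "threads": int(fields["Threads"]),
    }


def snapshot(pid):
    """Collect RSS, thread count and open file descriptor count for one process."""
    with open(f"/proc/{pid}/status") as fh:
        info = parse_proc_status(fh.read())
    info["open_fds"] = len(os.listdir(f"/proc/{pid}/fd"))
    return info
```

Calling `snapshot(pid)` with the PID of the agent process (reading `/proc/<pid>/fd` usually requires root or the agent's own user) gives three numbers that can be compared between hosts.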

      Result:
      Capacity and tasks of the affected plugins fill up. Metrics output:

      [Cpu]
      active: true
      capacity: 100/100
      check on start: 0
      tasks: 6484
      system.cpu.discovery: List of detected CPUs/CPU cores, used for low-level discovery.
      system.cpu.num: Number of CPUs.
      system.cpu.util: CPU utilization percentage.
      
      [VFSDev]
      active: true
      capacity: 100/100
      check on start: 0
      tasks: 220
      vfs.dev.discovery: List of block devices and their type. Used for low-level discovery.
      vfs.dev.read: Disk read statistics.
      vfs.dev.write: Disk write statistics.
      
      proxy1 # zabbix_get -s <ip> --tls-connect psk --tls-psk-identity "Agent_m16346" --tls-psk-file /etc/zabbix/zabbix_agent.psk -k system.cpu.num
      zabbix_get [2553466]: Timeout while executing operation
      proxy1 # zabbix_get -s <ip> --tls-connect psk --tls-psk-identity "Agent_m16346" --tls-psk-file /etc/zabbix/zabbix_agent.psk -k system.cpu.load
      57.280000
      

      Debug logs show no helpful information once the capacity limit has already been reached.
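
Since debug logging only helps before the limit is reached, affected hosts could be detected early by scanning the metrics output for plugins whose capacity is exhausted. The parser below is a sketch that assumes the output format shown above (a `[Plugin]` header followed by `capacity: used/limit` and `tasks: n` lines). The function name `saturated_plugins` is hypothetical.

```python
import re

SECTION_RE = re.compile(r"^\[(\w+)\]$")
CAPACITY_RE = re.compile(r"^capacity:\s*(\d+)/(\d+)$")
TASKS_RE = re.compile(r"^tasks:\s*(\d+)$")


def saturated_plugins(metrics_text):
    """Return {plugin: (used, limit, tasks)} for plugins with no free capacity."""
    plugins = {}
    current = None
    for raw in metrics_text.splitlines():
        line = raw.strip()
        match = SECTION_RE.match(line)
        if match:
            current = match.group(1)
            plugins[current] = {"used": 0, "limit": 0, "tasks": 0}
            continue
        if current is None:
            continue  # ignore anything before the first [Plugin] header
        match = CAPACITY_RE.match(line)
        if match:
            plugins[current]["used"] = int(match.group(1))
            plugins[current]["limit"] = int(match.group(2))
            continue
        match = TASKS_RE.match(line)
        if match:
            plugins[current]["tasks"] = int(match.group(1))
    return {
        name: (p["used"], p["limit"], p["tasks"])
        for name, p in plugins.items()
        if p["limit"] > 0 and p["used"] >= p["limit"]
    }
```

Fed with the output above, this would report Cpu and VFSDev as saturated. A plugin reporting `capacity: 0/100` would not be listed.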

      Expected:
      Empty capacity and working items

      Additional:
      Any hints on what I could do to give you more details?

            mbuz Maksym Buz
            marcel.japel Marcel Jäpel