-
Problem report
-
Resolution: Unresolved
-
Trivial
-
None
-
6.4.11
-
None
-
Debian 10 with Proxmox 6.4 and a self-compiled kernel, version 5.3.13
Steps to reproduce:
It is currently not possible to reproduce this deliberately. It occurs randomly on about 1% of 7000 hosts per week. It does not occur again on the same host once debug logging has been enabled, and I would like to avoid enabling debug logging on all 7000 hosts.
So let me describe the environment:
- all hosts are Debian-based Proxmox hosts of different versions
- all run Zabbix Agent 2 version 6.4.11; server and proxies are also 6.4.11
- only 2 Agent 2 plugins are affected: Cpu and VFSDev
- roughly 120 agent items per host, mostly built-in item keys with intervals between 5 and 15 minutes
- for the Cpu plugin, the following item keys are used:
- system.cpu.util
- system.cpu.util[,iowait,avg1]
- system.cpu.util[,idle,avg1]
- system.cpu.num
- the ZabbixAsync plugin with item key system.cpu.load still works, as do all other item keys
- normally all capacity counters are 0
- I also ran a stress test, calling the affected item keys thousands of times from multiple threads while the agent was still working, and it held up fine.
- it seems there is a special scenario in which the agent does not free up the threads
Checked also:
- no memory leak: working and broken agents both need roughly 30 MB of RSS memory
- threads are not cleaned up over time, even when no new requests reach the agent (blocked via iptables for debugging)
- open file count is far below any limit
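The three checks above could be collected per host with a small script along these lines. This is only a sketch, assuming a Linux /proc filesystem; the function names parse_proc_status and snapshot are hypothetical, and the agent's PID has to be supplied by the caller.

```python
import os


def parse_proc_status(text):
    """Extract RSS (kB) and thread count from the text of /proc/<pid>/status."""
    fields = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    # VmRSS looks like "30720 kB"; Threads is a bare integer.
    rss_kb = int(fields.get("VmRSS", "0 kB").split()[0])
    threads = int(fields.get("Threads", "0"))
    return {"rss_kb": rss_kb, "threads": threads}


def snapshot(pid):
    """Read RSS, thread count and open file count for one process (Linux only)."""
    with open(f"/proc/{pid}/status") as fh:
        info = parse_proc_status(fh.read())
    # Each entry in /proc/<pid>/fd is one open file descriptor.
    info["open_files"] = len(os.listdir(f"/proc/{pid}/fd"))
    return info
```

Taking a snapshot of a working and a broken agent side by side makes it easier to show that memory and file descriptors are the same while the thread count differs.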
Result:
The capacity and task counters of the affected plugins fill up. Metrics output:
[Cpu]
active: true
capacity: 100/100
check on start: 0
tasks: 6484
system.cpu.discovery: List of detected CPUs/CPU cores, used for low-level discovery.
system.cpu.num: Number of CPUs.
system.cpu.util: CPU utilization percentage.
[VFSDev]
active: true
capacity: 100/100
check on start: 0
tasks: 220
vfs.dev.discovery: List of block devices and their type. Used for low-level discovery.
vfs.dev.read: Disk read statistics.
vfs.dev.write: Disk write statistics.
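Since enabling debug logging on every host is not an option, one way to catch affected agents early might be to parse this metrics output and flag plugins whose capacity is exhausted. The sketch below is an assumption about how that could look: it expects text in the format shown above (for example as printed by `zabbix_agent2 -R metrics`), and the names parse_metrics and saturated are hypothetical.

```python
import re

SECTION = re.compile(r"\[(\w+)\]")            # plugin header, e.g. [Cpu]
CAPACITY = re.compile(r"capacity:\s*(\d+)/(\d+)")
TASKS = re.compile(r"tasks:\s*(\d+)")


def parse_metrics(text):
    """Return {plugin: {"used", "limit", "tasks"}} from agent metrics output."""
    # split() with one capture group yields [preamble, name1, body1, name2, body2, ...]
    parts = SECTION.split(text)
    plugins = {}
    for name, body in zip(parts[1::2], parts[2::2]):
        cap = CAPACITY.search(body)
        if cap is None:
            continue  # not a plugin section we understand
        tasks = TASKS.search(body)
        plugins[name] = {
            "used": int(cap.group(1)),
            "limit": int(cap.group(2)),
            "tasks": int(tasks.group(1)) if tasks else 0,
        }
    return plugins


def saturated(plugins):
    """Names of plugins whose used capacity has reached the limit."""
    return sorted(
        name for name, p in plugins.items()
        if p["limit"] > 0 and p["used"] >= p["limit"]
    )
```

Because the parser searches by pattern rather than by line, it works on both the multi-line output and a copy that has been flattened into a single line.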
proxy1 # zabbix_get -s <ip> --tls-connect psk --tls-psk-identity "Agent_m16346" --tls-psk-file /etc/zabbix/zabbix_agent.psk -k system.cpu.num
zabbix_get [2553466]: Timeout while executing operation
proxy1 # zabbix_get -s <ip> --tls-connect psk --tls-psk-identity "Agent_m16346" --tls-psk-file /etc/zabbix/zabbix_agent.psk -k system.cpu.load
57.280000
Debug logs show no helpful information once the capacity limit has already been reached.
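The comparison above (a Cpu plugin key times out while a ZabbixAsync key answers) could be automated from a proxy roughly as follows. This is a sketch only: it reuses the zabbix_get options shown in the commands above, the helper names build_cmd, classify and probe are hypothetical, and the 30 second subprocess timeout is an arbitrary assumption.

```python
import subprocess

TIMEOUT_MARKER = "Timeout while executing operation"


def build_cmd(host, key, identity, psk_file):
    """Build the zabbix_get command line used in the report."""
    return [
        "zabbix_get", "-s", host,
        "--tls-connect", "psk",
        "--tls-psk-identity", identity,
        "--tls-psk-file", psk_file,
        "-k", key,
    ]


def classify(output):
    """Map zabbix_get output to 'timeout' or 'ok'."""
    return "timeout" if TIMEOUT_MARKER in output else "ok"


def probe(host, identity, psk_file, keys=("system.cpu.num", "system.cpu.load")):
    """Query each key and report which ones time out."""
    results = {}
    for key in keys:
        try:
            proc = subprocess.run(
                build_cmd(host, key, identity, psk_file),
                capture_output=True, text=True, timeout=30,
            )
            results[key] = classify(proc.stdout + proc.stderr)
        except subprocess.TimeoutExpired:
            results[key] = "timeout"
    return results
```

A host where system.cpu.num reports "timeout" while system.cpu.load reports "ok" would match the pattern described in this report.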
Expected:
Empty capacity and working items
Additional:
Any hints on what I could do to give you more details?