[ZBX-24504] Proxy monitoring kubernetes cluster is too resource hungry Created: 2024 May 17  Updated: 2026 Feb 04  Resolved: 2025 Oct 28

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Server (S), Templates (T)
Affects Version/s: 6.4.13, 6.4.14
Fix Version/s: 7.0.20rc1, 7.4.4rc1, 8.0.0alpha1

Type: Problem report Priority: Major
Reporter: Robin Roevens Assignee: Andris Zeila
Resolution: Fixed Votes: 7
Labels: Proxy, Template, helm, kubernetes
Remaining Estimate: Not Specified
Time Spent: 14h
Original Estimate: Not Specified
Environment:

SLES 15 SP5
K3s v1.27.13+k3s1


Attachments: JPEG File PreprocessingErroreJsonPath.jpg     File get_state_metrics.json.xz     PNG File image-2025-10-30-11-05-03-403.png     PNG File image-2025-10-30-11-06-37-884.png     PNG File modify script.png     PNG File open_script.png     PNG File pod-cpu-usage-proxy-7-0-17.png     PNG File pod-cpu-usage-proxy-7-0-20.png     File prometheus_compound_index.diff     File prometheus_strloc.diff     PNG File proxy-performance.png     Text File zabbix-proxy-7-0-20-diaginfo-preprocessing.txt    
Issue Links:
Duplicate
Sub-task
depends on ZBX-26989 Improve Kubernetes cluster state by H... Confirmed
Team: Team A
Sprint: S25-W40/41, S25-W42/43
Story Points: 2

 Description   

We have set up a few K3s clusters, monitored by Zabbix using the Zabbix-provided Kubernetes templates (updated to the latest available from git.zabbix.com, tag 6.4.14) and Helm charts.

Most clusters do not yet have much workload, and are monitored without problems using these resource limits on the proxy pod in the cluster:
requests:
  memory: "128Mi"
  cpu: "250m"
limits:
  memory: "8Gi"
  cpu: "2000m"
The biggest of the clusters that work without problems runs 18 deployments, 39 replicasets and 41 pods on 3 control-plane and 4 worker nodes, and monitors around 18000 items on 29 hosts. Proxy pollers are around 1.1% busy, the preprocessor manager around 1.9%, and the preprocessor workers around 15% (using the default proxy setting of 3 preprocessors).

 

However, one cluster (also 3 control-plane nodes and 4 workers) has a much higher load: 192 deployments, 523 replicasets (most of them 'old', inactive sets that are kept by Kubernetes for deployment rollback purposes) and 221 pods. I don't think this is a particularly big load for a Kubernetes cluster yet, and about 120 additional projects are scheduled to be deployed on that cluster, so the load will only grow.

I have already altered the Kubernetes cluster state template to skip replicasets with 0 replicas, filtering out the inactive replicasets and preventing about 1500 extra items from being discovered. This currently results in Zabbix trying to monitor roughly 26000 items on 29 hosts for that cluster.
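For reference, a minimal sketch of this kind of filter (illustrative only, not the exact change I made; it assumes the ReplicaSet discovery is a dependent item whose JavaScript preprocessing step returns the LLD rows as a JSON array, and the {#REPLICAS} macro is hypothetical):

// Illustrative only: drop ReplicaSets whose desired replica count is 0
// so they never become LLD rows and never create items.
// 'value' is the preprocessing input string.
var rows = JSON.parse(value);
var active = rows.filter(function (row) {
    return parseInt(row['{#REPLICAS}'], 10) > 0;
});
return JSON.stringify(active);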

This, however, proves to be very challenging for the proxy on that cluster: the 3 default preprocessors are 100% busy and the proxy crashes regularly due to a lack of available preprocessor workers. So we started scaling up the proxy. Currently it is set to
requests:
  memory: "8Gi"
  cpu: "6000m"
limits:
  memory: "16Gi"
  cpu: "12000m"
And the number of preprocessors has been increased to 
ZBX_STARTPREPROCESSORS: 60
Now the preprocessor worker processes are about 80% busy, but the preprocessor manager is at 100% and the poller processes have also gone up to 100% busy. Many items still do not get processed in time, with over 700 items in the queue for more than 10 minutes. The actual CPU usage of the proxy pod is constantly around 7.5 CPUs, and the pod gets OOM-killed about every 2 to 3 hours after hitting the memory limit.

I could still increase the allowed resources, preprocessors and pollers, but I think the resources consumed just for monitoring are getting way out of proportion.

It also strikes me that with only the defaults it is able to monitor 18000 items, yet just 8000 more items suddenly need more than 60 preprocessors and at least 3 times the default resource limits. This looks to me like there is some tipping point where a little more data suddenly requires exponentially more resources.

The culprit seems to be the "Get state metrics" item of the Kubernetes cluster state by HTTP template, which currently returns 37113 lines on that cluster and whose output is then preprocessed by almost all other discovered items.
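To illustrate why I suspect this scales so badly (a rough sketch of the general pattern, not of Zabbix's actual preprocessing code): every dependent item effectively matches its own pattern against the full 37113-line master value, so the total work grows roughly as items × lines; around 26000 items times 37113 lines is on the order of a billion line comparisons per collection cycle.

// Illustrative JavaScript only, not Zabbix internals: the difference between
// every dependent item rescanning the whole master value and scanning it once.

// Naive: each of the ~26000 dependent items walks all ~37000 lines.
function naiveLookup(lines, metricPrefix) {
    for (var i = 0; i < lines.length; i++) {
        if (lines[i].indexOf(metricPrefix) === 0) {
            return lines[i];
        }
    }
    return null;
}

// Indexed: walk the lines once per poll, then each item does a cheap map lookup.
function buildIndex(lines) {
    var index = {};
    for (var i = 0; i < lines.length; i++) {
        var family = lines[i].split(/[{ ]/)[0]; // metric family name
        (index[family] = index[family] || []).push(lines[i]);
    }
    return index;
}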

Possibly the preprocessing can be optimized? And/or the template itself?



 Comments   
Comment by Vladimir Povetkin [ 2025 Jan 20 ]

We have a similar issue: the preprocessing queue keeps growing for a K8s cluster with 30k items.

Comment by Mateusz Mazur [ 2025 Mar 24 ]

same problem here

Comment by Samuele Bianchi [ 2025 Mar 26 ]

I can confirm this problem using the Zabbix 7.0.10 release on Debian 12.
On a large cluster with 22 hosts and about 18490 items, the proxy's CPU and memory usage keep growing.
Here are some logs from zabbix-server.log:

 

I have increased the number of preprocessor instances to 30, but the problem is still present.
Using the Ubuntu proxy image instead of Alpine, the resources used seem to be lower for CPU but higher for RAM.

Comment by Artūras Kupčinskas [ 2025 Mar 27 ]

Hi,

First we changed the deployment and replicaset triggers in the template (because that data was getting stuck in preprocessing):

 

old:
min(/Kubernetes cluster state by HTTP without POD/kube.replicaset.replicas_mismatched{#NAMESPACE}/{#NAME},{$KUBE.REPLICA.MISMATCH.EVAL_PERIOD:"replicaset:{#NAMESPACE}:{#NAME}"})>0
and last(/Kubernetes cluster state by HTTP without POD/kube.replicaset.replicas_desired{#NAMESPACE}/{#NAME})>=0
and last(/Kubernetes cluster state by HTTP without POD/kube.replicaset.ready{#NAMESPACE}/{#NAME})>=0

new:
abs(min(/Kubernetes cluster state by HTTP without POD/kube.replicaset.replicas_desired{#NAMESPACE}/{#NAME},{$KUBE.REPLICA.MISMATCH.EVAL_PERIOD:"replicaset:{#NAMESPACE}:{#NAME}"}) - min(/Kubernetes cluster state by HTTP without POD/kube.replicaset.ready{#NAMESPACE}/{#NAME},{$KUBE.REPLICA.MISMATCH.EVAL_PERIOD:"replicaset:{#NAMESPACE}:{#NAME}"}))>0
and last(/Kubernetes cluster state by HTTP without POD/kube.replicaset.replicas_desired{#NAMESPACE}/{#NAME})>=0
and last(/Kubernetes cluster state by HTTP without POD/kube.replicaset.ready{#NAMESPACE}/{#NAME})>=0


old:
min(/Kubernetes cluster state by HTTP without POD/kube.deployment.replicas_mismatched{#NAMESPACE}/{#NAME},{$KUBE.REPLICA.MISMATCH.EVAL_PERIOD:"deployment:{#NAMESPACE}:{#NAME}"})>0
and last(/Kubernetes cluster state by HTTP without POD/kube.deployment.replicas_desired{#NAMESPACE}/{#NAME})>=0
and last(/Kubernetes cluster state by HTTP without POD/kube.deployment.replicas_available{#NAMESPACE}/{#NAME})>=0

new:
abs(min(/Kubernetes cluster state by HTTP without POD/kube.deployment.replicas_desired{#NAMESPACE}/{#NAME},{$KUBE.REPLICA.MISMATCH.EVAL_PERIOD:"deployment:{#NAMESPACE}:{#NAME}"}) - min(/Kubernetes cluster state by HTTP without POD/kube.deployment.replicas_available{#NAMESPACE}/{#NAME},{$KUBE.REPLICA.MISMATCH.EVAL_PERIOD:"deployment:{#NAMESPACE}:{#NAME}"}))>0
and last(/Kubernetes cluster state by HTTP without POD/kube.deployment.replicas_desired{#NAMESPACE}/{#NAME})>=0
and last(/Kubernetes cluster state by HTTP without POD/kube.deployment.replicas_available{#NAMESPACE}/{#NAME})>=0

 

We also created a second Helm values file to have two separate zabbix-proxy instances: we ran everything related to Pod discovery through one, and everything else through the other proxy. This solved the resource issue for us.

Comment by Vladislavs Sokurenko [ 2025 Jul 29 ]

Could you please be so kind as to provide, from Zabbix proxy and Zabbix server 7.0.17, the output of:
zabbix_server -R diaginfo="preprocessing"

zabbix_proxy -R diaginfo="preprocessing"
Then please provide what kind of preprocessing this item uses and, if possible, a sample of the data from the HTTP request so that we can reproduce it internally.

Comment by Mateusz Mazur [ 2025 Jul 29 ]

I will provide it after upgrading to 7.0.17 in a few days.

Comment by Genadijs Jeniceks [ 2025 Aug 06 ]

Hello, robinr, mateusz.mazur. We would like to get from you the full JSON object returned by the "Get state metrics" item so we can reproduce the issue. For this you should modify the JavaScript of the "Get state metrics" item in the "Kubernetes cluster state by HTTP" template. To follow best practices and avoid extra load, you can clone the template and remove all items and discovery rules except the "Get state metrics" item. Then modify the script by adding

Zabbix.log(3, 'ZBX-24504' + stateMetrics);

at line 144 (please check the attached pictures).
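In case the attached pictures are not visible here, a sketch of the intended placement (the variable name is the one from the line above; the exact line number may differ between template versions):

// Near the end of the "Get state metrics" script, after stateMetrics has been
// built and just before the script's existing return statement:
Zabbix.log(3, 'ZBX-24504' + stateMetrics);
// ...the rest of the script stays unchanged.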


Then create a new host with this template and execute the item. The whole JSON object we need can be found in the "/var/log/zabbix/zabbix_server.log" file. Run a command such as

cat /var/log/zabbix/zabbix_server.log | grep ZBX-24504 > get_state_metrics.json

and send us the "get_state_metrics.json" file.

Comment by Robin Roevens [ 2025 Sep 03 ]

gjeniceks: I have yet to migrate to Zabbix 7, so this is generated on Zabbix 6.4: get_state_metrics.json.xz
The log line was written on the zabbix-proxy pod, not on the server as you indicated, but this should be the entire output.

Comment by Andris Zeila [ 2025 Oct 15 ]

Released ZBX-24504 in:

  • pre-7.0.20rc1 63de44140d4
  • pre-7.4.4rc1 1bcb6d95fcb
  • pre-8.0.0alpha1 31d204ad107
Comment by Mateusz Mazur [ 2025 Oct 30 ]

Installed Zabbix proxy 7.0.20 and it's looking good. CPU usage dropped nicely on the Zabbix proxy.

And the queue on the Zabbix server looks good now.

Comment by Lasse Bang Mikkelsen [ 2025 Nov 24 ]

gjeniceks We've seen a great performance increase with 7.0.20, but CPU usage is still far too high. This is CPU usage on AMD EPYC 7502P @ 2.5 GHz.

Zabbix proxy 7.0.17:

Zabbix proxy 7.0.20:

Zabbix performance 7.0.17 vs. 7.0.20:

We're ready to help debug this issue if needed.

 

Comment by Vladislavs Sokurenko [ 2025 Nov 25 ]

Please send us the file with metrics as mentioned above and we can try to reproduce the issue.

Comment by Lasse Bang Mikkelsen [ 2025 Nov 26 ]

vso Please see attached diaginfo from Zabbix proxy 7.0.20: zabbix-proxy-7-0-20-diaginfo-preprocessing.txt. All items in both "Top.time_ms" and "Top.total_ms" are of the type "Namespace [namespace] ReplicaSet [replicaset]: Replicas mismatched". Let me know if you need further details.

Comment by Vladislavs Sokurenko [ 2025 Nov 28 ]

What was meant, as gjeniceks mentioned above, is to provide us with the metrics JSON, please. You can also check from the diaginfo which item this is and what preprocessing it has:
itemid:12938813 time_ms:3200

As it executes for too long, please provide the JavaScript for it.

Comment by Lasse Bang Mikkelsen [ 2025 Nov 28 ]

vso As the metrics are large (~9M) and not anonymized, can I e-mail them or upload them securely?

Comment by Vladislavs Sokurenko [ 2025 Nov 29 ]

It’s possible to send it to [email protected], or if it still does not fit I can provide instructions on how to upload it to us.

Comment by Lasse Bang Mikkelsen [ 2025 Nov 29 ]

vso I've e-mailed you the files, thanks!

Comment by Lasse Bang Mikkelsen [ 2025 Dec 08 ]

vso Let me know if you need more.

Comment by Vladislavs Sokurenko [ 2025 Dec 08 ]

Thank you very much for the provided information; the issue was reproduced and we are working on it.

Comment by Lasse Bang Mikkelsen [ 2026 Feb 04 ]

vso As this issue is closed, is the work tracked elsewhere? Thanks.

Comment by Vladislavs Sokurenko [ 2026 Feb 04 ]

Yes, there are some optimizations planned, I will keep you updated once there is more information.
