[ZBX-24504] Proxy monitoring kubernetes cluster is too resource hungry Created: 2024 May 17 Updated: 2025 Mar 27 |
|
Status: | Confirmed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Templates (T) |
Affects Version/s: | 6.4.13, 6.4.14 |
Fix Version/s: | None |
Type: | Problem report | Priority: | Major |
Reporter: | Robin Roevens | Assignee: | Zabbix Integration Team |
Resolution: | Unresolved | Votes: | 5 |
Labels: | Proxy, Template, helm, kubernetes | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified | ||
Environment: |
SLES 15 SP5 |
Attachments: |
Issue Links: |
|
Description |
We have set up a few K3s clusters, monitored by Zabbix using the Zabbix-provided Kubernetes templates (updated to the latest available from git.zabbix.com, tag 6.4.14) and Helm charts. Most clusters do not yet have much workload and are monitored without problems using these resource limits on the proxy pod in the cluster:
However, one cluster (also 3 control-plane nodes and 4 workers) has a much higher load: 192 deployments, 523 replicasets (most of them 'old', inactive sets that are kept by Kubernetes for deployment rollback purposes) and 221 pods. I don't think this is a particularly big load for a Kubernetes cluster yet, and about 120 additional projects are scheduled to be deployed on that cluster, so the load will only grow.

I have already altered the Kubernetes cluster state template to skip replicasets with 0 replicas, which filters out the inactive replicasets and prevents about 1500 extra items from being discovered. This currently results in Zabbix trying to monitor +/- 26000 items on 29 hosts for that cluster. This, however, proves very challenging for the proxy on that cluster: the 3 default preprocessors are 100% busy and the proxy crashes regularly due to a lack of available preprocessor workers. So we started scaling up the proxy. Currently it is set to:

I could still raise the allowed resources, preprocessors and pollers, but I think the resources consumed just for monitoring are getting way out of proportion. It also strikes me that with only the defaults the proxy can handle 18000 items, yet just 8000 more items suddenly require over 60 additional preprocessors and at least 3 times the default resource limits. This looks to me like there is some tipping point beyond which resource requirements grow exponentially for only a little more data.

The culprit seems to be the 'Get state metrics' item of the Kubernetes cluster state by HTTP template, which currently returns 37113 lines on that cluster and is then preprocessed by almost all other discovered items. Possibly the preprocessing can be optimized? And/or the template itself? |
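For reference, when the proxy is deployed through the official Helm chart, the sizing knobs are the pod resources and the proxy container's ZBX_* environment variables. The values override below is only a hedged sketch: the `zabbixProxy.resources` / `zabbixProxy.env` keys are assumed from recent chart versions, and all numbers are placeholders rather than the limits actually applied on these clusters.

```yaml
# Illustrative only -- key names assume the chart exposes zabbixProxy.resources
# and zabbixProxy.env; the numbers are placeholders, not the values used here.
zabbixProxy:
  resources:
    requests:
      cpu: "1"
      memory: 1Gi
    limits:
      cpu: "2"
      memory: 2Gi
  env:
    - name: ZBX_STARTPREPROCESSORS   # default is 3 preprocessing workers
      value: "16"
    - name: ZBX_STARTPOLLERS
      value: "10"
    - name: ZBX_CACHESIZE
      value: "128M"
```

In the official proxy container images these ZBX_* variables map to the corresponding zabbix_proxy.conf parameters (StartPreprocessors, StartPollers, CacheSize).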
Comments |
Comment by Vladimir Povetkin [ 2025 Jan 20 ] |
We have a similar issue: the preprocessor queue keeps growing for a K8S cluster with 30k items. |
Comment by Mateusz Mazur [ 2025 Mar 24 ] |
Same problem here. |
Comment by Samuele Bianchi [ 2025 Mar 26 ] |
I can confirm this problem using the Zabbix 7.0.10 release on Debian 12.
I have increased the number of preprocessing instances to 30, but the problem is still present. |
Comment by Artūras Kupčinskas [ 2025 Mar 27 ] |
Hi. First we changed the deployment and replicaset triggers on the template (because that data was stuck in preprocessing):
Replicaset trigger:

old:
```
and last(/Kubernetes cluster state by HTTP without POD/kube.replicaset.replicas_desired{#NAMESPACE}/{#NAME})>=0
and last(/Kubernetes cluster state by HTTP without POD/kube.replicaset.ready{#NAMESPACE}/{#NAME})>=0
```

new:
```
abs(min(/Kubernetes cluster state by HTTP without POD/kube.replicaset.replicas_desired{#NAMESPACE}/{#NAME},{$KUBE.REPLICA.MISMATCH.EVAL_PERIOD:"replicaset:{#NAMESPACE}:{#NAME}"})
 - min(/Kubernetes cluster state by HTTP without POD/kube.replicaset.ready{#NAMESPACE}/{#NAME},{$KUBE.REPLICA.MISMATCH.EVAL_PERIOD:"replicaset:{#NAMESPACE}:{#NAME}"}))>0
and last(/Kubernetes cluster state by HTTP without POD/kube.replicaset.replicas_desired{#NAMESPACE}/{#NAME})>=0
and last(/Kubernetes cluster state by HTTP without POD/kube.replicaset.ready{#NAMESPACE}/{#NAME})>=0
```

Deployment trigger:

old:
```
min(/Kubernetes cluster state by HTTP without POD/kube.deployment.replicas_mismatched{#NAMESPACE}/{#NAME},{$KUBE.REPLICA.MISMATCH.EVAL_PERIOD:"deployment:{#NAMESPACE}:{#NAME}"})>0
```

new:
```
:{#NAME}"}))>0
```
We also created a second Helm values YAML to run two separate zabbix-proxy instances. We routed everything related to Pod discovery through one proxy and everything else through the other. This solved the resource issue for us. |
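For illustration, such a split could look like two values files for the same chart, with the monitored hosts then assigned to the respective proxy on the Zabbix server side. The `zabbixProxy` keys, the ZBX_HOSTNAME / ZBX_STARTPREPROCESSORS variables, and all names and numbers below are assumptions, not the exact configuration used above.

```yaml
# values-proxy-pods.yaml (hypothetical) -- release handling only the host(s)
# carrying the pod discovery templates
zabbixProxy:
  env:
    - name: ZBX_HOSTNAME
      value: k8s-proxy-pods        # proxy name as registered on the Zabbix server
    - name: ZBX_STARTPREPROCESSORS
      value: "16"
```

```yaml
# values-proxy-state.yaml (hypothetical) -- release handling everything else,
# including the large 'Get state metrics' master item
zabbixProxy:
  env:
    - name: ZBX_HOSTNAME
      value: k8s-proxy-state
    - name: ZBX_STARTPREPROCESSORS
      value: "8"
```

Each release then registers as a separate proxy on the Zabbix server, and the pod-discovery host is set to be monitored by one proxy while the remaining hosts are monitored by the other, so the two preprocessing workloads no longer compete for the same workers.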