- Incident report
- Resolution: Duplicate
- Trivial
- None
- 6.4.5
- None
This issue is very similar to:
- ZBX-20590 preprocessing worker utilization
- ZBX-23012 Slow query from LLD worker
NOTE: The following only describes how my team found out about this issue in our specific use case, but the issue is wider than the specific templates I have mentioned.
We are currently using the Kubernetes templates to discover pods etc. in some of our clusters. These clusters are quite big, and some nodes (call them batch job nodes) are constantly redeployed, so new pods are created and discovered all the time. As a result, we ended up with some hosts with 28,000+ items.
We then discovered that the LLD queue was growing forever and never cleared. We checked our config and saw that we have 20+ LLD workers configured, but only 1-2 workers were actually doing work, pinning CPU threads at 100%. After some debugging to see which items the LLD workers were actually working on, it turned out to be the Kubernetes pod related ones.
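For anyone wanting to observe the same symptom, the relevant server config parameter and stock internal items are roughly the following (the value shown for StartLLDProcessors reflects our 20+ workers; thresholds are up to you):

    # zabbix_server.conf - number of LLD worker processes
    StartLLDProcessors=20

    # Internal items to watch the symptom:
    zabbix[lld_queue]                      # values waiting in the LLD queue
    zabbix[process,lld worker,avg,busy]    # average busy % of the LLD workers

    # Runtime control for per-rule LLD diagnostics:
    zabbix_server -R diaginfo=lld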
Unfortunately, kube-state-metrics doesn't give us a way to filter pods by host/annotation/node group etc., so we are working on an internal solution for that.
But the point of this issue is that the LLD workers exhibit the same behaviour as described in ZBX-20590 (preprocessing worker utilization), where the preprocessing workers pin the CPU and the queue keeps growing.
The "quick" fix is (again I am just assuming here) to NOT rely on a parent item for every discovery and instead perform the discovery every time for each discovery item. That way, a different thread should pick up the job. Yes, it means significantly more calls to the initial API/target (in our case, kube-state-metrics) but in my case I don't mind/I am happy to compromise that to ensure the queue actually gets cleared.
Happy to provide further details/screenshots where needed.
- duplicates ZBX-23012 Slow query from LLD worker (Closed)