[ZBX-21866] Preprocessor workers getting stuck on certain items Created: 2022 Nov 04 Updated: 2022 Nov 07 Resolved: 2022 Nov 07 |
|
| Status: | Closed |
| Project: | ZABBIX BUGS AND ISSUES |
| Component/s: | Server (S) |
| Affects Version/s: | 6.0.10 |
| Fix Version/s: | None |
| Type: | Incident report | Priority: | Trivial |
| Reporter: | Andrew Boling | Assignee: | Edgars Melveris |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Docker image: zabbix-server-pgsql:ubuntu-6.0.10 |
||
| Issue Links: |
|
| Description |
|
Steps to reproduce: unclear; the problem develops on its own after the server has been running for some time.
Result: After an unspecified amount of time, the preprocessing threads begin to run hot on CPU and process very few item values per 5 second cycle. Eventually the backlog becomes so great that all active agents begin alerting because none of their associated items are being inserted into the database.
The low/zero idle time per cycle, combined with the small number of metrics being processed, suggests a bottleneck in the preprocessor workers that is causing them to get stuck on specific items. I was not able to find anything in the Zabbix documentation that explains how to troubleshoot the contents of the preprocessing queue, i.e. which items are the ones taking so long to execute. I'm happy to provide additional information if told what commands to run or where to look inside of the database.
Expected: Preprocessing keeps up with the incoming value stream and item values continue to be written to the database. |
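One way to see which items the workers are stuck on is the server's runtime diagnostic control option, which dumps preprocessing queue statistics (including a per-item top list) to the server log. A minimal sketch, assuming the official docker image; the container name used here is a placeholder:

    # ask the running server to dump preprocessing diagnostics
    docker exec zabbix-server-container zabbix_server -R diaginfo=preprocessing
    # the report is appended to the server log
    docker logs --tail 100 zabbix-server-container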
| Comments |
| Comment by Andrew Boling [ 2022 Nov 04 ] |
|
I apologize for the trainwreck in formatting. It doesn't look like I have the ability to edit the original submission.
Immediately after restart:
preprocessing manager #1 [queued 99, processed 4804 values, idle 4.963593 sec during 5.019956 sec]
After the problem begins:
preprocessing manager #1 [queued 66660, processed 6 values, idle 2.757768 sec during 5.001833 sec] |
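The same numbers are also exposed as internal items, which makes it possible to graph the queue growth and worker saturation instead of watching process titles; for example:

    zabbix[preprocessing_queue]                      # values waiting to be preprocessed
    zabbix[process,preprocessing manager,avg,busy]   # % busy time of the manager
    zabbix[process,preprocessing worker,avg,busy]    # average % busy time across workers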
| Comment by Andrew Boling [ 2022 Nov 04 ] |
|
Container configuration (DB and MIB details omitted): ZBX_DBTLSCONNECT=required, |
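For reference, the official zabbix-server-pgsql images translate ZBX_*-prefixed environment variables into zabbix_server.conf parameters, so a setup along these lines would produce the configuration quoted above (hostnames and credentials are placeholders; the 40 preprocessing workers match the worker count mentioned in a later comment):

    docker run -d --name zabbix-server-container \
      -e DB_SERVER_HOST=<postgres-host> \
      -e POSTGRES_USER=<user> \
      -e POSTGRES_PASSWORD=<password> \
      -e ZBX_DBTLSCONNECT=required \
      -e ZBX_STARTPREPROCESSORS=40 \
      zabbix/zabbix-server-pgsql:ubuntu-6.0.10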
| Comment by Andrew Boling [ 2022 Nov 05 ] |
|
Further troubleshooting revealed an interesting pattern. The preprocessing manager was running at 100% of a core, but only one of the 40 workers was similarly running at 100%. The other workers were essentially idle. Examining the data passing through the busy worker thread revealed that it was very slowly iterating over data associated with an HTTP agent item in a custom template that was feeding data into multiple dependent items. Once the item was identified, functionality was restored within minutes by disabling that item on all hosts that were linked to the template. A few quick observations regarding this item:
We are leaving this monitoring disabled for the weekend, and I will begin slowly re-enabling it next week after making small performance improvements. I will provide any further insights into the issue we experienced as I work through this. As noted previously, this is not the first time our environment has observed preprocessor bottlenecks of this nature when HTTP agent items are pointed at our Elasticsearch clusters. The main difference is that this is the first time we have experienced a bottleneck this severe while using our own templates instead of the template provided in the Zabbix repo. This problem has been observed both on the server that currently hosts the Zabbix server and within a development proxy we set up for this purpose that lived within a Kubernetes cluster (on separate hardware). Ubuntu docker images were used in all cases, and I do not believe we have attempted to reproduce this with the Alpine images yet. |
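The per-worker diagnosis described above can be reproduced with standard tools; a rough sketch (the PID is whatever the hot worker turns out to be, and inside a container these may need to be run via docker exec or on the host):

    # per-process CPU for the preprocessing workers; the stuck one stands out
    ps -o pid,pcpu,etime,args -C zabbix_server | grep 'preprocessing worker'
    # watch what the hot worker is reading and writing
    strace -p <worker-pid> -s 256 -e trace=read,write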
| Comment by Andrew Boling [ 2022 Nov 06 ] |
|
I'm pretty sure that this is a duplicate of |