[ZBX-21866] Preprocessor workers getting stuck on certain items Created: 2022 Nov 04  Updated: 2022 Nov 07  Resolved: 2022 Nov 07

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Server (S)
Affects Version/s: 6.0.10
Fix Version/s: None

Type: Incident report Priority: Trivial
Reporter: Andrew Boling Assignee: Edgars Melveris
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Docker image: zabbix-server-pgsql:ubuntu-6.0.10
DB: PostgreSQL 13.7 / TimescaleDB 2.7.2


Issue Links:
Duplicate
duplicates ZBX-20590 preprocessing worker utilization Closed

 Description   

Steps to reproduce:

  1. Restart Zabbix server container
  2. Observe normal preprocessor performance (several thousand values processed per 5-second cycle of the preprocessing manager):

     preprocessing manager #1 [queued 99, processed 4804 values, idle 4.963593 sec during 5.019956 sec]
     preprocessing manager #1 [queued 0, processed 4426 values, idle 4.967512 sec during 5.017058 sec]
  3. Wait awhile...

Result:

After an unspecified amount of time, the preprocessing threads begin to run hot on CPU and process very few item values per 5-second cycle. Eventually the backlog becomes so great that all active agents begin alerting because none of their associated item values are being inserted into the database.

preprocessing manager #1 [queued 66660, processed 6 values, idle 2.757768 sec during 5.001833 sec]
preprocessing manager #1 [queued 133332, processed 5 values, idle 0.000000 sec during 5.003844 sec]
preprocessing manager #1 [queued 103065, processed 0 values, idle 0.000000 sec during 5.002716 sec]
# much later
preprocessing manager #1 [queued 64577, processed 21 values, idle 0.000000 sec during 5.001084 sec]
preprocessing manager #1 [queued 16933, processed 106003 values, idle 0.000000 sec during 5.000027 sec]
# and later still
preprocessing manager #1 [queued 327975, processed 1 values, idle 0.000000 sec during 5.001024 sec]

The low/zero idle time per cycle, combined with the small number of metrics being passed, seems to suggest a bottleneck in the preprocessor workers that is causing them to get stuck on specific items. I was not able to find anything in the Zabbix documentation that specifies how to troubleshoot the contents of the preprocessing queue, i.e. which items are taking so long to execute.

I'm happy to provide additional information if told what commands to run or where to look inside of the database.
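
The closest candidate I can see is the runtime control diagnostics, though it is not clear they identify the specific items. Assuming they behave as documented, something along these lines (run inside the server container) should dump preprocessing queue statistics to the server log and temporarily raise logging for the preprocessing processes; I have not confirmed how much per-item detail they expose:

docker exec <zabbix-server-container> zabbix_server -R diaginfo=preprocessing
docker exec <zabbix-server-container> zabbix_server -R log_level_increase="preprocessing manager"
docker exec <zabbix-server-container> zabbix_server -R log_level_decrease="preprocessing manager"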

Expected:

Preprocessing keeps pace with incoming values: the queue stays near zero and the manager is mostly idle each 5-second cycle, as in step 2 above.



 Comments   
Comment by Andrew Boling [ 2022 Nov 04 ]

I apologize for the trainwreck in formatting. It doesn't look like I have the ability to edit the original submission.

Immediately after restart:

preprocessing manager #1 [queued 99, processed 4804 values, idle 4.963593 sec during 5.019956 sec]
preprocessing manager #1 [queued 0, processed 4426 values, idle 4.967512 sec during 5.017058 sec]

After the problem begins:

preprocessing manager #1 [queued 66660, processed 6 values, idle 2.757768 sec during 5.001833 sec]
preprocessing manager #1 [queued 133332, processed 5 values, idle 0.000000 sec during 5.003844 sec]
preprocessing manager #1 [queued 103065, processed 0 values, idle 0.000000 sec during 5.002716 sec]
# much later
preprocessing manager #1 [queued 64577, processed 21 values, idle 0.000000 sec during 5.001084 sec]
preprocessing manager #1 [queued 16933, processed 106003 values, idle 0.000000 sec during 5.000027 sec]
# and later still
preprocessing manager #1 [queued 327975, processed 1 values, idle 0.000000 sec during 5.001024 sec]

Comment by Andrew Boling [ 2022 Nov 04 ]

Container configuration (DB and MIB details omitted):

ZBX_DBTLSCONNECT=required,
ZBX_ENABLE_SNMP_TRAPS=true,
ZBX_CACHESIZE=256M,
ZBX_HISTORYCACHESIZE=256M,
ZBX_HISTORYINDEXCACHESIZE=16M,
ZBX_HISTORYSTORAGEDATEINDEX=0,
ZBX_TRENDCACHESIZE=16M,
ZBX_VALUECACHESIZE=512M,
ZBX_PROXYCONFIGFREQUENCY=600,
ZBX_LOGSLOWQUERIES=3000,
ZBX_STARTPREPROCESSORS=40,
ZBX_STARTHTTPPOLLERS=20,
ZBX_STARTJAVAPOLLERS=5,
ZBX_STARTVMWARECOLLECTORS=6,
ZBX_JAVAGATEWAY_ENABLE=true,
ZBX_JAVAGATEWAY=192.168.2.1,
ZBX_JAVAGATEWAYPORT=10052,
ZBX_MAXHOUSEKEEPERDELETE=10000,
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin,
TERM=xterm,
ZBX_VERSION=6.0.10,
MIBS=+ALL
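
For context, in the official zabbix-server-pgsql image these ZBX_* variables are (as far as I understand the image's entrypoint) written into zabbix_server.conf, so the parameters most relevant to this report should end up roughly as:

StartPreprocessors=40
CacheSize=256M
HistoryCacheSize=256M
HistoryIndexCacheSize=16M
ValueCacheSize=512M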

Comment by Andrew Boling [ 2022 Nov 05 ]

Further troubleshooting revealed an interesting pattern. The preprocessing manager was running at 100% of a core, but only one of the 40 workers was similarly running at 100%; the other workers were pretty much idle. Examining the data passing through the busy worker revealed that it was very slowly iterating over data associated with an HTTP agent item in a custom template that feeds data into multiple dependent items.
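
Roughly how we spotted it (commands from memory; adjust the PID, and strace may need to be run from the host or a privileged debug container rather than inside the image):

ps -eo pid,pcpu,etime,args | grep '[p]reprocessing worker'   # one worker pinned near 100% CPU, the rest essentially idle
strace -fp <pid-of-busy-worker> -s 256                       # the same large JSON payload being read over and over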

Once the item was identified, functionality was restored within minutes by disabling that item on all hosts that were linked to the template. A few quick high-level observations regarding this item:

  • The item had a five-minute polling interval and was receiving large JSON responses from the /_nodes/stats API of multiple Elasticsearch clusters. In the heaviest case it was hitting a cluster so large that 25 MB response bodies were being generated. In hindsight this approach needs to be broken out into smaller queries (see the sketch after this list); the cluster has grown significantly since this monitoring was originally implemented. This custom monitoring still exists because, ironically, we have run into preprocessing issues when attempting to use the Elasticsearch template in the Zabbix project's git repo.
  • A member of the operations team had mistakenly left history storage enabled for this item while troubleshooting an incident. This meant that these large bodies were also being written to the database. That was doubtlessly making this even worse.
  • The straces showed that these large response bodies were being passed into the preprocessor worker at an interval shorter than five minutes. We assume this was because the preprocessing queue was backed up so far that the next payload was already waiting by the time the previous one had finished. It is unknown why these payloads kept arriving at this one worker back-to-back while the rest of the worker threads were under-utilized.
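
As an illustration of what "smaller queries" could look like (hypothetical; not what the template currently does), the nodes stats API accepts specific metric groups and the generic filter_path parameter, so the master item URL could be narrowed to something like:

/_nodes/stats/jvm,os,fs?filter_path=nodes.*.jvm.mem,nodes.*.os.cpu,nodes.*.fs.total

instead of pulling the full stats document for every node in the cluster.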

We are leaving this monitoring disabled for the weekend and I will begin slowly re-enabling it next week after making small performance improvements. I will provide any further insights into the issue we experienced as I work through this.

As noted previously, this is not the first time our environment has observed preprocessor bottlenecks of this nature when HTTP agent items are pointed at our Elasticsearch clusters. The main difference is that this is the first time we have experienced a bottleneck this severe while using our own templates instead of the template provided in the Zabbix repo. This problem has been observed both on the server that currently hosts the Zabbix server and within a development proxy we set up for this purpose inside a Kubernetes cluster (on separate hardware). Ubuntu Docker images were used in all cases; I do not believe we have attempted to reproduce this with the Alpine images yet.

Comment by Andrew Boling [ 2022 Nov 06 ]

I'm pretty sure that this is a duplicate of ZBX-20590.
