[ZBX-21317] High preprocessing manager load after upgrading from 5.2 to 6.0.6 Created: 2022 Jul 10  Updated: 2022 Nov 16  Resolved: 2022 Nov 14

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Server (S)
Affects Version/s: 6.0.6, 6.2.0
Fix Version/s: None

Type: Problem report Priority: Major
Reporter: Alexandr Paliy Assignee: Andrejs Sitals (Inactive)
Resolution: Cannot Reproduce Votes: 4
Labels: management, preprocessing
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File clickhouse.tables_example.txt    
Issue Links:
Sub-task
part of ZBX-20590 preprocessing worker utilization Closed
Sprint: Support backlog

 Description   

Steps to reproduce:

  1. Use "ClickHouse by HTTP" template for 15 hosts
  2. Upgrade zabbix server from 5.2 to 6.0.6
  3. Observe problems with the preprocessing queue; values of all other items also get stuck in the queue

-------------------

Hi. I was close to just posting a comment in ZBX-20590, but since I was unsure whether it is the same issue or not, I decided to create a new report.

 

I performed the 5.2.7 -> 6.0.6 upgrade a couple of days ago and ran into a problem with the preprocessing queue: it started to grow constantly.

Symptoms:

  • Preprocessing queue - growing uncontrollably
  • Preprocessing manager utilization - slowly but surely growing towards 100%
  • Preprocessing workers utilization - I had 10 workers initially, and utilization did not rise above 40-50%. I did not try lowering the worker count, but I did try increasing it (up to 75-100), which did not solve the issue.
  • The worst part: it looks like ALL items' values became stuck in the preprocessing queue. I'm not talking only about dependent items, or only about items generated by the ClickHouse template - I'm talking about any existing Zabbix item. Analyzing the output of "zabbix_server -R diaginfo=preprocessing" when the queue was full, the "top values" were values of "system.uptime" items from completely unrelated hosts, which the "Linux by Zabbix Agent" template checks every 30s by default.

 

I managed to figure out that if I disable the "ClickHouse by HTTP" template (it had been used for about 15 hosts before, and caused no trouble with zabbix-server 5.2), the problem goes away. In the end, I "solved" the issue by decreasing the master item check interval in the ClickHouse template from 1 minute to 10 minutes. The preprocessing queue still spikes a bit every 10 minutes, but it drains quickly afterwards.

 

In ZBX-20590, the author mentions that his preprocessing workers reached 100% utilization; that is not my case. In my case, it looks like the preprocessing manager process was the problem, being unable to handle all the incoming data in real time and/or split it evenly between all available workers.

I did not check preprocessing worker utilization per individual process, but I did check their "TIME" (in Linux "top" terminology). Only the first couple of worker processes had non-zero values; every other worker had zero. I think that is why raising the worker count from 10 to 100 made no difference.
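For reference, the skew can be seen without top as well. A generic sketch (the process titles shown are illustrative; they depend on your zabbix_server version and setup):

```shell
# List accumulated CPU time per preprocessing process; idle workers stay at 00:00:00.
# The [p] bracket trick keeps grep from matching its own command line.
ps -eo pid,time,args | grep '[p]reprocessing'
```

If only the first one or two workers show non-zero TIME, adding more workers will not help, because the manager is not distributing work to them.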

 

I can provide any additional details (or run any additional tests) you need, except for downgrading back to 5.x.



 Comments   
Comment by Edgar Akhmetshin [ 2022 Jul 11 ]

Hi,

Could you please give an example of the data retrieved by the master items for these hosts?

Also, if possible, set DebugLevel 4 for the preprocessing manager while the issue occurs, and provide the output from:

strace -c -w -f -p PID

Strace should be gathered during normal logging without Debug.
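One possible concrete invocation (a sketch; the PID lookup assumes the default zabbix_server process titles, and sudo/timeout usage depends on your environment):

```shell
# Find the preprocessing manager PID from its process title.
PID=$(ps -eo pid,args | awk '/preprocessing manager/ && !/awk/ {print $1; exit}')
# Summarize syscall counts and wall-clock latency (-w) across threads (-f) for ~30s.
sudo timeout 30 strace -c -w -f -p "$PID"
```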

Regards,
Edgar

Comment by Alexandr Paliy [ 2022 Jul 12 ]

I was about to post a couple of master item data examples when I finally realized what is wrong in my case.

As I initially mentioned, I have around 15 hosts with the ClickHouse template enabled. For most of them, the template creates about 80 items in total (I had only ever checked this number for the couple of topmost visible hosts and assumed it held for the rest, since it is the same template). Only now did I notice that there are actually 3 hosts with thousands of items (4.5k, 6k, 10.5k). These additional items come from the "Tables" discovery (part of "ClickHouse by HTTP"), which is enabled by default but works correctly on only 3 of my hosts, because its master item exceeds the request timeout (3 seconds by default) on all the others. I feel ashamed, sorry. This particular template was never "critical" in our case, so nobody ever touched the default settings or bothered to check whether all of its contents work 100% correctly.

 

Coming back to the "problem", though I'm not sure I still have the right to call it that:

  • Most of the "master items" work correctly on all of my hosts (that is, all master items except "clickhouse.tables"), and their values look like JSON documents of about 200 lines at most. The master item for the "Tables" discovery ("clickhouse.tables"), also JSON, has thousands of lines.
  • Initially, while trying to debug the problem with the "log_level_increase" runtime control option of zabbix_server, I definitely saw the contents of this LARGE JSON most of the time.

I assume these reasons are enough to ignore the other "master items" at this point and discuss only the large (and problematic) one, "clickhouse.tables". I've attached an example of its value: clickhouse.tables_example.txt (24k lines; it comes from the host with 10.5k items mentioned above). This master item currently has 10366 dependent items.
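For scale, a rough back-of-envelope using the numbers above (4.5k + 6k + 10.5k dependent items, master items polled every minute):

```shell
# Every 1-minute poll of the three large clickhouse.tables master items fans out
# into all their dependent items, so preprocessing receives roughly:
echo $(( (4500 + 6000 + 10500) / 60 ))   # -> 350 values per second, sustained
```

That is a sustained extra load from a single template, on top of everything else the server preprocesses.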

 

The question "But how did it work before the upgrade, then?" comes to mind. I'm not sure it was working before the upgrade at all. Master items do not store history by default in this template, so I can't check whether they even got their values (at least on these 3 hosts) before the upgrade. The history of discovered dependent items is stored for 7d by default, though, and the tables discovery has "Keep lost resources period" set to 30d.

  • Since all the other hosts (except these 3) currently have no dependent items for the "tables" master item (I've just checked it in the DB), it looks like the tables discovery did not work for them either before or after the upgrade.
  • Since the existing dependent items of this "tables" discovery (on the 3 hosts where it works) had no values before the Zabbix server upgrade (I checked a couple of discovered items), it looks like the discovery did not work before the upgrade on these 3 hosts either, but for some reason started working after it.

 

Since this is clearly a misconfiguration (or "lazy configuration") on my side, and I totally underestimated how many values my Zabbix server instance was trying to preprocess, I'm not sure it makes sense to discuss this issue further. On the other hand, no matter how bad or large the incoming data is, I'm not sure it is "expected behavior" for preprocessing manager utilization to go to 100% and cause problems for all existing items while the preprocessing workers are still not fully utilized. Please let me know whether it makes sense for me to run the "strace" you mentioned at this point, or whether this report should just be closed instead.

Comment by Andrew Boling [ 2022 Nov 06 ]

I'm not a Zabbix dev, but I'm pretty sure you are in fact discussing the same issue that is described in ZBX-20590. The problem is slightly confusing to users because even though one worker is stalling the monitoring system, it gets represented as 100% on the "Utilization of preprocessing manager internal processes, in %" item in the "Zabbix server health" template. It isn't apparent that the problem is a single worker unless you start looking at the CPU utilization of the individual worker threads and debug the saturated worker with logging/strace.

Comment by Andrejs Sitals (Inactive) [ 2022 Nov 14 ]

I'm not able to get "preprocessing manager utilization goes to 100%" on Zabbix 6.0.6 by just sending the attached JSON to clickhouse.tables items on multiple hosts. Although preprocessing takes ages, zabbix[process,preprocessing manager,avg,busy] stays well below 1%.

As for the performance issues, there are a few optimizations to JSONPath preprocessing steps (ZBXNEXT-8009, ZBXNEXT-8040) which will be available in 6.0.11; they are already available in 6.0.11rc1. These optimizations help when there are a lot of dependent items whose first preprocessing step is a JSONPath with a "simple enough" expression. I don't know the exact requirements for the expression, but it should work for most, if not all, Zabbix templates.

For comparison - when sending the attached JSON to the clickhouse.tables item on a single host 10 times in a row, it takes almost 10 minutes to process on 6.0.6, and roughly 15 seconds on 6.0.11rc1.
