Steps to reproduce:
- Use "ClickHouse by HTTP" template for 15 hosts
- Upgrade zabbix server from 5.2 to 6.0.6
- Preprocessing queue grows uncontrollably, and values of unrelated items get stuck in the queue
Hi. I was close to just posting a comment in ZBX-20590, but since I'm not sure whether it is the same issue, I've decided to create a new report.
I performed the 5.2.7 -> 6.0.6 upgrade a couple of days ago and ran into trouble with the preprocessing queue: it started growing constantly.
- Preprocessing queue - growing uncontrollably
- Preprocessing manager utilization - slowly but steadily climbing toward 100%
- Preprocessing workers utilization - I had 10 workers initially, and their utilization did not rise above 40-50%. I did not try lowering the worker count, but I did try increasing it (up to 75-100) - that did not solve the issue.
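For reference, the worker count mentioned above is controlled by the standard StartPreprocessors parameter in zabbix_server.conf; this is a sketch of the change I tried (the exact comment wording is mine, and the server must be restarted for it to take effect):

```
### Option: StartPreprocessors
# Number of pre-forked preprocessing worker instances.
# Raised from 10 to 100 while debugging; it made no difference.
StartPreprocessors=100
```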
- The worst part: it looks like ALL items' values became stuck in the preprocessing queue. I'm not talking only about dependent items, or only about items generated by the ClickHouse template - I'm talking about any existing Zabbix items. Analyzing the output of "zabbix_server -R diaginfo=preprocessing" while the queue was full, the "top values" were "system.uptime" items from completely unrelated hosts, which the "Linux by Zabbix agent" template checks every 30s by default.
I've managed to figure out that if I disable the "ClickHouse by HTTP" template (it was applied to about 15 hosts before and caused no trouble with zabbix-server 5.2), the problem goes away. In the end, I've "solved" the issue by decreasing the master item's update interval in the ClickHouse template from 1 minute to 10 minutes. The preprocessing queue still spikes a bit every 10 minutes, but it drains quickly afterwards.
In ZBX-20590, the author mentions that his preprocessing workers reached 100% utilization - that is not my case. In my case, the preprocessing manager process looks like the bottleneck: it was unable to handle all the incoming data in real time and/or distribute it evenly among the available workers.
I did not check preprocessing workers' utilization per individual process, but I did check their cumulative CPU time ("TIME" in Linux "top" terminology). Only the first couple of worker processes had non-zero values; all the other workers were at zero. I think that's why raising the worker count from 10 to 100 made no difference.
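For reproducing the check above, a quick way to see per-worker CPU time without opening top is a sketch like this (the process title pattern assumes how Zabbix names its pre-forked workers and may vary slightly between versions):

```shell
# List cumulative CPU time (TIME) and command line for all
# zabbix_server preprocessing workers; the bracket trick keeps
# the grep process itself out of the results
ps -eo time,args | grep '[p]reprocessing worker'
```

With the issue present, I'd expect only the first one or two lines to show non-zero TIME values.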
I can provide any additional details (or try to perform any additional tests) you need, except for downgrading back to 5.x.