[ZBX-21317] High preprocessing manager load after upgrading from 5.2 to 6.0.6 Created: 2022 Jul 10 Updated: 2022 Nov 16 Resolved: 2022 Nov 14 |
|
| Status: | Closed |
| Project: | ZABBIX BUGS AND ISSUES |
| Component/s: | Server (S) |
| Affects Version/s: | 6.0.6, 6.2.0 |
| Fix Version/s: | None |
| Type: | Problem report | Priority: | Major |
| Reporter: | Alexandr Paliy | Assignee: | Andrejs Sitals (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 4 |
| Labels: | management, preprocessing | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Sprint: | Support backlog | ||||||||
| Description |
|
Steps to reproduce:
-------------------
Hi. I was pretty close to just posting a comment in ZBX-20590 but, being unsure whether it is the same issue or not, I decided to create a new report.
I upgraded from 5.2.7 to 6.0.6 a couple of days ago and ran into trouble with the preprocessing queue: it started to grow constantly. Symptoms:
I've managed to figure out that if I disable the "ClickHouse by HTTP" template (it was used for about 15 hosts before and did not cause any trouble with zabbix-server 5.2), the problem goes away. In the end, I "solved" the issue by decreasing the check rate of the master item in the ClickHouse template from once per minute to once every 10 minutes. The preprocessing queue still spikes a bit every 10 minutes, but it clears quickly afterwards.
In ZBX-20590, the author mentions that his preprocessing workers reached 100% utilization; this is not my case. In my case, it looks like the preprocessing manager process was the problem: it was not able to handle all the incoming data in real time and/or split it equally between all available workers. I did not check preprocessing worker utilization per individual process, but I did check "TIME" (in Linux "top" terminology) for them. Only the first couple of worker processes had non-zero values; every other worker had zero. I think that's why raising the worker count from 10 to 100 did not make any difference.
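For anyone who wants to repeat the per-worker "TIME" check described above, here is a minimal Python sketch. It parses text in the shape produced by `ps -eo pid,time,args` and reports the cumulative CPU time of each preprocessing worker. The assumption that worker process titles contain the text "preprocessing worker #N" is mine and should be verified against the actual process list; the function names are hypothetical.

```python
import re
from typing import Dict, List

# Assumption: worker process titles contain "preprocessing worker #N".
WORKER_RE = re.compile(r"preprocessing worker #(\d+)")


def cpu_seconds(time_field: str) -> int:
    """Convert a ps TIME value in [DD-]HH:MM:SS form to seconds."""
    days = 0
    if "-" in time_field:
        day_part, time_field = time_field.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in time_field.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def worker_cpu_times(ps_output: str) -> Dict[int, int]:
    """Map worker number -> cumulative CPU seconds, from `ps -eo pid,time,args` text."""
    times: Dict[int, int] = {}
    for line in ps_output.splitlines():
        match = WORKER_RE.search(line)
        if match is None:
            continue  # header line or unrelated process
        _pid, time_field, _args = line.split(None, 2)
        times[int(match.group(1))] = cpu_seconds(time_field)
    return times


def idle_workers(times: Dict[int, int]) -> List[int]:
    """Return the worker numbers that have accumulated no CPU time at all."""
    return sorted(number for number, secs in times.items() if secs == 0)
```

If most workers show up in `idle_workers`, that matches the observation above that only the first couple of workers were doing any work.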
I can provide any additional details (or try to perform any additional tests) you need, except for downgrading back to 5.x. |
| Comments |
| Comment by Edgar Akhmetshin [ 2022 Jul 11 ] |
|
Hi, Could you please give some examples of the data retrieved by the master items for these hosts? Also, if possible, capture DebugLevel 4 logs for the preprocessing manager while the issue is occurring, along with the output of: strace -c -w -f -p PID The strace output should be gathered at the normal log level, without debug enabled. Regards, |
| Comment by Alexandr Paliy [ 2022 Jul 12 ] |
|
I was pretty close to posting a couple of master item data examples when I finally realized what is wrong in my case. As I initially mentioned, I have around 15 hosts with the ClickHouse template enabled. For most of them, this template creates about 80 items in total (I had only ever checked this number for the topmost couple of visible hosts and assumed it held for the rest too, since... well, it's the same template). Only now have I noticed that there are actually 3 hosts with thousands of items (4.5k, 6k, 10.5k). These additional items come from the "Tables" discovery (part of the "ClickHouse by HTTP" template), and this discovery, being enabled by default, works correctly on only 3 of my hosts because... its master item exceeds the request timeout (3 seconds by default) on the rest of the hosts. I feel ashamed, sorry
Coming back to the "problem", though I'm not sure I still have the right to call it that:
I assume these reasons are enough to ignore the other master item values at this point and discuss only the large (and problematic) one, "clickhouse.tables". I've attached an example of its value: clickhouse.tables_example.txt
The question "But how did it work before the upgrade, then?" comes to mind. I'm not sure it was working before the upgrade at all. Master items do not store history by default in this template, so I can't check whether they even got their values (at least for these 3 hosts) before the upgrade. The history for discovered dependent items is stored for 7d by default, though, and the tables discovery has "Keep lost resources period" set to 30d.
Since it's, clearly, a misconfiguration (or, "lazy configuration" |
| Comment by Andrew Boling [ 2022 Nov 06 ] |
|
I'm not a Zabbix dev, but I'm pretty sure you are in fact discussing the same issue that is described in |
| Comment by Andrejs Sitals (Inactive) [ 2022 Nov 14 ] |
|
I'm not able to reproduce "preprocessing manager utilization goes to 100%" on Zabbix 6.0.6 by just sending the attached JSON to clickhouse.tables items on multiple hosts. Although preprocessing takes ages, zabbix[process,preprocessing manager,avg,busy] stays well below 1%. As for the performance issues, there are a few optimizations to JSONPath preprocessing steps ( For comparison: when sending the attached JSON to the clickhouse.tables item on a single host 10 times in a row, it takes almost 10 minutes to process on 6.0.6, and roughly 15 seconds on 6.0.11rc1. |
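To put the timings above next to the reporter's symptoms, here is a rough back-of-the-envelope sketch in Python. It assumes, purely for illustration, that values are processed one after another and that cost scales linearly with the number of values. The host count of 3 comes from the reporter's comment, and "almost 10 minutes" and "roughly 15 seconds" are rounded to 600 and 15 seconds. The function names are hypothetical, and real payloads on other hosts may differ from the attached example.

```python
def per_value_seconds(total_seconds: float, values: int) -> float:
    """Average preprocessing time for one master item value."""
    return total_seconds / values


def backlog_growth_per_interval(per_value_s: float, hosts: int, interval_s: float) -> float:
    """Seconds of unprocessed work added during each update interval.

    Assumption: values are processed serially, so each interval brings
    per_value_s * hosts seconds of work. Returns 0.0 when processing keeps up.
    """
    return max(0.0, per_value_s * hosts - interval_s)


# Figures from the comment above: 10 values on a single host.
old = per_value_seconds(600, 10)  # 6.0.6, "almost 10 minutes"
new = per_value_seconds(15, 10)   # 6.0.11rc1, "roughly 15 seconds"

print("approximate speedup:", old / new)
print("6.0.6, 3 hosts, 1m interval:", backlog_growth_per_interval(old, 3, 60))
print("6.0.6, 3 hosts, 10m interval:", backlog_growth_per_interval(old, 3, 600))
print("6.0.11rc1, 3 hosts, 1m interval:", backlog_growth_per_interval(new, 3, 60))
```

Under these assumptions, a 1-minute interval on 6.0.6 adds more work per interval than can be processed, while a 10-minute interval does not. That is consistent with the queue growing constantly at first and only spiking briefly after the interval was raised.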