[ZBX-24311] Slow queries that lock history syncers when old metrics are pushed to Zabbix server trappers
Created: 2024 Apr 04 | Updated: 2025 Aug 25 | Resolved: 2025 Aug 09
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Server (S) |
Affects Version/s: | 6.4.12 |
Fix Version/s: | 7.0.18rc1, 7.4.2rc1, 8.0.0alpha1 (master) |
Type: | Problem report | Priority: | Minor |
Reporter: | Ryan Eberly | Assignee: | Vladislavs Sokurenko |
Resolution: | Fixed | Votes: | 2 |
Labels: | None |
Remaining Estimate: | Not Specified |
Time Spent: | 26h |
Original Estimate: | Not Specified |
Attachments: |
Issue Links: |
Team: |
Sprint: | Prev.Sprint |
Story Points: | 1 |
Description |
Create thousands of hosts with hundreds of items each, with history and trend storage enabled. Push metrics via trapper to the hosts with current timestamps on the items. Wait 3 or 4 hours. Then push more metrics to many of the items across many of the hosts with timestamps (clock) from 3 or 4 hours ago. This generates one or more slow queries like so:
slow query: 833.4347 sec, "select itemid,num,value_min,value_avg,value_max from trends where clock=<TIME_FROM_3-4_HOURS_AGO> and (itemid in (1,2,3,4,5,6,7....) or itemid in (100,101,102,103....) or itemid in (....) or itemid in (...)) order by itemid,clock;"
You can also observe several other slow queries occurring before and after the select query, like so:
slow query: 300.56674 sec, "update trends_uint set num=3,value_min=100,value_avg=100,value_max=100 where itemid=123456 and clock=<TIME_FROM_3-4_HOURS_AGO>;"
Depending on the timestamps of the metrics being pushed to the trappers, the queries above can repeat a few times for each trends hour that Zabbix is trying to update. I understand why these queries run, but they hold locks that block the history syncers, which are then unable to sync any other data. Is there a better way to parallelize this work to prevent history syncer utilization from hitting 100% and to allow the other history syncers to continue processing? |
Comments |
Comment by Vladislavs Sokurenko [ 2024 Apr 10 ] |
Thank you for your report. Yes, it should be possible to move trend updates from the history syncer to another process so that the history syncer wouldn't get blocked. |
Comment by Ryan Eberly [ 2024 Jul 31 ] |
Reopening. The issue that supposedly fixed this was resolved in 6.4.8, but the version we're running is 6.4.12 and the long-running queries still happen. My specific issue is not fixed. |
Comment by Ryan Eberly [ 2025 Mar 21 ] |
vso - Do any of the fixes in |
Comment by Vladislavs Sokurenko [ 2025 Mar 21 ] |
If TimescaleDB is used it might help because of upserts. |
Comment by Ryan Eberly [ 2025 Jul 23 ] |
vso - I was looking at this issue again on one of my very large deployments, which ingests about 7000 NVPS and runs Zabbix 7.2.7 on MySQL 8.0.45. I believe the range query I referenced above as taking very long has more to do with the size/length of the query causing the optimizer to run a full table scan instead of loading the ranges into memory. By default this value in MySQL is set too low:

range_optimizer_max_mem_size = 8388608

If I set the value to 256 MB, the extremely long range queries complete very fast:

range_optimizer_max_mem_size = 268435456
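For reference, a minimal sketch of the workaround described above (the 256 MB value is simply the one quoted in this comment, not an official recommendation; adjust for your deployment):

```sql
-- Raise the range optimizer memory limit at runtime (MySQL 8.0; applies to new sessions).
SET GLOBAL range_optimizer_max_mem_size = 268435456;  -- 256 MB, up from the 8 MB default

-- Confirm the new value.
SHOW GLOBAL VARIABLES LIKE 'range_optimizer_max_mem_size';

-- To persist across restarts, add the same value under [mysqld] in my.cnf:
--   range_optimizer_max_mem_size = 268435456
```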
The range queries with the long IN and OR chains are seen particularly during trend synchronization at server shutdown, as well as occasionally while running - particularly when a large number of metrics is ingested, as described in the initial ticket description. My thought is that zabbix_server should break these range queries up into smaller chunks (as sketched below), use UNIONs, or possibly the Zabbix manual should recommend strongly considering setting range_optimizer_max_mem_size higher than the default of 8 MB. What do you think? |
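For illustration only (this is not what zabbix_server currently generates), the chunking idea could look roughly like the following, with each branch capped at a fixed number of itemids so that every piece stays well within the optimizer's memory budget:

```sql
-- Hypothetical chunked form of the trends lookup: several small range scans
-- (shown here as UNION ALL branches) instead of one statement with ~92 OR'ed IN() lists.
SELECT itemid, num, value_min, value_avg, value_max
  FROM trends
 WHERE clock = 1753304400
   AND itemid IN (1, 2, 3 /* ... up to ~1000 ids ... */)
UNION ALL
SELECT itemid, num, value_min, value_avg, value_max
  FROM trends
 WHERE clock = 1753304400
   AND itemid IN (1001, 1002, 1003 /* ... next ~1000 ids ... */);
-- Alternatively, issue the chunks as independent statements and merge the results
-- in the application, as suggested above.
```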
Comment by Ryan Eberly [ 2025 Jul 23 ] |
One more remark (though I know you don't work on the UI!) - the same kind of range query with long IN and OR chains is present in some UI code as well, specifically in CHistoryManager::getAggregatedValues(). These queries would benefit from being chunked into smaller queries whose results are appended to the result array. Otherwise, certain UI queries, trend synchronization, etc. can take several minutes to complete. |
Comment by Vladislavs Sokurenko [ 2025 Jul 23 ] |
Fix regarding upserts in |
Comment by Ryan Eberly [ 2025 Jul 23 ] |
vso - yes, I can try a patch as long as it's for 7.2. I haven't done all my testing for 7.4 yet. I presume this part of the C code is mostly the same across versions 6.0, 7.0, 7.2 and 7.4, though ..
With a patch I presume you would also recommend setting the range_optimizer_max_mem_size back to the MySQL default of 8MB? |
Comment by Ryan Eberly [ 2025 Jul 23 ] |
I build from sources, so a patch file showing diffs or a binary of zabbix_server works - whatever is easier. If upserts for MySQL work, then this will be a huge improvement for very large instances! |
Comment by Vladislavs Sokurenko [ 2025 Jul 23 ] |
There is already an existing branch for MySQL, please check https://git.zabbix.com/projects/ZBX/repos/zabbix/pull-requests/10169/overview |
Comment by Ryan Eberly [ 2025 Jul 23 ] |
Thanks - I'll implement it today and let it run for a few days. I have a second server getting the same data that I will leave unpatched, with the default 8 MB range_optimizer_max_mem_size MySQL setting. |
Comment by Ryan Eberly [ 2025 Jul 24 ] |
vso - A small update for you. I encountered more of the slow range queries overnight, as well as some "INSERT INTO" statements. The order in which I see this occurring (based on the zabbix_server.log at DebugLevel 3) was:
```
1949940:20250723:230159.338 slow query: 300.618915 sec, "insert into trends_uint (itemid,clock,num,value_min,value_avg,value_max) values (<values>) on duplicate key update num=VALUES(num),value_min=VALUES(value_min),value_avg=VALUES(value_avg),value_max=VALUES(value_max);"
1949934:20250723:230159.338 slow query: 300.619411 sec, "insert into trends_uint (itemid,clock,num,value_min,value_avg,value_max) values (<values>) on duplicate key update num=VALUES(num),value_min=VALUES(value_min),value_avg=VALUES(value_avg),value_max=VALUES(value_max);"
1949939:20250723:230159.338 slow query: 300.573360 sec, "insert into trends_uint (itemid,clock,num,value_min,value_avg,value_max) values (<values>) on duplicate key update num=VALUES(num),value_min=VALUES(value_min),value_avg=VALUES(value_avg),value_max=VALUES(value_max);"
1949933:20250723:230159.338 slow query: 300.466750 sec, "insert into trends_uint (itemid,clock,num,value_min,value_avg,value_max) values (<values>) on duplicate key update num=VALUES(num),value_min=VALUES(value_min),value_avg=VALUES(value_avg),value_max=VALUES(value_max);"
1949936:20250723:230202.979 slow query: 420.466750 sec, "select itemid,num,value_min,value_avg,value_max from trends where clock=1753304400 and (itemid in (...,...,...) or itemid in (...,...,...) or itemid in (...,...,...))"
```

The select query above seems to be the culprit and appears to be locking either the database or the other 4 history syncers performing the INSERTs. I was fortunate enough to catch one of these queries in progress today and did the following:
```sql
mysql> SET GLOBAL range_optimizer_max_mem_size = 268435456;
mysql> KILL QUERY <thread_id of the "select itemid,num,value_min,value_avg,value_max from trends" query>;
/* the command above triggered a rollback */
```

The select appeared to run again and completed in milliseconds with the new range_optimizer_max_mem_size. Once that select statement finished, the other 4 history syncer processes completed their INSERTs.

I counted the number of chained lists (each 'or itemid in (val1,val2,val3,...)') in some of the select statements: there were 92 of them, with 1000 itemids in each list (approximately 92,000 itemids total). I think this is too expensive for the default range optimizer setting in MySQL, and it seems to cause the history syncer select queries to fall back to a full table scan.

For small Zabbix instances this probably won't become an issue, because there won't be a scenario where tens of thousands or hundreds of thousands of itemids are queried at once. But for large instances, especially ones where custom code processes files containing metrics and then pushes them to the trappers, this scenario can happen. Why? Because during network interruptions the metrics files might be delayed (this is the scenario I have seen many times), and when they are delayed, a backlog of metrics is processed at the same time and these long range queries happen.
Is it possible for the history syncers to have a limit on how many chained lists can be in those select queries? Otherwise, I think adjusting range_optimizer_max_mem_size is an excellent solution.
Comment by Ryan Eberly [ 2025 Jul 24 ] |
According to the MySQL documentation, with queries like this:

WHERE itemid IN (1, 2, 3, ..., 1000)
   OR itemid IN (1001, 1002, 1003, ..., 2000)
   OR ...   -- 92 such IN() lists

and each predicate using approximately 230 bytes, then:

Memory = 92,000 predicates × 230 bytes ≈ 21,160,000 bytes ≈ 20.18 MB
So, the default of 8 MB in MySQL is probably sufficient for most Zabbix instances; 92,000 itemids in a single range query by the history syncer during a trends flush/sync is probably more than most deployments will ever see. And this is still not a frequent occurrence anyway - it only happens when there is a metric backlog. During normal operations the range optimizer memory usage is probably well under the default MySQL setting. |
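As a side note, a sketch (based on the MySQL documentation; the shortened IN() lists below are placeholders) of how to confirm whether a particular statement abandoned range optimization:

```sql
-- Run the suspect statement (or EXPLAIN it) and then inspect the warnings.
EXPLAIN
SELECT itemid, num, value_min, value_avg, value_max
  FROM trends
 WHERE clock = 1753304400
   AND (itemid IN (1, 2, 3) OR itemid IN (4, 5, 6));   -- placeholder lists

-- If range_optimizer_max_mem_size was exceeded, MySQL emits a warning along the
-- lines of "Memory capacity of N bytes for 'range_optimizer_max_mem_size'
-- exceeded. Range optimization was not done for this query." and falls back to a
-- non-range plan (e.g. the full scan observed earlier in this thread).
SHOW WARNINGS;
```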
Comment by Vladislavs Sokurenko [ 2025 Jul 24 ] |
It shouldn't have taken more than 1000 IDs; we will look into it and get back to you. Is this query seen during operation or when stopping? |
Comment by Ryan Eberly [ 2025 Jul 24 ] |
In the case above it was during operation. I can try a shutdown to see what happens; the problem is that the long range queries don't always happen on shutdown. I believe trends are supposed to be calculated and written once per hour, correct? I thought I read in the manual a while ago that this happens at 5 minutes after the hour, though I'm guessing the trends write/flush is more complex than that. To best test this during a shutdown, is there an optimal time during the hour to send a SIGTERM to zabbix_server to try to trigger these long range queries? |
Comment by Vladislavs Sokurenko [ 2025 Jul 24 ] |
Most likely it will be fixed with the following patch, but I will try to reproduce the issue and get back to you after I test it:

```
diff --git a/src/libs/zbxcachehistory/cachehistory.c b/src/libs/zbxcachehistory/cachehistory.c
index 2a4b4cb48a4..2c6c3335a83 100644
--- a/src/libs/zbxcachehistory/cachehistory.c
+++ b/src/libs/zbxcachehistory/cachehistory.c
@@ -679,7 +679,19 @@ void zbx_db_flush_trends(ZBX_DC_TREND *trends, int *trends_num, zbx_vector_uint6
 		inserts_num++;
 
 		if (0 != trend->disable_from)
+		{
+			if (clock >= trend->disable_from)
+				continue;
+
+			selects_num++;
+			if (ZBX_HC_SYNC_MAX == selects_num)
+			{
+				trends_to = i + 1;
+				break;
+			}
+
 			continue;
+		}
 
 		if (NULL != trends_diff)
 			uint64_array_add(&itemids, &itemids_alloc, &itemids_num, trend->itemid, 64);
```
Comment by Ryan Eberly [ 2025 Jul 24 ] |
After adjusting the MySQL range optimizer memory I don't see frequent slow queries for the select statements, but I do continue to see several 300-second slow queries for "update trends set ..." that ultimately get rolled back because they're waiting for something else to complete. I could start looking around the performance_schema database to see what they're waiting on. Right now I see 4 of those statements waiting (one per history syncer), but I also see one history syncer (in top) at 100% CPU usage. So my guess is that the fifth history syncer is doing something that is causing the lock wait. |
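A quick sketch for that performance_schema investigation (MySQL 8.0, assuming the bundled sys schema is available), showing which session is blocking the waiting history syncers:

```sql
-- Who is waiting, on what, and who is blocking them (MySQL 8.0 sys schema).
SELECT wait_age, locked_table, waiting_pid, waiting_query, blocking_pid, blocking_query
  FROM sys.innodb_lock_waits;

-- The raw lock-wait data is also available directly:
SELECT * FROM performance_schema.data_lock_waits LIMIT 10;
```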
Comment by Ryan Eberly [ 2025 Jul 24 ] |
Unfortunately, when I do:

./zabbix_server -c <conf_file> prof_enable="history syncer,5"

nothing is getting written to the zabbix_server.log. strace -p <history syncer #5 pid> doesn't output much either - it only wrote once to the screen, showing that the syncer is trying to run an "update trends set num=..." statement. |
Comment by Ryan Eberly [ 2025 Jul 24 ] |
I ended up sending that history syncer a kill -9 signal. Immediately the other 4 history syncers were able to complete their queries quickly. |
Comment by Vladislavs Sokurenko [ 2025 Jul 24 ] |
Branch feature/ - the issue is reproducible if values are sent with present timestamps, then +1 hour, and then again -1 hour; then stop the Zabbix server and observe the large queries. |
Comment by Vladislavs Sokurenko [ 2025 Jul 24 ] |
Fixed in pull request feature/ZBX-24311-7.0 |
Comment by Ryan Eberly [ 2025 Jul 24 ] |
By the way - with the first patch you provided I actually saw better performance during operation today. None of the history syncers got stuck for extended periods of time. The other server that receives the same feed did get stuck, so the patch might be what improved performance. The biggest difference between the two servers is:
Box 1 (patched) - DDR5 RAM rated at 5500 MT/s and NVMe storage
Box 2 (not patched) - DDR4 RAM rated at 2933 MT/s and SSD storage
So, box 1 does have much faster memory which might have helped as well. Regardless, I don't see any issues with the patch provided yesterday. |
Comment by Vladislavs Sokurenko [ 2025 Jul 24 ] |
It seems the issue is present for both selects and inserts - either can try to sync more than 1000 items. The patch provided only handles selects; I am working on inserts too, though it's probably not such a big problem if the inserts are big. |
Comment by Ryan Eberly [ 2025 Jul 24 ] |
Thanks, vso. |
Comment by Ryan Eberly [ 2025 Jul 24 ] |
I agree, from what I can see right now. The inserts/updates that are "stuck" seem to be stuck because one of the 5 history syncers is doing something. The problem is, when I enable profiling (prof_enable) on the one history syncer that has high CPU usage, no profiling data is logged in zabbix_server.log. strace indicated that this history syncer was also performing an update/insert. So it appears that one history syncer may be locking the others, or holding a transaction lock in the database that the others are waiting for. Will the patch also address this issue (if the selects are smaller)? |
Comment by Vladislavs Sokurenko [ 2025 Jul 24 ] |
Yes ZBX-24311-7-2.diff |
Comment by Ryan Eberly [ 2025 Jul 24 ] |
Do you think we also need ZBX-25651-7-2-mysql.diff? |
Comment by Vladislavs Sokurenko [ 2025 Jul 25 ] |
Let's first see if limiting the selects resolves the issue; inserts were already limited, by the way. We can also compare the upsert and insert versions afterwards. The patches can be used together. |
Comment by Ryan Eberly [ 2025 Jul 25 ] |
Ok I will apply just the newest patch on the machine I did not patch yesterday and let you know how things look. |
Comment by Ryan Eberly [ 2025 Jul 25 ] |
vso - just confirming that this patch is the correct one:
```
diff --git a/src/libs/zbxcachehistory/cachehistory.c b/src/libs/zbxcachehistory/cachehistory.c
index ca0d71394c3..6a5ae5ea136 100644
--- a/src/libs/zbxcachehistory/cachehistory.c
+++ b/src/libs/zbxcachehistory/cachehistory.c
@@ -737,10 +737,15 @@ void zbx_db_flush_trends(ZBX_DC_TREND *trends, int *trends_num, zbx_vector_uint6
 		inserts_num++;
 
 		if (0 != trend->disable_from)
-			continue;
-
-		if (NULL != trends_diff)
-			uint64_array_add(&itemids, &itemids_alloc, &itemids_num, trend->itemid, 64);
+		{
+			if (clock >= trend->disable_from)
+				continue;
+		}
+		else
+		{
+			if (NULL != trends_diff)
+				uint64_array_add(&itemids, &itemids_alloc, &itemids_num, trend->itemid, 64);
+		}
 
 		selects_num++;
 		if (ZBX_HC_SYNC_MAX == selects_num)
```
Not this one:

```
diff --git a/src/libs/zbxcachehistory/cachehistory.c b/src/libs/zbxcachehistory/cachehistory.c
index 2a4b4cb48a4..2c6c3335a83 100644
--- a/src/libs/zbxcachehistory/cachehistory.c
+++ b/src/libs/zbxcachehistory/cachehistory.c
@@ -679,7 +679,19 @@ void zbx_db_flush_trends(ZBX_DC_TREND *trends, int *trends_num, zbx_vector_uint6
 		inserts_num++;
 
 		if (0 != trend->disable_from)
+		{
+			if (clock >= trend->disable_from)
+				continue;
+
+			selects_num++;
+			if (ZBX_HC_SYNC_MAX == selects_num)
+			{
+				trends_to = i + 1;
+				break;
+			}
+
 			continue;
+		}
 
 		if (NULL != trends_diff)
 			uint64_array_add(&itemids, &itemids_alloc, &itemids_num, trend->itemid, 64);
```
Comment by Vladislavs Sokurenko [ 2025 Jul 25 ] |
Yes, the last one - they are the same anyway. |
Comment by Ryan Eberly [ 2025 Jul 25 ] |
So far things look better with the latest patch. Our environment is still experiencing delays in metrics delivery, and files are coming in out of order, which would normally still be causing this slow-query behavior.
Box 1 - DDR3 RAM @ 2933 MT/s and SSD storage - has this patch only: ZBX-24311-7-2.diff
Box 2 - DDR5 RAM @ 5500 MT/s and NVMe storage - has this patch only: ZBX-25651-7-2-mysql.diff
In the last two hours Box 1 has seen 0 slow SELECT queries and 8 slow INSERT/UPDATE trends queries in total (1 insert and 7 updates). The slow INSERT/UPDATE trends queries are taking anywhere from 42 to 155 seconds, which is better than before, when they would eventually reach the 300-second lock_wait timeout we have configured in MySQL.

In the last two hours Box 2 has seen 4 slow SELECT queries and 46 slow INSERT trends|trends_uint queries; I don't see any slow UPDATE queries. Nearly all of the slow INSERTs reach the 300-second lock_wait timeout and ultimately get rolled back. I'm not sure whether the upsert patch would only positively affect UPDATE queries? |
Comment by Vladislavs Sokurenko [ 2025 Jul 25 ] |
Please post some details about the queries; upserts could help if you combine the patches. |
Comment by Ryan Eberly [ 2025 Jul 25 ] |
I was able to catch some of the select queries and I only see 1000 itemids per query. This system is super busy with high volume so I was hesitant to turn on general logging. Was there something in particular you wanted to see? Or just confirmation that we no longer see select queries with tens of thousands of itemids? |
Comment by Ryan Eberly [ 2025 Jul 26 ] |
Overnight the results look very positive for the fix in ZBX-24311-7-2.diff. I see 0 slow-running SELECTs or INSERTs/UPDATEs. Usually the biggest sign of a lock condition in the history syncers is that the history cache fills up. On Box 1 the history cache never filled up, but on Box 2 it filled up 8 times in 12 hours, which as you know results in blocking on the trapper ports and eventually prevents new metrics from coming in via the agents or any other means.

Since we are still having network issues and there is a heavy backlog of metrics, I am going to merge the two patches and deploy them on both systems. I will follow up with those results by Monday.

While applying the merged patch I had to shut down Zabbix server. As expected, Box 2 took about 30 minutes to shut down because one history syncer thread was stuck at 100% CPU usage (locking the other syncers and the database). The main zabbix_server thread, according to gdb, was in zbx_db_flush_trends() each time I checked the backtrace. Box 1 took 110 seconds to sync trends during shutdown, which is much improved. Additionally, the "syncing trend data..." log percentage was much more meaningful on Box 1. These are all good signs. |
Comment by Vladislavs Sokurenko [ 2025 Jul 26 ] |
Thanks. Seeing how it behaves with both patches, including during shutdown, would be nice. |
Comment by Ryan Eberly [ 2025 Jul 26 ] |
OK, I decided to stick around and monitor this some more. There are still a few slow queries, but they only start right around the start of a new hour. For example, I saw all history syncers except one have slow INSERT INTO trends statements that reached the lock_wait timeout at approximately 20250726:150230. The queries were issued again and on the second attempt they took 105 seconds, so there was one history syncer that held a lock for about 400 seconds. This is not horrible, because I increased the parallelism of the code that pushes metrics to the trappers by 3x. I also increased the history syncer count to 8. The average NVPS is probably somewhere around 58000 and encompasses several different hours of metrics.

It has been running at 58000 NVPS continuously for 3 hours so far. This blows the estimate of 1000 NVPS per history syncer completely out of the water!! Impressive! And the 8 history syncers are only 45% utilized.

I believe there is some code that executes hourly for the trends calculations, yes? Is it possible that the 400 seconds of slow INSERTs that occurred at approximately the 1500 hour are hitting some other function that also needs to be fixed? During the 1600 hour the slow queries only lasted 115 seconds; during the 1700 hour they lasted just over 300 seconds. This is still long, but a significant improvement over what I've seen in the past since installing version 6.0 - sometimes these long waits have lasted up to 6 hours.
I will stay around for another hour and turn on profiling of the history syncer that appears to hold the lock and see if I can find anything worthwhile, pull a GDB stack trace, etc. |
Comment by Ryan Eberly [ 2025 Jul 26 ] |
Based on the timing, it appears that the hourly trends sync happens at the 55th minute. The other box, with the NVMe storage and faster RAM, only had slow queries lasting between 50 and 140 seconds across the 1500-1700 hours. |
Comment by Ryan Eberly [ 2025 Jul 26 ] |
At the 55-56 minute mark, 7 of the 8 history syncers had an "INSERT INTO trends" command waiting in MySQL. The remaining history syncer, in top, was 100% utilized. I strace'd that syncer and saw that it was executing a bunch of these commands:

"select itemid,num,value_m"
"select itemid from trends"
"insert into trends (item"

Unlike before applying the patches, the strace output was streaming info to the screen, whereas previously it was just stuck. This event lasted 230 seconds for the 1800 hour. I had prof_enable running on that history syncer; I'll have to transcribe the output since this system is disconnected. If there is something particular from the profiling I can provide instead of having to type the whole thing, let me know.
Comment by Vladislavs Sokurenko [ 2025 Jul 28 ] |
It happens when a trend has not received any values for more than an hour, so it's still flushed at the 55th minute. Do you know if it's possible that many items don't receive values regularly? Is it possible that those are some newly created items? In that case ZBX-24311-7-2-avoid_select_for_new_items.diff could help. |
Comment by Ryan Eberly [ 2025 Jul 28 ] |
vso - Most metrics are produced on 10-minute intervals. Some are hourly, some are every 30 minutes, and some have discard-unchanged-with-heartbeat logic as well, but most items are on 10-minute intervals. It is possible that during these backlog scenarios no values have been received, or that the data comes in out of time order - those are the most common scenarios I've seen. It's difficult to evaluate everything, though, since there are 8,500,000 items in total for this instance. I can try both patches. Should I apply one patch on Box 1 and the other on Box 2? Or just merge both patches together and apply them to one of the machines while leaving the other machine unpatched, for comparison? |
Comment by Vladislavs Sokurenko [ 2025 Jul 28 ] |
It would be important to see whether, in your specific case, parallel syncing or upserts help alongside the original fix: parallel_trend_flush_on_55_min_7_2.diff |
Comment by Ryan Eberly [ 2025 Jul 28 ] |
OK, I'll add in the parallel trend flush patch. I have been running both servers with ZBX-25651-7-2-mysql.diff. |
Comment by Ryan Eberly [ 2025 Jul 28 ] |
OK, the parallel trend flush patch has been added. Is there anything specific you want me to watch for? Like I mentioned, now that the backlog has been processed, Zabbix server is very healthy. |
Comment by Vladislavs Sokurenko [ 2025 Jul 28 ] |
Yes, you said that there are slow queries at the 55th minute lasting 230 seconds; it would be nice to see if that is better or worse now - it depends on how old the incoming data is. If it does not help, we might need to think about spreading the load over time, but we can finish with this ticket and create another one for further improvements. |
Comment by Ryan Eberly [ 2025 Jul 28 ] |
Yes... there were still a few long-running INSERT commands because the first history syncer appears to lock the others at the 55th minute. After the backlog was cleared we see 1 slow query per day during the trends flush, but we don't see old data at the moment either. Let's give it 1 or 2 days, since there are occasions where old data does come in and could potentially cause the slow queries. I also agree that a future ticket would be fine for further improvement. The patches above have already greatly improved zabbix_server performance, and it might take several days or a few weeks to identify what other problems exist. Greatly appreciate the help! |
Comment by Ryan Eberly [ 2025 Jul 29 ] |
Nothing significant to report this morning with the parallel trend flush patch. I only patched 1 of the 2 boxes, to compare. Neither box has had a slow query during trend flush. We can give it one more day, or the remainder of the week to evaluate.
There has been no delay in metrics though, so the timestamps of the metrics are all very current. |
Comment by Ryan Eberly [ 2025 Jul 30 ] |
Still nothing new to report as of today. We had a few minor surges during the night due to some network issues on some of the systems sending metrics, but during those times no slow queries were observed on either of the 2 Zabbix servers. |
Comment by Ryan Eberly [ 2025 Jul 31 ] |
Still nothing new to report. I think vso is probably in agreement that we can close this issue with the fixes already provided. If/when there is another occasion of a backlog or slow queries, I'll just open another ticket. Thanks again for working through this with me. For instances that primarily rely on custom code pushing to trappers, this is a huge performance improvement. |
Comment by Vladislavs Sokurenko [ 2025 Jul 31 ] |
It's a good improvement for all users, thank you for your detailed report. |
Comment by Vladislavs Sokurenko [ 2025 Aug 07 ] |
Fixed in: