[ZBX-18071] 'Latest data' timeout via Monitoring > Hosts
Created: 2020 Jul 14 | Updated: 2024 Apr 10 | Resolved: 2020 Aug 12
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Frontend (F) |
Affects Version/s: | 5.0.2 |
Fix Version/s: | 5.0.3rc1, 5.2.0alpha1, 5.2 (plan) |
Type: | Problem report | Priority: | Trivial |
Reporter: | Frederic Cremer | Assignee: | Andrejs Verza |
Resolution: | Fixed | Votes: | 16 |
Labels: | None | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified | ||
Environment: | All servers run on Ubuntu 18.04.4 LTS, all Zabbix components upgraded from 4.4.7 to 5.0.2 |
Attachments: | (image attachments not included in this text export) |
Issue Links: |
Team: |
Sprint: | Sprint 66 (Jul 2020), Sprint 67 (Aug 2020) |
Story Points: | 1 |
Description |
Problem:
The 'Latest data' view does not load after the upgrade from 4.4.7 to 5.0.2. Database server load goes from 0.03 (only writes from the Zabbix server) to 6.00+ after clicking the 'Latest data' link - it appears to be a long-running SELECT query on the database. After some troubleshooting, it seems to be caused by 2 things:

The steps below were taken after setting the search filter limit to 100 and setting the query timeout on the database server to 60 seconds.

Steps to reproduce:

Result: Database query timeout when clicking 'Latest data' in the Hosts overview. Example of failing query:
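The query text itself is not preserved in this export; judging by the EXPLAIN statements shared in the comments below, the failing query had roughly this shape (item IDs and timestamp are illustrative only):

SELECT itemid
  FROM history_uint
 WHERE itemid IN (170927, 170928, 170975)   -- item IDs of the filtered host(s), illustrative
 GROUP BY itemid
HAVING MAX(clock) > 1594680000;             -- "newer than" timestamp, illustrative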
Other observation:
Expected:
|
Comments |
Comment by Frederic Cremer [ 2020 Jul 15 ] | ||||||||
Hi, I did some further testing today and found that it has nothing to do with 'filter_hostids[0]', but everything to do with 'filter_show_without_data=1':
"filter_show_without_data=" and "filter_show_without_data=0" are not accepted by Zabbix.
Comment by dimir [ 2020 Jul 15 ] | ||||||||
For cross-reference, this is coming from https://www.zabbix.com/forum/zabbix-troubleshooting-and-problems/405215-latest-data-broken-too-slow-since-upgrade-to-5-0-2 | ||||||||
Comment by Jerome [ 2020 Jul 15 ] | ||||||||
Hello, we did an update from Zabbix 5.0.1 to 5.0.2 and we have this problem too. Our setup is Ubuntu 16.04, PostgreSQL and Nginx.
Comment by Colin Small [ 2020 Jul 15 ] | ||||||||
I did an upgrade from Zabbix 5.0.1 to 5.0.2 and I also have this problem.
I am also using a Zabbix server with similar specs to the one reported above: nginx, PostgreSQL, Debian.
Comment by Dawid [ 2020 Jul 15 ] | ||||||||
We have this issue too after upgrade from 5.0.1 to 5.0.2. | ||||||||
Comment by Colin Small [ 2020 Jul 15 ] | ||||||||
I'm tempted to revert to a snapshot prior to the upgrade to 5.0.2 and wait for this to get sorted out. I'm new to zabbix and I don't have a feeling for how quickly fixes like this are issued. Is it OK to have 5.0.2 agents reporting to a 5.0.1 server if I revert to the snapshot? | ||||||||
Comment by dimir [ 2020 Jul 15 ] | ||||||||
Absolutely. | ||||||||
Comment by Colin Small [ 2020 Jul 15 ] | ||||||||
Thank you. I reverted back to 5.0.1 for now and I will follow this bug. | ||||||||
Comment by Andrejs Verza [ 2020 Jul 16 ] | ||||||||
Resolved in development branch feature/ZBX-18071-5.0 | ||||||||
Comment by Vjacheslav Ryzhevskiy [ 2020 Jul 16 ] | ||||||||
Thank you! It works better now. I noticed that collapsing of element groups is missing.
Comment by Andrejs Verza [ 2020 Jul 16 ] | ||||||||
Thank you for your feedback, vryzhevsky. Glad to hear it's now working better. Collapsing elements were removed within | ||||||||
Comment by Vjacheslav Ryzhevskiy [ 2020 Jul 16 ] | ||||||||
IMHO grouping is very important functionality when you have hundreds or thousands of items in groups.
Comment by dimir [ 2020 Jul 21 ] | ||||||||
Good news, grouping by application is coming back in | ||||||||
Comment by Andrejs Verza [ 2020 Jul 22 ] | ||||||||
Fixed in: | ||||||||
Comment by Evren Yurtesen [ 2020 Jul 27 ] | ||||||||
Are you sure this fixes the issue? I applied the fix from the commit to CHistoryManager.php and PostgreSQL seems to execute the correct statement, but it still takes several minutes to execute. It was working fine before upgrading to 5.0.2.
Comment by Andrejs Verza [ 2020 Jul 27 ] | ||||||||
Thank you for your feedback, yurtesen. Could you please turn on the debug mode (Administration => User groups => Debug mode) and make sure the mentioned SQL is the one taking that much time to execute? | ||||||||
Comment by Evren Yurtesen [ 2020 Jul 27 ] | ||||||||
averza well, I am pretty sure, because: 1) I can see it hanging in the PostgreSQL query list in pg_stat_activity; 2) I tried to execute it myself on the Zabbix database and it took several minutes to execute. I am not able to use debug mode, because nginx gives a gateway timeout after several minutes when I try to load Latest data.
Comment by Andrejs Verza [ 2020 Jul 27 ] | ||||||||
yurtesen , how many records do you typically have in the history_uint table? Also - could you please execute the same SQL query you mentioned directly in the database with the EXPLAIN keyword before the query? What does it tell? Thank you. | ||||||||
Comment by Iaroslav Dronskii [ 2020 Jul 27 ] | ||||||||
The query is still extremely slow. We have quite a few hosts with 1000+ items. DB: postgresql 11 + timescaledb 1.7.2.
A small example: Query:
// zabbix=# explain select itemid from history_uint where (itemid in (72864,80600,80601,80602,80603,80604,80605,80606,80607,80608,80609,80610,80611,80612,80614,80617,80619,80620,80622,80623,80625,80628,80630,80631,80632,80634,80635,80636,80637,80640,80641,80643,80644,80646,80651,80652,80653,8065480655,80657,80658,80659,80660,80661,80662,80667,80668,80670,80673,80675,80676,80677,80678,80679,80680,80681,80683,80684,80686,80688,80689,80690,80691,80692,80693,80694,80696,80699,80701,80702,80704,80705,80707,80710,80712,80713,80714,80716,80717,80718,80719,80722,80723,80725,80726,80728,80733,80734,80735,80736,8077,80739,80740,80741,80742,80743,80744,80749,80750,80752,80755,80757,80758,80759,80760,80761,80762,80763,80765,80766,80768,80770,80771,80772,80773,80774,80775,80776,80777,80778,80779,80780,80781,80782,80783,80784,80785,80786,80787,80788,80789,80790,80791,80792,80793,80794,80795,80796,80797,80798,80799,80800,80801,183765,183767,183768,183774,183775,183776,183782,183783,183784,183786,183787,183788,183790,183799,183800,183801) and clock > 1595770706) group by itemid;
Query plan:
// Group (cost=1.15..245671.07 rows=200 width=8)
Group Key: _hyper_2_1464_chunk.itemid
-> Merge Append (cost=1.15..245105.30 rows=226306 width=8)
Sort Key: _hyper_2_1464_chunk.itemid
-> Index Only Scan using _hyper_2_1464_chunk_history_uint_1 on _hyper_2_1464_chunk (cost=0.57..139175.09 rows=131971 width=8)
Index Cond: ((itemid = ANY ('{72864,80600,80601,80602,80603,80604,80605,80606,80607,80608,80609,80610,80611,80612,80614,80617,80619,80620,80622
,80623,80625,80628,80630,80631,80632,80634,80635,80636,80637,80640,80641,80643,80644,80646,80651,80652,80653,8065480655,80657,80658,80659,80660,80661,80662,80
667,80668,80670,80673,80675,80676,80677,80678,80679,80680,80681,80683,80684,80686,80688,80689,80690,80691,80692,80693,80694,80696,80699,80701,80702,80704,8070
5,80707,80710,80712,80713,80714,80716,80717,80718,80719,80722,80723,80725,80726,80728,80733,80734,80735,80736,8077,80739,80740,80741,80742,80743,80744,80749,8
0750,80752,80755,80757,80758,80759,80760,80761,80762,80763,80765,80766,80768,80770,80771,80772,80773,80774,80775,80776,80777,80778,80779,80780,80781,80782,807
83,80784,80785,80786,80787,80788,80789,80790,80791,80792,80793,80794,80795,80796,80797,80798,80799,80800,80801,183765,183767,183768,183774,183775,183776,18378
2,183783,183784,183786,183787,183788,183790,183799,183800,183801}'::bigint[])) AND (clock > 1595770706))
-> Index Only Scan using _hyper_2_1459_chunk_history_uint_1 on _hyper_2_1459_chunk (cost=0.57..103667.14 rows=94335 width=8)
Index Cond: ((itemid = ANY ('{72864,80600,80601,80602,80603,80604,80605,80606,80607,80608,80609,80610,80611,80612,80614,80617,80619,80620,80622
,80623,80625,80628,80630,80631,80632,80634,80635,80636,80637,80640,80641,80643,80644,80646,80651,80652,80653,8065480655,80657,80658,80659,80660,80661,80662,80
667,80668,80670,80673,80675,80676,80677,80678,80679,80680,80681,80683,80684,80686,80688,80689,80690,80691,80692,80693,80694,80696,80699,80701,80702,80704,8070
5,80707,80710,80712,80713,80714,80716,80717,80718,80719,80722,80723,80725,80726,80728,80733,80734,80735,80736,8077,80739,80740,80741,80742,80743,80744,80749,8
0750,80752,80755,80757,80758,80759,80760,80761,80762,80763,80765,80766,80768,80770,80771,80772,80773,80774,80775,80776,80777,80778,80779,80780,80781,80782,807
83,80784,80785,80786,80787,80788,80789,80790,80791,80792,80793,80794,80795,80796,80797,80798,80799,80800,80801,183765,183767,183768,183774,183775,183776,18378
2,183783,183784,183786,183787,183788,183790,183799,183800,183801}'::bigint[])) AND (clock > 1595770706))
(8 rows)
Yesterday's chunk has 251 million rows in history_uint:
// zabbix=# select count(*) from _timescaledb_internal._hyper_2_1459_chunk;
count
-----------
251618040
(1 row)
The query scans the indexes of the last two chunks (today + yesterday). Opening the host with 4301 items is impossible; the query hangs for over 5 minutes (at which point PHP execution time limits kick in). The same thing should happen on any DB with any kind of partitioning.
Comment by Evren Yurtesen [ 2020 Jul 27 ] | ||||||||
averza I have PostgreSQL 12.2 with timescaledb 1.7.2. The `history_uint` table has 164614857 rows... not sure if that is more than usual or not? With EXPLAIN, I get something like:

Group (cost=207040303.57..207061521.88 rows=200 width=8)
  Group Key: _hyper_2_671_chunk.itemid
  -> Gather Merge (cost=207040303.57..207061521.38 rows=200 width=8)
       Workers Planned: 1
       -> Group (cost=207039303.56..207060498.87 rows=200 width=8)
            Group Key: _hyper_2_671_chunk.itemid
            -> Sort (cost=207039303.56..207049901.22 rows=4239063 width=8)
                 Sort Key: _hyper_2_671_chunk.itemid
                 -> Parallel Append (cost=108530.70..206504790.14 rows=4239063 width=8)
                      -> Parallel Bitmap Heap Scan on _hyper_2_671_chunk (cost=108530.70..124960673.58 rows=2659992 width=8)
                           Recheck Cond: ...

So... can't this be reverted back? It was working very nicely before 5.0.2...
Comment by James Kirsop [ 2020 Jul 28 ] | ||||||||
Here's a copy of an explain of the query we're discussing on my system. It seems to take about 10 seconds to execute, so not terribly long (and not nearly as long as some of the other slow Zabbix queries I'm experiencing in 5.x):
MariaDB [zabbix_prod]> EXPLAIN SELECT itemid FROM history WHERE itemid IN (170927,170928,170975,172834,172835,172836,172837,172838,172839,172840,172841,172842,172843,172844,172845,172906,172907,172908,172909,172910,172911,172912,172913,172914,172915,172916,172917,207614,207615,207626,207627) GROUP BY itemid HAVING MAX(clock)>1595803513; | ||||||||
Comment by Andrejs Verza [ 2020 Jul 28 ] | ||||||||
Hi, jameskirsop. This has been fixed within this branch. You can either wait for the next 5.0.3rc1 / 5.2.0alpha1 release or just apply the commits mentioned in "fixed in" comment. | ||||||||
Comment by Andrejs Verza [ 2020 Jul 28 ] | ||||||||
yurtesen, thank you for the provided information. Could you please also check if the performance issue persists when the "Show items without data" checkbox is checked?
Comment by Evren Yurtesen [ 2020 Jul 28 ] | ||||||||
averza again, are you sure it is fixed? As I mentioned, it does not seem to help. You know, it was NOT easy to check the "show items" checkbox, because the page never loaded - I had to pass it as a URL parameter. But then PHP ran out of memory (the default was 128M); this never happened before 5.0.2. I increased the memory limit and loaded the latest items. It was still somewhat slow with the "Show items without data" checkbox checked; in the past it opened much faster. Then I tried the debug mode... it froze my browser for a while... no surprise that PHP ran out of memory...

******************** Script profiler ********************
Total time: 1.846662
Total SQL time: 1.067919
SQL count: 1031 (selects: 1028 | executes: 13)
Peak memory usage: 167M
Memory limit: 256M

Do you want me to test more things? Because I am thinking of downgrading to an older version right now.
| ||||||||
Comment by Andrejs Verza [ 2020 Jul 29 ] | ||||||||
Hi, yurtesen. I must agree that the previous versions worked faster, but there were issues (like missing data, mentioned in | ||||||||
Comment by Evren Yurtesen [ 2020 Jul 29 ] | ||||||||
averza in issue I do not claim to know the best way to resolve this issue, but I would rather have performance over features where possible. Nowadays people are used to fast response times. You will end up with people choosing solution X, which is more responsive than Zabbix, even if that solution X has a fraction of the features that Zabbix provides. Maybe that is unreasonable, but that is how people are. Anyway, hopefully this issue gets fixed...
Comment by dimir [ 2020 Jul 29 ] | ||||||||
There are many issues like that, and all of them exist because the frontend needs to get data from history tables, which is usually really slow. Before 2.2, Zabbix wasn't using history tables to view "Latest data": the latest values of each item were stored in the items table. This allowed a quick view of the latest data, but it caused big issues when handling configuration internally. Thus, in Zabbix 2.2.0 these columns were removed from the items table. Since then, "Latest data" is fetched from the history tables, and that proved to be slow and resource-consuming on non-small installations (e. g. see

One of the solutions to fix this issue once and for all - to introduce special tables (lastvalue_*) - is explained here. And it actually was tested and proved to work on one big setup.
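For illustration only - the actual proposal is described at the page linked above - such a per-item "last value" table might look roughly like this (table and column names are hypothetical; the value type loosely follows the PostgreSQL history_uint schema):

-- Hypothetical sketch: one row per item, overwritten by the history syncer
-- whenever a new value arrives.
CREATE TABLE lastvalue_uint (
    itemid  bigint      NOT NULL PRIMARY KEY,
    clock   integer     NOT NULL,
    value   numeric(20) NOT NULL
);

-- "Latest data" then becomes a primary-key lookup instead of a history scan:
SELECT itemid, clock, value
  FROM lastvalue_uint
 WHERE itemid IN (72864, 80600, 80601);   -- illustrative item IDs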
Comment by Evren Yurtesen [ 2020 Jul 31 ] | ||||||||
dimir the solution you suggest is interesting and seems helpful, but it sort of duplicates some data... For this specific case the biggest problem is when Zabbix is trying to find non-empty items. It executes something like

select itemid from history_uint where (itemid in (72864,80600,...MANY ITEMIDS...) and clock > 1595770706) group by itemid;

It does not even fetch the data for these items... and it takes a long time because of the GROUP BY on PostgreSQL - perhaps MySQL works faster? Maybe it is related to this... https://wiki.postgresql.org/wiki/Loose_indexscan I hope this gets somewhat fixed in 5.0.3 at least...
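For reference, the "loose index scan" emulation described on that wiki page is a recursive CTE along these lines; this is the generic pattern for distinct itemid values only, and adapting it to the additional clock and itemid IN (...) filters used here is not shown, so treat it purely as a sketch:

WITH RECURSIVE t AS (
    (SELECT itemid FROM history_uint ORDER BY itemid LIMIT 1)
    UNION ALL
    SELECT (SELECT itemid
              FROM history_uint
             WHERE itemid > t.itemid
             ORDER BY itemid
             LIMIT 1)
      FROM t
     WHERE t.itemid IS NOT NULL
)
SELECT itemid FROM t WHERE itemid IS NOT NULL;   -- one row per distinct itemid, via repeated index seeks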
Comment by dimir [ 2020 Aug 05 ] | ||||||||
yurtesen, for the record, we used to have this duplication before (in the items table) and there was no such issue. As for non-empty items, the solution was proposed specifically for this case. The Latest data page does not care about anything other than the latest value. The biggest problem in the query you mention is the amount of records that need to be processed. In the proposed solution there is exactly 0 or 1 record affected, thus the results are returned instantly.
Comment by Andrejs Verza [ 2020 Aug 06 ] | ||||||||
A patch for Zabbix version 5.0.2 is available: patch_for_5.0.2.patch

git apply patch_for_5.0.2.patch
Comment by Stefano Enrico Mendola [ 2020 Aug 06 ] | ||||||||
Hi, I've just applied the patch. However, when navigating the sidebar and clicking Monitoring > Latest data, the page tries to load but fails with an error 500. Here's the error I'm getting in the Apache error_log.
[Thu Aug 06 11:30:04.504373 2020] [proxy_fcgi:error] [pid 51080] [client 192.168.142.214:62151] AH01071: Got error 'PHP message: PHP Fatal error: Allowed memory size of 268435456 bytes exhausted (tried to allocate 20480 bytes) in /usr/share/zabbix/include/classes/api/CApiService.php on line 288\nPHP message: PHP Fatal error: Allowed memory size of 268435456 bytes exhausted (tried to allocate 20480 bytes) in /usr/share/zabbix/include/func.inc.php on line 1071\n'
We tried to increase the PHP memory from 256M to 1G in the Zabbix GUI configuration, but this only gave us a 503 instead of a 500. | ||||||||
Comment by Andrejs Verza [ 2020 Aug 06 ] | ||||||||
Hi, stefano.mendola. HTTP code 503 is possibly related to wrong server configuration. Are you sure the web server is up and running after you updated the configuration? | ||||||||
Comment by Stefano Enrico Mendola [ 2020 Aug 06 ] | ||||||||
Hi Andrejs, yes, the web server was up and running. What we exactly experienced and did was:
In this whole process Apache was up and running. | ||||||||
Comment by Clemens Schwaighofer [ 2020 Aug 07 ] | ||||||||
The patch won't fix it - I tried that before I even found this bug report. The 503 error is a FastCGI/proxy timeout to PHP-FPM, and it happens because the queries (several of them) take longer than the PHP timeout. I have set the PHP execution timeout, proxy timeout and HTTP connection timeout up to 15 min and it is still too slow - >15 min to get latest data. It is not one query; it seems to run different queries in succession, like an endless loop.
Comment by Sebastian [ 2020 Aug 07 ] | ||||||||
Hello, can you roll back the "latest data" functions to work the same as in 5.0.1?
Comment by Andrejs Verza [ 2020 Aug 07 ] | ||||||||
Hello, Kendzens! Unfortunately we cannot revert "latest data" to 5.0.1, since there were issues with incomplete data selection. Instead we are aiming to achieve both correctness of the displayed data and good performance. Could you please try patch_for_5.0.2_v2.patch? Note that the previous patch must be uninstalled first.

git apply -R patch_for_5.0.2.patch
git apply patch_for_5.0.2_v2.patch
Comment by Cory White [ 2020 Aug 07 ] | ||||||||
Hello Andrejs - since closing | ||||||||
Comment by Andrejs Verza [ 2020 Aug 07 ] | ||||||||
Hi, Xponet. The PHP memory issue will hopefully be resolved if you apply the patch.
Comment by Andrejs Verza [ 2020 Aug 07 ] | ||||||||
A patch v2.1 is available for testing. Note that the previous patches must be uninstalled first.

git apply -R patch_for_5.0.2_v2.patch
git apply -R patch_for_5.0.2.patch
git apply patch_for_5.0.2_v2.1.patch
Comment by Cory White [ 2020 Aug 07 ] | ||||||||
I don't see the v2.1 patch. Also, does this apply to the MariaDB backend as well?
Comment by Stefano Enrico Mendola [ 2020 Aug 07 ] | ||||||||
I have applied the patch_v2, and I was at least able to open Monitoring > Latest data on its own, but as soon as I filtered for a host with roughly 3000 items, the query was still slow and ended up with a 503. Still, ticking the "Show items without data" checkbox has the effect of making the query 700% faster. Then I applied the patch_v2.1 and I'm again unable to load the Monitoring > Latest data page on its own: a slow query that ends up with a 503, same as we experienced before the patch_v2.
Comment by Andrejs Verza [ 2020 Aug 07 ] | ||||||||
Hi stefano.mendola. Thank you for your feedback. As you have already noticed, Latest data works well when "Show items without data" is checked (the default state). However, filtering only data with values is quite a challenge on hosts with thousands of items, taking into account that some stats are needed to render the list. Nevertheless, the patch v3 will address this exact issue. Please stay tuned.
Comment by Andrejs Verza [ 2020 Aug 07 ] | ||||||||
Xponet, yes, the patch is independent from the database type. | ||||||||
Comment by Cory White [ 2020 Aug 07 ] | ||||||||
I applied the patch to our scenario - it does fix the HTTP 500 error - but again, the expected 'clear' state (blank page) for all filters on Latest data shows all hosts, based on the limit of 1000 rows from global settings. I even reverted back to the global 5000 rows; the page loads, but it is not blank (as expected from previous versions) when hitting Reset.
Comment by Andrejs Verza [ 2020 Aug 08 ] | ||||||||
Hi, Xponet. Glad to hear the page is now loading even with the increased number of rows. Please note that the requirement to set up some filtering prior to displaying the data was removed in version 5.0.
Comment by Cory White [ 2020 Aug 10 ] | ||||||||
Hello - I understand the new workflow of 5.0: Monitoring -> Hosts -> Latest data for that host. But having to apply filters on Latest data directly, or load all hosts, seems backwards, especially to existing users upgrading. If there is underlying code for this change, then it seems removing 'Reset' from the page would force this logic for all users. Though it makes more sense to revert to 'clearing' filters if possible, especially since this loading of all hosts was causing our HTTP error? My 2 cents, but thank you for the quick patch/fix.
Comment by Marco Hofmann [ 2020 Aug 10 ] | ||||||||
+1 to Xponet | ||||||||
Comment by Andrejs Verza [ 2020 Aug 10 ] | ||||||||
Hi! Imagine a simple setup where there are just several hosts with several items. In such a situation, what would be the reason not to show all the latest data grouped by hosts and applications?
Comment by Stefano Enrico Mendola [ 2020 Aug 10 ] | ||||||||
I think there should be the possibility to configure this option (maybe under Administration > General). | ||||||||
Comment by Cory White [ 2020 Aug 10 ] | ||||||||
Adding it as an option/preference would be a nice middle ground/workaround. Maybe even adding logic to load only active hosts? Our deployment is both static and dynamic, so not all hosts are monitored all the time (we have nearly 7000 hosts in total). 'Reset' inherently implies clearing and starting fresh, regardless of deployment size. As Marco stated, it's intuitive for clearing filters, especially for existing users.
Comment by Marco Hofmann [ 2020 Aug 10 ] | ||||||||
Hi averza! Your argument is correct and valid. But Zabbix has made clear in the past, with several design decisions, that they design the Zabbix frontend with large systems in mind.
Comment by dimir [ 2020 Aug 11 ] | ||||||||
Showing everything always sounded weird to me too. IMO this would work in 2 cases:
| ||||||||
Comment by dimir [ 2020 Aug 11 ] | ||||||||
As averza mentioned, it all would be much easier if we knew which items have recently received new values. We have the table item_rtdata that is used to store log-monitoring-related data:

select i.key_,r.* from item_rtdata r,items i where r.itemid=i.itemid order by i.key_ limit 10;
+-----------------+--------+-------------+-------+-------+-------+
| key_            | itemid | lastlogsize | state | mtime | error |
+-----------------+--------+-------------+-------+-------+-------+
| agent.hostname  |  30154 |           0 |     0 |     0 |       |
| agent.hostname  |  23327 |           0 |     0 |     0 |       |
| agent.hostname  |  30219 |           0 |     0 |     0 |       |
| agent.ping      |  30220 |           0 |     0 |     0 |       |
| agent.ping      |  30155 |           0 |     0 |     0 |       |
| agent.ping      |  23287 |           0 |     0 |     0 |       |
| agent.version   |  30221 |           0 |     0 |     0 |       |
| agent.version   |  30156 |           0 |     0 |     0 |       |
| agent.version   |  23288 |           0 |     0 |     0 |       |
| kernel.maxfiles |  30222 |           0 |     0 |     0 |       |
+-----------------+--------+-------------+-------+-------+-------+
10 rows in set (0.000 sec)

That would be extended with an additional column, e. g. lastclock, that would contain the timestamp of the latest value received and would be updated by the history syncer. Showing latest data in this case would be much less resource consuming:

select i.key_,r.* from item_rtdata r,items i where r.itemid=i.itemid order by i.key_ limit 10;
+-----------------+--------+-------------+-------+-------+-------+------------+
| key_            | itemid | lastlogsize | state | mtime | error | lastclock  |
+-----------------+--------+-------------+-------+-------+-------+------------+
| agent.hostname  |  30154 |           0 |     0 |     0 |       | 1597142926 |
| agent.hostname  |  23327 |           0 |     0 |     0 |       | 1597142926 |
| agent.hostname  |  30219 |           0 |     0 |     0 |       | 1597142926 |
| agent.ping      |  30220 |           0 |     0 |     0 |       | 1597142926 |
| agent.ping      |  30155 |           0 |     0 |     0 |       | 1597142926 |
| agent.ping      |  23287 |           0 |     0 |     0 |       | 1597142926 |
| agent.version   |  30221 |           0 |     0 |     0 |       | 1597142926 |
| agent.version   |  30156 |           0 |     0 |     0 |       | 1597142926 |
| agent.version   |  23288 |           0 |     0 |     0 |       | 1597142926 |
| kernel.maxfiles |  30222 |           0 |     0 |     0 |       | 1597142926 |
+-----------------+--------+-------------+-------+-------+-------+------------+
10 rows in set (0.000 sec)
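A rough sketch of what that change and the corresponding lookup could look like (the column name and the queries below are illustrative, not a committed design or an actual Zabbix patch):

ALTER TABLE item_rtdata ADD COLUMN lastclock integer DEFAULT 0 NOT NULL;

-- Finding items with recent values would then not touch the history_* tables at all:
SELECT r.itemid
  FROM item_rtdata r
 WHERE r.itemid IN (30154, 30155, 30156)   -- illustrative item IDs
   AND r.lastclock > 1597142926;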
Comment by Velko Ivanov [ 2020 Aug 11 ] | ||||||||
IMHO This query is slow because of the WHERE itemid IN (....) clause, not the GROUP BY, or anything else.
This WHERE clause results in a full index scan - in the worst case it will read the whole index and check each row, and that is what the EXPLAIN output tells us (looking at the data posted by Iaroslav Dronskii):

Index Only Scan using _hyper_2_1464_chunk_history_uint_1 on _hyper_2_1464_chunk (cost=0.57..139175.09 rows=131971 width=8)

cost=0.57..139175.09 / rows=131971 means that it will do anywhere between a few reads - if all the specified itemids happen to be in the first pages - and very many reads - all the rows of the table, if the itemids are at the end - and it expects 140k rows as the result of the scan. It is only the index - 8 bytes per row - but at the 250M rows the table has, that makes about 2G read from disk, and with, say, 50 itemids, 250M * 50 basic IF comparisons in a loop.
Since we are looking for latest items, depending on how the itemid is generated we might always be looking at items towards the end of the index and thus always scanning the full index and it won't matter what else we do - as long as the WHERE '.dbConditionInt('itemid', $type_itemids). is in there, which it is in all the posted patches, the problem won't be solved.
One thing that can help is reversing the index to (itemid DESC, clock) - this way, if I'm right, the index scan will always start close to the items we are looking for and the query will be quick. This might cause other problems though - like slowing down writes to the table. Another thing would be to add an index by clock (since the currently existing (itemid, clock) index cannot be used this way) and pre-select the rows that are newer than the timestamp we are looking for, then filter by itemid within those. Something like:

SELECT itemid FROM ( SELECT itemid FROM history_X WHERE clock > N ) T WHERE itemid IN (.....) GROUP BY itemid
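For clarity, the two index alternatives described above would be created roughly like this (the index names are made up for illustration; as noted, extra or reordered indexes may slow down writes on tables of this size):

-- Option 1: reversed composite index, so the scan starts close to the itemids being looked up
CREATE INDEX history_uint_itemid_desc_clock ON history_uint (itemid DESC, clock);

-- Option 2: a separate index on clock, to pre-select recent rows before filtering by itemid
CREATE INDEX history_uint_clock ON history_uint (clock);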
| ||||||||
Comment by Andrejs Verza [ 2020 Aug 11 ] | ||||||||
Hi, vivanov. Thank you for your investigation and explanation. Unfortunately the provided solution did not add any performance gain on MySQL/PostgreSQL. At this point the ticket is going to be merged and closed soon, but the resolution of the issue will continue on a larger scale by improving the database structure.
Comment by Marco Hofmann [ 2020 Aug 11 ] | ||||||||
Is there / will there be another ZBX article where the community can watch the ongoing development and discussion regarding this bigger-than-expected problem, averza?
Comment by Andrejs Verza [ 2020 Aug 11 ] | ||||||||
Fixed in: | ||||||||
Comment by Velko Ivanov [ 2020 Aug 12 ] | ||||||||
Here is a bit more thorough investigation, focusing on my second suggestion from above. I recorded a latest-data query from my DB (PostgreSQL) that was running slowly:

EXPLAIN SELECT itemid FROM history_uint WHERE itemid IN (...260 itemids...) GROUP BY itemid HAVING MAX(clock)>1597045676;

GroupAggregate (cost=0.56..3228852.84 rows=851 width=8)
  Group Key: itemid
  Filter: (max(clock) > 1597045676)
  -> Index Only Scan using history_uint_1 on history_uint (cost=0.56..3206559.20 rows=4452349 width=12)
       Index Cond: (itemid = ANY ('{...260 itemids...}'::bigint[]))
JIT:
  Functions: 5
  Options: Inlining true, Optimization true, Expressions true, Deforming true

Cost = 0.56 - 3'228'852.84

Compared to what I suggest:

EXPLAIN SELECT itemid FROM ( SELECT itemid FROM history_uint WHERE clock>1597045676 ) T WHERE itemid IN (...260 itemids...);

Index Only Scan using history_uint_1 on history_uint (cost=0.56..753747.83 rows=581576 width=8)
  Index Cond: ((itemid = ANY ('{...260 itemids...}'::bigint[])) AND (clock > 1597045676))
JIT:
  Functions: 2
  Options: Inlining true, Optimization true, Expressions true, Deforming true

Cost = 0.56 - 753'747.83
I added an index on the clock column:

CREATE INDEX ON history_uint(clock);
VACUUM ANALYZE history_uint;

I ran the same queries and Postgres decided to ignore the new index. I also tried CREATE INDEX ON history_uint(clock, itemid); but it didn't use that one either.
The VACUUM ANALYZE changed the predictions though, so here is what I get at the end:

Original query:

EXPLAIN SELECT itemid FROM history_uint WHERE itemid IN (...260 itemids...) GROUP BY itemid HAVING MAX(clock)>1597045676;

Finalize GroupAggregate (cost=1000.58..202475.31 rows=853 width=8)
  Group Key: itemid
  Filter: (max(clock) > 1597045676)
  -> Gather Merge (cost=1000.58..202430.53 rows=2559 width=12)
       Workers Planned: 1
       -> Partial GroupAggregate (cost=0.56..201142.63 rows=2559 width=12)
            Group Key: itemid
            -> Parallel Index Only Scan using history_uint_1 on history_uint (cost=0.56..187923.95 rows=2638619 width=12)
                 Index Cond: (itemid = ANY ('{...260 itemids...}'::bigint[]))
JIT:
  Functions: 8
  Options: Inlining false, Optimization false, Expressions true, Deforming true

Suggested query:

EXPLAIN SELECT itemid FROM ( SELECT itemid FROM history_uint WHERE clock>1597045676 ) T WHERE itemid IN (...260 itemids...);

Index Only Scan using history_uint_1 on history_uint (cost=0.56..32292.77 rows=641860 width=8)
  Index Cond: ((itemid = ANY ('{...260 itemids...}'::bigint[])) AND (clock > 1597045676))

Suggested, with grouping:

EXPLAIN SELECT itemid FROM ( SELECT itemid FROM history_uint WHERE clock>1597045676 ) T WHERE itemid IN (...260 itemids...) GROUP BY itemid;

Group (cost=0.56..33897.42 rows=2559 width=8)
  Group Key: history_uint.itemid
  -> Index Only Scan using history_uint_1 on history_uint (cost=0.56..32292.77 rows=641860 width=8)
       Index Cond: ((itemid = ANY ('{...260 itemids...}'::bigint[])) AND (clock > 1597045676))

Suggested, with DISTINCT:

EXPLAIN SELECT DISTINCT itemid FROM ( SELECT itemid FROM history_uint WHERE clock>1597045676 ) T WHERE itemid IN (...260 itemids...);

Unique (cost=0.56..33897.42 rows=2559 width=8)
  -> Index Only Scan using history_uint_1 on history_uint (cost=0.56..32292.77 rows=641860 width=8)
       Index Cond: ((itemid = ANY ('{...260 itemids...}'::bigint[])) AND (clock > 1597045676))
All the suggested variants have Cost = 0.56 - 33'897.42 (or 32'292.77) The original - Cost = 1'000.58 - 202'475.31
This is a clear improvement both in complexity and cost and it doesn't involve creating a new index. Although, to be honest, the greatest change in predicted cost came from simply vacuuming the table.
And finally, the actual run times of the queries:

Suggested (with DISTINCT): 4127.091 ms (00:04.127)
Original: 8841.342 ms (00:08.841)

Both return 189 rows.

After being cached by the DB:

Suggested (with DISTINCT): 150.435 ms
Original: 883.536 ms

Based on that, I would recommend using this variant of the query:

SELECT DISTINCT itemid FROM ( SELECT itemid FROM history_uint WHERE clock>1597045676 ) T WHERE itemid IN (...);
It seems to improve things not by lowering the total row count, but by eliminating the overhead of aggregation, which is not what I intended, and it doesn't improve as much as we would like, I guess - but halving the execution time is still a gain. Indexing on history_X(clock, itemid) should also be tested more thoroughly, because the fact that it didn't get used in this case doesn't mean there isn't a situation where the query optimizer would choose it. Also, what @dimir suggests above is reasonable, and I expect it to be the real improvement I was looking for initially.
| ||||||||
Comment by Andrejs Verza [ 2020 Aug 12 ] | ||||||||
Hi, vivanov. Please note that the mentioned original SQL (the one with the HAVING clause) has already been fixed in the very first commit. Please also note that the history tables are optimized for writes, therefore adding new indexes is not an option. All other options (not considering a DB structure change) come down to playing with SQL syntax, which did not actually show good results under heavy load.
Comment by Velko Ivanov [ 2020 Aug 12 ] | ||||||||
@Andrejs Verza, That "playing with SQL syntax" is changing the structure of the query execution plan considerably.
There is a clear difference between

Finalize GroupAggregate (cost=1000.58..195560.99 rows=1080 width=8)
  Group Key: itemid
  Filter: (max(clock) > 1597228657)
  -> Gather Merge (cost=1000.58..195504.27 rows=3241 width=12)
       Workers Planned: 1
       -> Partial GroupAggregate (cost=0.56..194139.65 rows=3241 width=12)
            Group Key: itemid
            -> Parallel Index Only Scan using history_1 on history (cost=0.56..182670.91 rows=2287266 width=12)

And

Unique (cost=0.56..1376.43 rows=2608 width=8)
  -> Index Only Scan using history_1 on history (cost=0.56..1363.20 rows=5292 width=8)

The run times improve a lot on my system, it is repeatable, and it is especially noticeable when clock is a more recent value. Please try again with this query:

SELECT DISTINCT itemid FROM ( SELECT itemid FROM history_uint WHERE clock > X ) T WHERE itemid IN (...);
You might have gotten bad results before, since the query in my first post wouldn't even return the correct rows - it returns thousands where it should return about a hundred. This one does return the correct result and improves the run time.
| ||||||||
Comment by Andrejs Verza [ 2020 Aug 12 ] | ||||||||
vivanov, in your last comment you're referring to decomposition of SQL plan, in which there is this fragment: (max(clock) > 1597228657) Are you sure you're operating with the up-to-date data, since according to this commit, there is no more aggregation by max clock? | ||||||||
Comment by Velko Ivanov [ 2020 Aug 12 ] | ||||||||
No, I'm not on the latest patch. Judging by the query plans the DB generates, the change in this commit should do pretty much the same thing as the query that I'm suggesting. Alright, I'll check how it does after it gets released in 5.0.3 and if there is something to add I'll come back. | ||||||||
Comment by Stefano Enrico Mendola [ 2020 Aug 13 ] | ||||||||
Hi, I have applied the new patch, but I didn't notice any improvement. For the record, we have about 200 hosts with roughly 3K+ items each. We are still experiencing the same timeouts due to the slow queries | ||||||||
Comment by Rostislav Palivoda [ 2020 Aug 13 ] | ||||||||
patch_for_5.0.2_v2.2.patch

We are considering another option - to make a change in data structure and application design to improve performance.
Comment by Sebastian [ 2020 Aug 16 ] | ||||||||
Hello, sorry, but the latest patch is not working for me.
I cannot install it - I don't even have a "ui" folder in my installation:

root@zabbix:~# git apply 112275_patch_for_5.0.2_v2.2.patch
error: ui/app/controllers/CControllerLatest.php: No such file or directory
error: ui/app/controllers/CControllerLatestView.php: No such file or directory
error: ui/app/controllers/CControllerLatestViewRefresh.php: No such file or directory
error: ui/include/classes/api/managers/CHistoryManager.php: No such file or directory

When I try to find the target files:

root@zabbix:/etc/zabbix# find / -name CControllerLatest.php
/usr/share/zabbix/app/controllers/CControllerLatest.php

Could you please provide steps on how to install the patch properly? I am using Ubuntu 16, so the "zabbix-frontend-php-deprecated" 5.0.2 package is used, if that matters. Or, if anyone can assist with how to roll back from 5.0.2 to 5.0.1 on Ubuntu, that would be helpful. apt-get install «pkg»=«version» shows only the 5.0.2 version, and I cannot install 5.0.1 with apt. I can roll back by restoring a snapshot taken before the 5.0.2 upgrade, but then I would lose the data collected since then.
Comment by Andrejs Verza [ 2020 Aug 17 ] | ||||||||
Hi, Kendzens. Please apply patch_for_5.0.2_v2.1_no_ui_folder.patch. Put the file inside the /usr/share/zabbix directory and then run:

git apply patch_for_5.0.2_v2.1_no_ui_folder.patch
Comment by Marco Hofmann [ 2020 Aug 17 ] | ||||||||
@Kendzens You can manually download and install the DEB file. For example:
Comment by dimir [ 2020 Aug 17 ] | ||||||||
Guys who had performance issues, you are welcome to test changes in | ||||||||
Comment by Sebastian [ 2020 Aug 17 ] | ||||||||
Thank you, Andrejs. Marco, you are a lifesaver - I was looking for that deb direct link today with no luck, and now you have helped me with this URL. I have downgraded to 5.0.1 and I'm a happy user again. Everything works smoothly.
Comment by Mickael Martin [ 2020 Aug 26 ] | ||||||||
Seems not fixed in 5.0.3: it takes 5-8 s to load the page. First host: (screenshot) The second host (same templates, same items): (screenshot)
Comment by Alex [ 2020 Sep 02 ] | ||||||||
The problem is not resolved in 5.0.3 for PostgreSQL:

Total time: 31.01048

Longest requests:

SQL (0.431231): SELECT itemid FROM history_str WHERE itemid IN (79175,79177,79196,79201) AND clock>1598958317 GROUP BY itemid
Explain Analyze (re-execution is fast): Group (cost=0.42..79.38 rows=78 width=8) (actual time=0.091..0.845 rows=4 loops=1)

SQL (26.423314): SELECT itemid FROM history_uint WHERE itemid IN (79176,79178,79179,79180,79181,79182,79183,79187,79197,79198,79200,79202,79203,79204,79205,79207,79418,79419,79424,79425,79426,79427,79432,79433,79434,79435,79436,79437,79438,79439) AND clock>1598958317 GROUP BY itemid
Explain Analyze (re-execution is fast): Group (cost=0.57..61.77 rows=127 width=8) (actual time=0.079..90.334 rows=30 loops=1)

SQL (3.98664): SELECT itemid FROM history WHERE itemid IN (79184,79185,79186,79188,79189,79190,79191,79192,79193,79194,79195,79199,79206,79420,79421,79422,79423,79428,79429,79430,79431) AND clock>1598958317 GROUP BY itemid
Explain Analyze (re-execution is fast): Group (cost=0.57..15188.91 rows=5735 width=8) (actual time=0.081..165.820 rows=21 loops=1)

Another example:

Probable cause: slow WHERE ... IN (...)
Comment by John [ 2020 Sep 14 ] | ||||||||
Good morning, I have the same problem with version 5.0.3. See the SQL debug output below.
SQL (0.031619): SQL (0.000306): SELECT md.moduleid,md.id,md.relative_path,md.config FROM module md WHERE md.status=1 ORDER BY md.relative_path SQL (0.000221): SELECT NULL FROM users u WHERE u.userid=64 FOR UPDATE SQL (0.000173): SELECT type,value_id,value_int,value_str,idx2 FROM profiles WHERE userid=64 AND idx='web.sidebar.mode' SQL (0.000688): SELECT type,value_id,value_int,value_str,idx2 FROM profiles WHERE userid=64 AND idx='web.hostsmon.sort' SQL (0.000174): SELECT type,value_id,value_int,value_str,idx2 FROM profiles WHERE userid=64 AND idx='web.hostsmon.sortorder' SQL (0.000157): SELECT type,value_id,value_int,value_str,idx2 FROM profiles WHERE userid=64 AND idx='web.hostsmon.filter.active' SQL (0.000123): SELECT type,value_id,value_int,value_str,idx2 FROM profiles WHERE userid=64 AND idx='web.hostsmon.filter.tags.tag' SQL (0.000165): SELECT type,value_id,value_int,value_str,idx2 FROM profiles WHERE userid=64 AND idx='web.hostsmon.filter.name' SQL (0.000141): SELECT type,value_id,value_int,value_str,idx2 FROM profiles WHERE userid=64 AND idx='web.hostsmon.filter.groupids' SQL (0.000139): SELECT type,value_id,value_int,value_str,idx2 FROM profiles WHERE userid=64 AND idx='web.hostsmon.filter.ip' SQL (0.000126): SELECT type,value_id,value_int,value_str,idx2 FROM profiles WHERE userid=64 AND idx='web.hostsmon.filter.dns' SQL (0.000116): SELECT type,value_id,value_int,value_str,idx2 FROM profiles WHERE userid=64 AND idx='web.hostsmon.filter.port' SQL (0.000142): SELECT type,value_id,value_int,value_str,idx2 FROM profiles WHERE userid=64 AND idx='web.hostsmon.filter.status' SQL (0.000143): SELECT type,value_id,value_int,value_str,idx2 FROM profiles WHERE userid=64 AND idx='web.hostsmon.filter.evaltype' SQL (0.00014): SELECT type,value_id,value_int,value_str,idx2 FROM profiles WHERE userid=64 AND idx='web.hostsmon.filter.severities' SQL (0.000147): SELECT type,value_id,value_int,value_str,idx2 FROM profiles WHERE userid=64 AND idx='web.hostsmon.filter.show_suppressed' SQL (0.000154): SELECT type,value_id,value_int,value_str,idx2 FROM profiles WHERE userid=64 AND idx='web.hostsmon.filter.maintenance_status' SQL (0.00026): SELECT c.* FROM config c SQL (6.7E-5): BEGIN SQL (0.001038): SELECT h.hostid,h.name,h.status FROM hosts h WHERE h.flags IN (0,4) AND h.status IN (0,1) AND ((UPPER(h.name) LIKE '%MTFLOSRV004%' ESCAPE '!')) ORDER BY h.name LIMIT 10001 SQL (5.3E-5): COMMIT SQL (5.2E-5): BEGIN SQL (0.000197): SELECT h.hostid,h.name,h.status,h.maintenance_status,h.maintenanceid,h.maintenance_type,h.available,h.snmp_available,h.jmx_available,h.ipmi_available,h.error,h.ipmi_error,h.snmp_error,h.jmx_error FROM hosts h WHERE h.flags IN (0,4) AND h.hostid=11205 AND h.status IN (0,1) SQL (0.006594): SELECT DISTINCT COUNT(DISTINCT g.graphid) AS rowscount,i.hostid FROM graphs g,graphs_items gi,items i WHERE i.hostid=11205 AND gi.graphid=g.graphid AND i.itemid=gi.itemid AND g.flags IN (0,4) GROUP BY i.hostid SQL (0.004319): SELECT COUNT(DISTINCT ht.httptestid) AS rowscount,ht.hostid FROM httptest ht WHERE ht.hostid=11205 GROUP BY ht.hostid SQL (0.001484): SELECT hosttagid,tag,value,hostid FROM host_tag WHERE hostid=11205 SQL (0.000281): SELECT hi.interfaceid,hi.ip,hi.dns,hi.port,hi.main,hi.type,hi.useip,hi.hostid FROM interface hi WHERE hi.hostid=11205 SQL (0.000271): SELECT ht.* FROM hosts_templates ht WHERE hostid=11205 SQL (0.000109): SELECT ht.* FROM hosts_templates ht WHERE hostid=10109 SQL (0.00316): SELECT COUNT(DISTINCT s.screenid) AS rowscount,s.templateid FROM screens s WHERE s.templateid IS 
NOT NULL AND s.templateid IN (10109,11205) GROUP BY s.templateid SQL (0.000209): SELECT ht.hostid,ht.templateid FROM hosts_templates ht WHERE ht.hostid=11205 SQL (0.000146): SELECT ht.hostid,ht.templateid FROM hosts_templates ht WHERE ht.hostid=10109 SQL (0.000201): SELECT h.hostid FROM hosts h WHERE h.status=3 AND h.hostid=10109 SQL (0.000124): SELECT hosttagid,tag,value,hostid FROM host_tag WHERE hostid=10109 SQL (5.0E-5): COMMIT SQL (0.00011): BEGIN SQL (0.001352): SELECT DISTINCT t.triggerid FROM triggers t,functions f,items i WHERE i.hostid=11205 AND f.triggerid=t.triggerid AND f.itemid=i.itemid AND NOT EXISTS (SELECT NULL FROM functions f,items i,hosts h WHERE t.triggerid=f.triggerid AND f.itemid=i.itemid AND i.hostid=h.hostid AND (i.status<>0 OR h.status<>0)) AND t.status=0 AND t.flags IN (0,4) SQL (0.000343): SELECT d.triggerid_down,d.triggerid_up,t.value FROM trigger_depends d,triggers t WHERE d.triggerid_up=t.triggerid AND d.triggerid_down IN (38577,76261,76262,76263,76264,76265,76266,76267,76268,76270,76271,76272,76273,76274,76275,76279,76280,76281,76282,76283,76284,76285,76286,76287,76288,76289,76290) SQL (0.000117): SELECT t.triggerid FROM triggers t,functions f,items i,hosts h WHERE t.triggerid=f.triggerid AND f.itemid=i.itemid AND i.hostid=h.hostid AND (i.status=1 OR h.status=1 OR t.status=1) AND 1=0 SQL (0.000424): SELECT f.triggerid,i.hostid FROM functions f,items i WHERE f.triggerid IN (38577,76261,76262,76263,76264,76265,76266,76267,76268,76270,76271,76272,76273,76274,76275,76279,76280,76281,76282,76283,76284,76285,76286,76287,76288,76289,76290) AND f.itemid=i.itemid SQL (0.000179): SELECT h.hostid FROM hosts h WHERE h.flags IN (0,4) AND h.hostid=11205 AND h.status IN (0,1,3) SQL (6.0E-5): COMMIT SQL (5.4E-5): BEGIN SQL (0.000647): SELECT p.eventid,p.objectid,p.severity FROM problem p WHERE p.source='0' AND p.object='0' AND p.objectid IN (38577,76261,76262,76263,76264,76265,76266,76267,76268,76270,76271,76272,76273,76274,76275,76279,76280,76281,76282,76283,76284,76285,76286,76287,76288,76289,76290) AND NOT EXISTS (SELECT NULL FROM event_suppress es WHERE es.eventid=p.eventid) AND p.r_eventid IS NULL SQL (4.9E-5): COMMIT SQL (0.000326): SELECT type,value_id,value_int,value_str,idx2 FROM profiles WHERE userid=64 AND idx='web.layout.mode' SQL (0.000487): SELECT type,value_id,value_int,value_str,idx2 FROM profiles WHERE userid=64 AND idx='web.login.attempt.failed'
| ||||||||
Comment by Andrejs Verza [ 2020 Sep 14 ] | ||||||||
Hi, Seraph! Let me remind you and other users that further development was moved to | ||||||||
Comment by Stefano Enrico Mendola [ 2020 Sep 17 ] | ||||||||
Hi everyone, I've noticed another problem related to this query. | ||||||||
Comment by Velko Ivanov [ 2020 Sep 17 ] | ||||||||
Hi @Stefano, Yes, it is by default unchecked. When you manually check it, it retains the state of the filter while you go back and forth between pages, but when you go there via a "Latest data" link on a host, the filter is reset and the box is unchecked again. This is why in | ||||||||
Comment by Stefano Enrico Mendola [ 2020 Sep 17 ] | ||||||||
Hi Ivanov, Thanks for the quick answer. And I'm glad to hear it's finally being considered | ||||||||
Comment by Velko Ivanov [ 2020 Sep 17 ] | ||||||||
To be clear, I'm not on the Zabbix dev team - I'm just a user participating in the discussion. It was suggested more than a few times, but as far as I am aware, there have been no indications from the team whether this is being considered, or not. | ||||||||
Comment by Valdis Murzins [ 2020 Sep 17 ] | ||||||||
Hello stefano.mendola, This issue is reported and hopefully will be fixed in |