[ZBX-3877] Slow query on housekeeper with PostgreSQL Created: 2011 Jun 10 Updated: 2020 Feb 17 Resolved: 2011 Jul 07 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Server (S) |
Affects Version/s: | 1.8.5 |
Fix Version/s: | 1.8.6, 1.9.5 (alpha) |
Type: | Incident report | Priority: | Major |
Reporter: | Ildefonso Camargo | Assignee: | dimir |
Resolution: | Fixed | Votes: | 0 |
Labels: | None | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified | ||
Environment: |
Ubuntu Lucid 64 bits. DB PostgreSQL 9.0.3. ~24GB RAM, RAID10 array. Shared database (other apps use this DB). |
Attachments: | ZBX-3877-housekeeper-postgresql.patch |
Description |
Greetings,

First: thanks for this great monitoring system. We use it with PostgreSQL on several production servers, and will be able to test and give feedback / patches for PostgreSQL issues / improvements we may find for Zabbix.

I found this query to be a slow one (more than 80 seconds query duration):

delete from history where itemid=22330 and oid in (select oid from history where itemid=22330 limit 500);

After taking a look at the code, it comes from housekeeper.c, around line 93 (in the 1.8.5 source). Anyway, I just added this index:

create index idx_oid on history (oid);

Please note the index is not unique. Now the query takes only 1.5 seconds. This is a Major issue because it was actually slowing down the whole server (there are other apps using this DB server). I think you should include that index in the DB schema.

On another (related) issue: that delete query looks a little scary to me, because it doesn't precisely identify which rows it has to delete. I'd feel safer if you were to use a more specific query, something like:

delete from history where itemid=22330 and oid in (select oid from history where itemid=22330 order by clock asc limit 500);

That will sort using clock and delete the 500 oldest items; this query currently takes around 2 seconds to run.

I hope this helps to improve this already good monitoring system. I'll take a closer look at the DB usage and report other findings I may come up with.

Sincerely, Ildefonso Camargo |
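A minimal sketch of how to verify that the index actually changes the plan, assuming the 1.8 schema where history is created WITH OIDS (plain EXPLAIN only shows the plan; EXPLAIN ANALYZE would execute the delete for real, so use it on test data only):

    -- Plan before the index: expect a sequential scan in the oid lookup.
    explain
    delete from history
    where itemid=22330
      and oid in (select oid from history where itemid=22330 limit 500);

    create index idx_oid on history (oid);

    -- Plan after the index: the oid lookup should switch to an index scan.
    explain
    delete from history
    where itemid=22330
      and oid in (select oid from history where itemid=22330 limit 500);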
Comments |
Comment by richlv [ 2011 Jun 10 ] |
a quick summary of findings so far : zabbix_server/housekeeper/housekeeper.c in 1.8 head, line 105 references oid inside #elif defined(HAVE_POSTGRESQL). oid appeared there in revision 3291 by osmiy back in 2006-09-08 |
Comment by Ildefonso Camargo [ 2011 Jun 10 ] |
Hi richlv, Well, I believe you use oid there because you don't have another key to identify each row, and you don't use serial/bigserial ID columns (which would make the oid unnecessary). Maybe just using itemid and clock would do the trick. Ildefonso. |
Comment by dimir [ 2011 Jun 17 ] |
Reproduced the issue with 20 million entries (per "select count(oid) from history"):

"delete from history where itemid=18468 and oid in (select oid from history where itemid=18468 limit 500)"
real 1m6.152s |
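For anyone reproducing this, a minimal sketch of generating comparable test data (the simplified three-column history table, the item id range, and the row spread are assumptions for illustration; WITH OIDS matches the oid-based queries above but was later removed from PostgreSQL):

    -- Simplified stand-in for the Zabbix 1.8 history table.
    create table history (
        itemid bigint  not null,
        clock  integer not null,
        value  numeric(16,4) not null default 0
    ) with oids;

    -- 20 million rows spread across 100 fake items,
    -- with monotonically increasing clock values.
    insert into history (itemid, clock, value)
    select 18400 + (n % 100), n / 100, random()
    from generate_series(1, 20000000) as n;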
Comment by dimir [ 2011 Jun 17 ] |
After adding an index:

"delete from history where itemid=18468 and oid in (select oid from history where itemid=18468 limit 500)"
real 0m0.069s

"delete from history where itemid=18468 and oid in (select oid from history where itemid=18468 order by clock asc limit 500)"
real 0m0.434s |
Comment by dimir [ 2011 Jun 17 ] |
Just for information: in case of 27 million entries in the history table, the oid index would take an additional 500 MB of disk space. With the index in place:

schemaname | tablename | size_pretty | total_size_pretty
public | history | 2003 MB | 3896 MB |
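For reference, a sketch of how such size figures can be obtained (the column aliases mirror the headers above; pg_relation_size and pg_total_relation_size have existed since PostgreSQL 8.1):

    select schemaname,
           tablename,
           pg_size_pretty(pg_relation_size((schemaname || '.' || tablename)::regclass)) as size_pretty,
           pg_size_pretty(pg_total_relation_size((schemaname || '.' || tablename)::regclass)) as total_size_pretty
    from pg_tables
    where tablename = 'history';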
Comment by Ildefonso Camargo [ 2011 Jun 18 ] |
Greetings, I actually found this query to be faster:

delete from history where oid in (select oid from history where itemid=22344 order by clock limit 500);

So we select by oid from the inner select query. This one takes only around 74 ms on one of the servers. Ildefonso. |
Comment by Ildefonso Camargo [ 2011 Jun 18 ] |
Hi, I just came up with this one:

delete from history where itemid=22344 and clock in (select clock from history where itemid=22344 order by clock asc limit 500);

It just came out of my mind, to remove the OID (which, apparently, is deprecated now). Now, I have one question on this: why is the "LIMIT 500" needed? Can't we just do something like:

delete from history where itemid=22344 and clock<some_time;

Sincerely, Ildefonso Camargo |
Comment by Ildefonso Camargo [ 2011 Jun 18 ] |
Hi! Sorry, for some reason I confused this query with the query used to clean up history for older data. Now I believe this function (housekeeping_process_log) is used to delete all data for a particular item... is it? It takes data from the housekeeper table (which is inserted from DBdelete_history_by_itemids). If that's the case, maybe this query will work:

delete from history where itemid=22344 and clock in (select clock from history where itemid=22344 limit 500);

because I believe clock is unique for a particular itemid (since the sample rate can't go under 1 second). I'm trying to avoid the overhead of the oid index (and, maybe, oid at all). I'm still getting myself a little more into the Zabbix code. Sincerely, Ildefonso Camargo. |
Comment by dimir [ 2011 Jun 20 ] |
Thanks. I have dug a bit more into it and here are my results. Let's run the benchmarks on a separate box with PostgreSQL 8.4.7 where the history table has 27 million records:

select count(oid) from history;
Time: 24393.350 ms

Current version to clean up history:

delete from history where itemid=18468 and oid in (select oid from history where itemid=18468 limit 500);

Same with ordering:

delete from history where itemid=18468 and oid in (select oid from history where itemid=18468 order by clock asc limit 500);

Same without 'where itemid=18468' in outer query:

delete from history where oid in (select oid from history where itemid=18468 order by clock limit 500);

Not much improvement so far. Let's continue. Same without using oid (your last suggestion):

delete from history where itemid=18468 and clock in (select clock from history where itemid=18468 limit 500);

Now this looks fine, except for the fact that we have to rely on clock being unique for a given itemid. So then I've found something different when looking for an answer. Using ctid with 'any (array)':

delete from history where ctid = any(array(select ctid from history where itemid=18468 limit 500));

Now this looks like a decent solution, and we do not rely on the fact that clock is unique: http://www.postgresql.org/docs/8.4/static/ddl-system-columns.html |
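A hedged way to compare these variants beyond wall-clock timing (EXPLAIN ANALYZE executes the statement for real, so wrapping it in a transaction and rolling back keeps the test data intact; the itemid is the test value used above):

    begin;
    explain analyze
    delete from history
    where ctid = any(array(select ctid from history where itemid=18468 limit 500));
    rollback;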
Comment by Ildefonso Camargo [ 2011 Jun 20 ] |
Hi! Thanks for working on this, Vladimir! Yes, that one looks perfect. At some point along the way I tried ctid, with worse results (but on a different query). That one looks great; on my overloaded machine it takes just 5 seconds (down from the 25 it was taking with my latest attempt). The problem with this machine is that my indexes are getting "kicked out" of cache by the rest of the system's load, hence the 5-second run time (if I run the query twice in a row, I get <10 ms times).

Anyway, while reading the source, it looks like the ordering is not necessary, because this query is being used to delete all of a host's data (apparently)... is that what the query is used for? (deleting a whole host/item). If so, we could rewrite it as:

delete from history where ctid = any (array(select ctid from history where itemid=18468 limit 500));

making it even faster, because if we are going to completely delete an item's data, we don't really care in which order the rows are deleted.

Now, deleting the data in batches of 500 rows per hour (according to what I saw in the logs, the housekeeper was running these once per hour) is maybe too slow... why not delete 2000~5000 at a time? Because, see, a common item, sampled at a 30-second interval and keeping 15 days of history, will take over 80 hours to get deleted at this rate (500/hour), and the server has to do an index scan to delete just 500 items...

Another suggestion I could give is table partitioning. According to what I read, the history* tables can get huge real quick. Let's say, for 20 hosts and 70 items per host: most items get sampled at a 30-second rate, so that's 120 entries per hour, and 24 hours a day: 120*24 = 2880 rows per item per day. 2880 rows/day * 15 days = 43200 rows per item. 70 items/host * 43200 rows/item = 3024000 rows/host. And now, for just 20 hosts: ~60,480,000 rows.

Now, I actually saw there were different history tables, but still, the tables can get huge real fast (let alone if I wanted to, for example, keep 90 days of data). Adding table partitioning could be a good idea for future Zabbix versions (and doesn't require application code modification, just DB schema changes; see the sketch after this comment): http://www.postgresql.org/docs/current/interactive/ddl-partitioning.html

Sincerely, Ildefonso Camargo |
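A minimal sketch of the inheritance-based partitioning those docs describe, as it worked in the PostgreSQL 8.4/9.0 era (the monthly ranges, child-table names, and routing trigger are illustrative assumptions, not part of any Zabbix schema):

    -- Parent table stays empty; each child holds one month of data.
    create table history_y2011m06 (
        check (clock >= 1306886400 and clock < 1309478400)  -- June 2011, unix time
    ) inherits (history);

    create table history_y2011m07 (
        check (clock >= 1309478400 and clock < 1312156800)  -- July 2011
    ) inherits (history);

    -- Route inserts on the parent to the matching child.
    create or replace function history_insert_trigger() returns trigger as $$
    begin
        if new.clock >= 1306886400 and new.clock < 1309478400 then
            insert into history_y2011m06 values (new.*);
        elsif new.clock >= 1309478400 and new.clock < 1312156800 then
            insert into history_y2011m07 values (new.*);
        else
            raise exception 'history_insert_trigger: clock out of range';
        end if;
        return null;  -- the row is already stored in a child table
    end;
    $$ language plpgsql;

    create trigger history_partition_insert
        before insert on history
        for each row execute procedure history_insert_trigger();

    -- Dropping old history then becomes instant, with no row-by-row deletes:
    drop table history_y2011m06;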
Comment by dimir [ 2011 Jun 20 ] |
Thanks for the input, Ildefonso! To quickly answer your "why the 500 entries limit" question: this is only a matter of configuration. You are free to set the MaxHousekeeperDelete parameter in the server config to up to 1000000 (more info: http://www.zabbix.com/documentation/1.8/complete#individual_processes). |
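For illustration, that is a one-line change in zabbix_server.conf (the value 5000 is just the batch size floated above, not an official recommendation):

    # Upper limit of rows deleted per housekeeper cycle; range 0-1000000.
    MaxHousekeeperDelete=5000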
Comment by Ildefonso Camargo [ 2011 Jun 20 ] |
Duh... you are right. Because it was all uppercase in the code, I thought it was a hard-coded constant. |
Comment by dimir [ 2011 Jun 20 ] |
One disadvantage of this solution: array constructors appeared in PostgreSQL 7.4, and officially zabbix-1.8 has to support 7.0.2 or later. http://www.postgresql.org/docs/current/static/release-7-4.html (E.120.3.7. Data Type and Function Changes). We can raise the PostgreSQL version requirement for 2.0, but I'm not sure about 1.8. |
Comment by dimir [ 2011 Jun 20 ] |
We could check if array constructors are available like this:

select 1 from pg_proc where proname='string_to_array';

Then, if no rows are returned, use the old SQL to run housekeeper jobs; otherwise use the new one. |
Comment by Ildefonso Camargo [ 2011 Jun 20 ] |
Hi! Well, maybe it shouldn't use the old query. This query:

delete from history where itemid=22344 and clock in (select clock from history where itemid=22344 limit 500);

should work with the older 7.0 pgsql, and it is better than the original one (and doesn't require adding the oid index, thus simplifying the upgrade process). See, given that we are deleting all rows from the table where itemid=N, it doesn't really matter if clock isn't unique for the same itemid. And, of course, for newer versions we should use the newer one:

delete from history where ctid = any(array(select ctid from history where itemid=18468 limit 500));

I don't have a 7.0 pgsql lying around, so I can't verify the query you sent me to check for array support, but it should work (as far as I can remember)... However, I could bet there is no one running 7.0 these days. About the supported version for 2.0, I would recommend supporting 8.3 and up (because of the many improvements, and the fact that it will be supported for at least a couple more years). Sincerely, Ildefonso. |
Comment by Joshua D. Drake [ 2011 Jun 20 ] |
I think it makes the most sense to only support versions that PostgreSQL.Org supports, even for 1.8. As of January 1st, 2012, only 8.3+ will be supported, and 8.2 is out the door 12/2011. See here: http://wiki.postgresql.org/wiki/PostgreSQL_Release_Support_Policy Sincerely, Joshua D. Drake |
Comment by dimir [ 2011 Jun 21 ] |
I have checked it on PostgreSQL 7.2 and the parser was complaining about "array". |
Comment by dimir [ 2011 Jun 21 ] |
Thank you Joshua, we will definitely use that table for the upcoming 2.0 release. |
Comment by dimir [ 2011 Jun 21 ] |
We have discussed it and decided to check for the PostgreSQL version and execute the query accordingly in 1.8. In 2.0 it will be executed with ctid and array, without checks. |
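A hedged sketch of the two code paths being decided on here (the pg_proc probe from the earlier comment stands in for the actual version check, and the fallback is assumed to stay close to the original oid form; the itemid is the test value used throughout this thread):

    -- Probe: array constructors exist from PostgreSQL 7.4 onwards.
    select 1 from pg_proc where proname='string_to_array';

    -- If the probe returns a row, use the new form:
    delete from history
    where ctid = any(array(select ctid from history where itemid=18468 limit 500));

    -- Otherwise keep the pre-7.4-safe form:
    delete from history
    where itemid=18468
      and oid in (select oid from history where itemid=18468 limit 500);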
Comment by Ildefonso Camargo [ 2011 Jun 22 ] |
Great! Do you have a patch of your current changes, so that I can apply it to 1.8.5 and test it on our servers (and give feedback)? Thanks! Ildefonso. |
Comment by dimir [ 2011 Jun 22 ] |
Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBX-3877 |
Comment by dimir [ 2011 Jun 22 ] |
Ildefonso, I will attach the patch as soon as it is reviewed. |
Comment by richlv [ 2011 Jun 22 ] |
also note that you can always try out the development branch |
Comment by dimir [ 2011 Jul 07 ] |
Fixed in 1.8 r20464, trunk r20465. |