[ZBX-3877] Slow query on housekeeper with PostgreSQL Created: 2011 Jun 10  Updated: 2020 Feb 17  Resolved: 2011 Jul 07

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Server (S)
Affects Version/s: 1.8.5
Fix Version/s: 1.8.6, 1.9.5 (alpha)

Type: Incident report Priority: Major
Reporter: Ildefonso Camargo Assignee: dimir
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu Lucid 64 bits. DB PostgreSQL 9.0.3. ~24GB RAM, RAID10 array. Shared database (other apps use this DB).


Attachments: File ZBX-3877-housekeeper-postgresql.patch    

 Description   

Greetings,

First: thanks for this great monitoring system. We use it with PostgreSQL on several production servers, and will be able to test and give feedback / patches for PostgreSQL issues / improvements we may find for Zabbix.

I found this query to be a slow one (more than 80 seconds duration):

delete from history where itemid=22330 and oid in (select oid from history where itemid=22330 limit 500);

After taking a look at the code, it comes from housekeeper.c, around line 93 (in the 1.8.5 source).

Anyway, I just added this index:

create index idx_oid on history (oid);

Please note the index is not unique. And now the query takes only 1.5 seconds.

This is a Major issue because it was actually slowing down the whole server (there are other apps using this DB server).

I think you should include that index into the DB schema.

One other (related) issue: that delete query looks a little scary to me, because it doesn't precisely identify which rows it has to delete. I'd feel safer if you used a more specific query, something like:

delete from history where itemid=22330 and oid in (select oid from history where itemid=22330 order by clock asc limit 500);

That will sort by clock and delete the 500 oldest entries; this query currently takes around 2 seconds to run.

I hope this helps to improve this already good monitoring system. I'll take a closer look at the DB usage and report any other findings I come up with.

Sincerely,

Ildefonso Camargo



 Comments   
Comment by richlv [ 2011 Jun 10 ]

a quick summary of findings so far :

zabbix_server/housekeeper/housekeeper.c in 1.8 head, line 105, references oid inside #elif defined(HAVE_POSTGRESQL)
it's not done for other databases, and there are no comments explaining why oid is used there.

oid appeared there in revision 3291 by osmiy back on 2006-09-08
that revision did lots of things, so it's not apparent which change the oid appearance could be attributed to

Comment by Ildefonso Camargo [ 2011 Jun 10 ]

Hi richlv,

Well, I believe you use oid there because you don't have another key to identify each row, and you don't use serial/bigserial ID columns (which would make the oid unnecessary). Maybe just using itemid and clock would do the trick.

Ildefonso.

Comment by dimir [ 2011 Jun 17 ]

Reproduced the issue with 20 million entries.

"select count(oid) from history"
27189967

"delete from history where itemid=18468 and oid in (select oid from history where itemid=18468 limit 500)"
DELETE 500

real 1m6.152s

Comment by dimir [ 2011 Jun 17 ]

After adding an index:

"delete from history where itemid=18468 and oid in (select oid from history where itemid=18468 limit 500)"
DELETE 500

real 0m0.069s

"delete from history where itemid=18468 and oid in (select oid from history where itemid=18468 order by clock asc limit 500)"
DELETE 500

real 0m0.434s

Comment by dimir [ 2011 Jun 17 ]

Just for information: with 27 million entries in the history table, the oid index takes roughly an additional 580 MB of disk space:

without index

schemaname | tablename | size_pretty | total_size_pretty
-----------+-----------+-------------+-------------------
public     | history   | 2003 MB     | 3313 MB

with index

schemaname | tablename | size_pretty | total_size_pretty
-----------+-----------+-------------+-------------------
public     | history   | 2003 MB     | 3896 MB

Comment by Ildefonso Camargo [ 2011 Jun 18 ]

Greetings,

I actually found this query to be faster:

delete from history where oid in (select oid from history where itemid=22344 order by clock limit 500);

So we select by oid alone, taken from the inner select query.

This one takes only around 74 ms on one of the servers.

Ildefonso.

Comment by Ildefonso Camargo [ 2011 Jun 18 ]

Hi,

I just came up with this one:

delete from history where itemid=22344 and clock in (select clock from history where itemid=22344 order by clock asc limit 500);

It just came to mind as a way to remove the OID (which, apparently, is deprecated now).

Now, I have one question on this: why is the "LIMIT 500" needed? Can we just do something like:

delete from history where itemid=22344 and clock<some_time;

Sincerely,

Ildefonso Camargo

Comment by Ildefonso Camargo [ 2011 Jun 18 ]

Hi!

Sorry, for some reason I confused this query with the one used to clean up older history data.

Now, I believe this function is used to delete all data for a particular item... is it? (housekeeping_process_log). It takes its data from the housekeeper table (which is populated by DBdelete_history_by_itemids).

If that's the case, maybe this query will work:

delete from history where itemid=22344 and clock in (select clock from history where itemid=22344 limit 500);

because I believe clock is unique for a particular itemid (since the sample rate can't go below 1 second). I'm trying to avoid the overhead of the oid index (and maybe oid altogether).

I'm still getting myself a little deeper into the Zabbix code.

Sincerely,

Ildefonso Camargo.

Comment by dimir [ 2011 Jun 20 ]

Thanks. I have dug a bit more into it and here are my results.

Let's run the benchmarks on a separate box with PostgreSQL 8.4.7 where nobody is using the DB and the machine is almost idle. We'll run the queries from the psql command line with \timing turned on.

The history table has 27 million records:

select count(oid) from history;
 count
----------
 27466123
(1 row)

Time: 24393.350 ms

The current query used to clean up history:

delete from history where itemid=18468 and oid in (select oid from history where itemid=18468 limit 500);
DELETE 500
Time: 21605.729 ms

Same with ordering:

delete from history where itemid=18468 and oid in (select oid from history where itemid=18468 order by clock asc limit 500);
DELETE 500
Time: 22670.892 ms

Same without 'where itemid=18468' in outer query:

delete from history where oid in (select oid from history where itemid=18468 order by clock asc limit 500);
DELETE 500
Time: 21828.539 ms

Not much improvement so far. Let's continue.

Same without using oid (your last suggestion):

delete from history where itemid=18468 and clock in (select clock from history where itemid=18468 order by clock asc limit 500);
DELETE 500
Time: 65.273 ms

Now this looks fine, except that we have to rely on there never being more than one value per second. This worries me a bit because of trapper items, where you send the values yourself and there is no limit on how fast you do it.
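
That worry can be shown with a quick sketch (sqlite3 standing in for PostgreSQL here; the table contents are made up): if a trapper item stores several values within the same second, the clock-based delete removes more rows than the LIMIT asked for.

```python
import sqlite3

# In-memory stand-in for the history table (sqlite3 instead of PostgreSQL).
con = sqlite3.connect(":memory:")
con.execute("create table history (itemid integer, clock integer, value real)")

# A trapper item can push several values within the same second,
# so clock is NOT unique per itemid here: three rows share clock=100.
rows = [(1, 100, 0.1), (1, 100, 0.2), (1, 100, 0.3), (1, 101, 0.4), (1, 102, 0.5)]
con.executemany("insert into history values (?, ?, ?)", rows)

# Try to delete the "2 oldest rows" via the clock-based query.
cur = con.execute(
    "delete from history where itemid=1 and clock in "
    "(select clock from history where itemid=1 order by clock asc limit 2)"
)
print(cur.rowcount)  # 3 rows gone, not 2: every row with clock=100 matched
```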

Then I found something different while looking for an answer on the PostgreSQL forums (see below).

Using ctid with 'any (array)':

delete from history where ctid = any (array(select ctid from history where itemid=18468 order by clock asc limit 500));
DELETE 500
Time: 3.597 ms

Now this looks like a decent solution, and we do not rely on there being at most one value per second. CTID is one of the system columns:

http://www.postgresql.org/docs/8.4/static/ddl-system-columns.html
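
The same batched-delete shape can be sketched outside PostgreSQL; here sqlite3's rowid plays the role of ctid (the table and data are made up, so this is only the pattern, not Zabbix code):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table history (itemid integer, clock integer, value real)")
con.executemany("insert into history values (?, ?, ?)",
                [(1, c, 0.0) for c in range(1000)] + [(2, 500, 1.0)])

# Delete one item's rows in batches of 500, addressing each row by its
# physical identifier (rowid here, ctid in PostgreSQL), oldest first.
while True:
    cur = con.execute(
        "delete from history where rowid in (select rowid from history "
        "where itemid=1 order by clock asc limit 500)"
    )
    if cur.rowcount == 0:
        break

print(con.execute("select count(*) from history").fetchone()[0])  # 1: only itemid=2 left
```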

Comment by Ildefonso Camargo [ 2011 Jun 20 ]

Hi!

Thanks for working on this, Vladimir!

Yes, that one looks perfect. At some point along the way, I tried ctid, with worse results (but in a different query). That one looks great; on my overloaded machine it takes just 5 seconds (down from the 25 it was taking with my latest attempt). The problem with this machine is that my indexes are getting "kicked out" of cache by the rest of the system's load, hence the 5-second run time (if I run the query twice in a row, I get <10 ms times).

Anyway, while reading the source, it looks like the ordering is not necessary, because this query is apparently used to delete all of a host's data... is that what the query is used for (deleting a whole host/item)? If so, we could rewrite it as:

delete from history where ctid = any (array(select ctid from history where itemid=18468 limit 500));

Making it even faster, because if we are going to completely delete an item's data, we don't really care in which order the rows are deleted. Now, deleting the data in batches of 500 rows per hour (according to what I saw in the logs, the housekeeper runs once per hour) is maybe too slow... why not delete 2000-5000 at a time? Because, see, a common item sampled at a 30-second interval with 15 days of history will take over 80 hours to get deleted at this rate (500/hour), and the server has to do an index scan just to delete 500 items...

Another suggestion I could give is table partitioning. From what I've read, the history* tables can get huge real quick. Let's say we have 20 hosts and 70 items per host:

Most items get sampled at a 30-second rate, so that's 120 entries per hour, 24 hours a day:

120*24=2880 rows per item a day.

2880 rows/day * 15 days=43200 rows per item.

70 items/host * 43200rows/item = 3024000 rows/host

And now, for 20 hosts (just 20 hosts): ~60,480,000 rows
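
The arithmetic above can be double-checked with a throwaway calculation (the 30-second interval, 15-day history, 70 items and 20 hosts are just this example's assumptions):

```python
rows_per_item_per_day = (3600 // 30) * 24   # 30-second sampling -> 2880 rows/day
rows_per_item = rows_per_item_per_day * 15  # 15 days of history -> 43200 rows
rows_per_host = 70 * rows_per_item          # 70 items per host  -> 3024000 rows
total = 20 * rows_per_host                  # 20 hosts
print(total)  # 60480000
```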

Now, I did see that there are separate history tables, but still, they can get huge really fast (let alone if I wanted to keep, say, 90 days of data). Adding table partitioning could be a good idea for future Zabbix versions (and it doesn't require application code changes, just DB schema changes):

http://www.postgresql.org/docs/current/interactive/ddl-partitioning.html
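
The routing side of that idea can be sketched as a tiny helper that maps a clock value to a per-month child table (the naming scheme here is made up; in PostgreSQL 8.x the actual mechanism would be table inheritance with CHECK constraints plus an insert trigger or rule):

```python
import datetime

def partition_for(clock: int) -> str:
    """Map a unix-timestamp clock to a hypothetical per-month partition name."""
    d = datetime.datetime.fromtimestamp(clock, tz=datetime.timezone.utc)
    return f"history_{d.year:04d}_{d.month:02d}"

print(partition_for(1307664000))  # a clock from June 2011 -> 'history_2011_06'
```

Dropping an expired month would then be a cheap "drop table" instead of millions of row deletes.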

Sincerely,

Ildefonso Camargo

Comment by dimir [ 2011 Jun 20 ]

Thanks for the input, Ildefonso!

To quickly answer your "why the 500-entry limit" question: this is only a matter of configuration. You are free to set the

MaxHousekeeperDelete

parameter in server config to up to 1000000 (more info: http://www.zabbix.com/documentation/1.8/complete#individual_processes).

Comment by Ildefonso Camargo [ 2011 Jun 20 ]

Duh... you are right; because it was all uppercase in the code, I thought it was a hard-coded constant.

Comment by dimir [ 2011 Jun 20 ]

One disadvantage of this solution: array constructors appeared in PostgreSQL 7.4, and officially zabbix-1.8 has to support 7.0.2 or later.

http://www.postgresql.org/docs/current/static/release-7-4.html (E.120.3.7. Data Type and Function Changes).
http://www.postgresql.org/docs/7.4/static/sql-expressions.html#SQL-SYNTAX-ARRAY-CONSTRUCTORS

We can raise the PostgreSQL version requirement for 2.0, but I'm not sure about 1.8.

Comment by dimir [ 2011 Jun 20 ]

We could check if array constructors are available like this:

select 1 from pg_proc where proname='string_to_array';

Then, if no rows are returned, use the old SQL to run the housekeeper jobs; otherwise use the new one.
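
That decision could be sketched like this (the actual housekeeper is C; the helper names and the `execute` callback below are made up to show only the shape of the check):

```python
# Old oid-based delete vs. the new ctid/array one; %d placeholders stand in
# for itemid and the MaxHousekeeperDelete limit.
OLD_SQL = ("delete from history where itemid=%d and oid in "
           "(select oid from history where itemid=%d limit %d)")
NEW_SQL = ("delete from history where ctid = any (array("
           "select ctid from history where itemid=%d limit %d))")

def has_array_constructors(execute) -> bool:
    """Probe pg_proc for a function that appeared together with array support."""
    rows = execute("select 1 from pg_proc where proname='string_to_array'")
    return len(rows) > 0

def pick_housekeeper_sql(execute) -> str:
    return NEW_SQL if has_array_constructors(execute) else OLD_SQL

# A stub mimicking a pre-7.4 server that returns no rows for the probe:
print(pick_housekeeper_sql(lambda sql: []) is OLD_SQL)  # True
```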

Comment by Ildefonso Camargo [ 2011 Jun 20 ]

Hi!

Well, maybe it shouldn't fall back to the old query. This query:

delete from history where itemid=22344 and clock in (select clock from history where itemid=22344 limit 500);

should work even with the old 7.0 pgsql, and is better than the original one (and doesn't require adding the oid index, thus simplifying the upgrade process). See, since we are deleting all rows from the table where itemid=N, it doesn't really matter that clock isn't unique for the same itemid.

And, of course, for newer versions we should use the newer one:

delete from history where ctid = any(array(select ctid from history where itemid=18468 limit 500));

I don't have a 7.0 pgsql lying around... so I can't verify the array-support check query you sent me; it should work (as far as I can remember). However, I would bet no one is running 7.0 these days.

About the supported version for 2.0, I would recommend supporting 8.3 and up (because of the many improvements, and the fact that it will be supported for at least a couple more years).

Sincerely,

Ildefonso.

Comment by Joshua D. Drake [ 2011 Jun 20 ]

I think it makes the most sense to only support versions that PostgreSQL.Org supports, even for 1.8. As of January 1st, 2012, only 8.3+ will be supported, and 8.2 goes out the door in 12/2011. See here:

http://wiki.postgresql.org/wiki/PostgreSQL_Release_Support_Policy

Sincerely,

Joshua D. Drake
President
United States PostgreSQL

Comment by dimir [ 2011 Jun 21 ]

I have checked it on PostgreSQL 7.2 and the parser was complaining about "array".

Comment by dimir [ 2011 Jun 21 ]

Thank you Joshua, we will definitely use that table for upcoming 2.0 release.

Comment by dimir [ 2011 Jun 21 ]

We have discussed it and decided to check for PostgreSQL version and execute query accordingly in 1.8. In 2.0 it will be executed with CTID and array without checks.

Comment by Ildefonso Camargo [ 2011 Jun 22 ]

Great!

Do you have a patch of your current changes, so that I can apply it to 1.8.5 and test it on our servers (and give feedback)?

Thanks!

Ildefonso.

Comment by dimir [ 2011 Jun 22 ]

Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBX-3877

Comment by dimir [ 2011 Jun 22 ]

Ildefonso, I will attach the patch as soon as it is reviewed.

Comment by richlv [ 2011 Jun 22 ]

also note that you can always try out the development branch

Comment by dimir [ 2011 Jul 07 ]

Fixed in 1.8 r20464, trunk r20465 .

Generated at Sat Apr 27 04:38:01 EEST 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.