ZABBIX BUGS AND ISSUES

Slow query on housekeeper with PostgreSQL

Details

  • Type: Improvement
  • Status: Closed
  • Priority: Major
  • Resolution: Fixed
  • Affects Version/s: 1.8.5
  • Fix Version/s: 1.8.6, 1.9.5 (alpha)
  • Component/s: Server (S)
  • Labels:
    None
  • Environment:
    Ubuntu Lucid, 64-bit. DB PostgreSQL 9.0.3. ~24 GB RAM, RAID10 array. Shared database (other apps use this DB).
  • Zabbix ID:
    NA

Description

Greetings,

First: thanks for this great monitoring system. We use it with PostgreSQL on several production servers, and we will be able to test and give feedback/patches for any PostgreSQL issues/improvements we may find in Zabbix.

I caught this query being a slow one (more than 80 seconds of query duration):

delete from history where itemid=22330 and oid in (select oid from history where itemid=22330 limit 500);

After taking a look at the code, it comes from housekeeper.c, around line 93 (in the 1.8.5 source).

Anyway, I just added this index:

create index idx_oid on history (oid);

Please note the index is *not* unique. With it in place, the query now takes only 1.5 seconds.
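A plain EXPLAIN on the delete is one way to confirm the planner actually picks the new index up, without deleting anything (a sketch; the exact plan output depends on your data and PostgreSQL version -- the thing to look for is whether the oid lookup in the outer query can use an index at all):

explain
delete from history where itemid=22330
    and oid in (select oid from history where itemid=22330 limit 500);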

This is a Major issue because it was actually slowing down the whole server (there are other apps using this DB server).

I think you should include that index in the DB schema.

One other (related) issue: that delete query looks a little scary to me, because it doesn't precisely identify which rows it has to delete. I'd feel safer if you used a more specific query, something like:

delete from history where itemid=22330 and oid in (select oid from history where itemid=22330 order by clock asc limit 500);

That sorts by clock and deletes the 500 oldest rows; this query currently takes around 2 seconds to run.

I hope this helps to improve this already good monitoring system. I'll take a closer look at the DB usage and report any other findings I come up with.

Sincerely,

Ildefonso Camargo

Activity

richlv added a comment -

a quick summary of findings so far:

zabbix_server/housekeeper/housekeeper.c in 1.8 head, line 105, references oid inside #elif defined(HAVE_POSTGRESQL);
it's not done for other databases, and there are no comments that would explain why oid is used there.

oid appeared there in revision 3291 by osmiy back on 2006-09-08;
that revision did lots of things, so it's not apparent to which change the oid appearance could be attributed

Ildefonso Camargo added a comment -

Hi richlv,

Well, I believe you use oid there because you don't have another key to identify each row, and you don't use serial/bigserial ID columns (which would make the oid unnecessary). Maybe just using itemid and clock would do the trick.

Ildefonso.

dimir added a comment -

Reproduced the issue with 20 million entries.

"select count(oid) from history"
27189967

"delete from history where itemid=18468 and oid in (select oid from history where itemid=18468 limit 500)"
DELETE 500

real 1m6.152s

dimir added a comment -

After adding an index:

"delete from history where itemid=18468 and oid in (select oid from history where itemid=18468 limit 500)"
DELETE 500

real 0m0.069s

"delete from history where itemid=18468 and oid in (select oid from history where itemid=18468 order by clock asc limit 500)"
DELETE 500

real 0m0.434s

dimir added a comment -

Just for information: with 27 million entries in the history table, the oid index takes roughly 580 MB of additional disk space:

without index:

 schemaname | tablename | size_pretty | total_size_pretty
------------+-----------+-------------+-------------------
 public     | history   | 2003 MB     | 3313 MB

with index:

 schemaname | tablename | size_pretty | total_size_pretty
------------+-----------+-------------+-------------------
 public     | history   | 2003 MB     | 3896 MB

Ildefonso Camargo added a comment -

Greetings,

I actually found this query to be faster:

delete from history where oid in (select oid from history where itemid=22344 order by clock limit 500);

So we filter by oid alone in the outer query, taking the oids from the inner select.

This one takes only around 74ms on one of the servers.

Ildefonso.

Ildefonso Camargo added a comment -

Hi,

I just came up with this one:

delete from history where itemid=22344 and clock in (select clock from history where itemid=22344 order by clock asc limit 500);

It just came to mind as a way to remove the OID (which, apparently, is deprecated now).

Now, I have one question about this: why is the "LIMIT 500" needed? Can't we just do something like:

delete from history where itemid=22344 and clock<some_time;

Sincerely,

Ildefonso Camargo

Ildefonso Camargo added a comment -

Hi!

Sorry, for some reason I confused this query with the one used to clean up older history data.

Now, I believe this function (housekeeping_process_log) is used to delete all data for a particular item... is it? It takes its work from the housekeeper table (which is populated by DBdelete_history_by_itemids).

If that's the case, maybe this query will work:

delete from history where itemid=22344 and clock in (select clock from history where itemid=22344 limit 500);

because I believe clock is unique for a particular itemid (since the sample rate can't go below 1 second). I'm trying to avoid the overhead of the oid index (and, maybe, of oid altogether).

I'm still working my way a little deeper into the Zabbix code.

Sincerely,

Ildefonso Camargo.

dimir added a comment -

Thanks. I have dug a bit more into it and here are my results.

Let's run the benchmarks on a separate box with PostgreSQL 8.4.7, where
nobody is using the DB and the machine is almost idle. Let's run the
queries from the psql command line with \timing turned on.

The history table has 27 million records:

select count(oid) from history;
  count
----------
 27466123
(1 row)

Time: 24393.350 ms

The current version of the history cleanup query:

delete from history where itemid=18468 and oid in
    (select oid from history where itemid=18468 limit 500);
DELETE 500
Time: 21605.729 ms

Same with ordering:

delete from history where itemid=18468 and oid in
    (select oid from history where itemid=18468 order by clock asc limit 500);
DELETE 500
Time: 22670.892 ms

Same without 'where itemid=18468' in the outer query:

delete from history where oid in
    (select oid from history where itemid=18468 order by clock asc limit 500);
DELETE 500
Time: 21828.539 ms

Not much improvement so far. Let's continue.

Same without using oid (your last suggestion):

delete from history where itemid=18468 and clock in
    (select clock from history where itemid=18468 order by clock asc limit 500);
DELETE 500
Time: 65.273 ms

Now this looks fine, except that we have to rely on the fact that there
cannot be more than one value per second. This worries me a bit because
of trapper items, where you send the values yourself and there is no
limit on how fast you do it.

Then I found something different while looking for an answer
on the PostgreSQL forums (see below).

Using ctid with 'any (array)':

delete from history where ctid = any (array(select ctid from history
    where itemid=18468 order by clock asc limit 500));
DELETE 500
Time: 3.597 ms

Now this looks like a decent solution, and we do not rely on there
being at most one value per second. CTID is one of the system columns:

http://www.postgresql.org/docs/8.4/static/ddl-system-columns.html
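For reference: ctid is the physical location of a row, a (page, row-within-page) pair, and it can change when a row is updated or the table is rewritten (e.g. by VACUUM FULL), which is why it should only be relied on within a single statement, as the delete above does. A quick way to look at it (illustrative, using the same itemid as the benchmarks):

select ctid, itemid, clock from history where itemid=18468 order by clock limit 3;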

Ildefonso Camargo added a comment -

Hi!

Thanks for working on this, Vladimir!

Yes, that one looks perfect. At some point along the way I tried ctid with worse results (but in a different query). That one looks great: on my overloaded machine it takes just 5 seconds (down from the 25 my latest attempt was taking). The problem with this machine is that my indexes get "kicked out" of cache by the rest of the system's load, hence the 5-second run time (if I run the query twice in a row, I get <10 ms times).

Anyway, while reading the source it looks like the ordering is not necessary, because this query is (apparently) used to delete all of a host's data... is that what the query is used for? (deleting a whole host/item). If so, we could rewrite it as:

delete from history where ctid = any (array(select ctid from history where itemid=18468 limit 500));

That makes it even faster, because if we are going to completely delete an item's data, we don't really care in which order the rows are deleted. Also, deleting the data in batches of 500 rows per hour (according to the logs, the housekeeper was running these once per hour) may be too slow... why not delete 2000~5000 at a time? A common item, sampled at a 30-second interval with 15 days of history, would take over 80 hours to delete at this rate (500/hour), and the server has to do an index scan just to delete those 500 rows...

Another suggestion I could give: table partitioning. From what I have read, the history* tables can get huge really quickly. Let's say we have 20 hosts and 70 items per host:

Most items get sampled at a 30-second rate, so that's 120 entries per hour, 24 hours a day:

120 * 24 = 2,880 rows per item per day.

2,880 rows/day * 15 days = 43,200 rows per item.

70 items/host * 43,200 rows/item = 3,024,000 rows/host.

And now, 20 hosts (just 20 hosts): ~60,480,000 rows.

Now, I did see that there are several history tables, but even so they can get huge very fast (let alone if I wanted to keep, say, 90 days of data). Adding table partitioning could be a good idea for future Zabbix versions (it doesn't require application code changes, just DB schema changes):

http://www.postgresql.org/docs/current/interactive/ddl-partitioning.html
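To make the suggestion concrete, here is a minimal sketch of the inheritance-based partitioning those docs describe, for one monthly partition (the partition name, the unix-time boundaries for June 2011, and the trigger body are illustrative assumptions, not part of the Zabbix schema):

-- one child table per month, constrained by clock range
create table history_y2011m06 (
    check (clock >= 1306886400 and clock < 1309478400)  -- June 2011, unix time
) inherits (history);

create index history_y2011m06_1 on history_y2011m06 (itemid, clock);

-- route inserts on the parent to the matching child
create or replace function history_insert_trigger() returns trigger as $$
begin
    if new.clock >= 1306886400 and new.clock < 1309478400 then
        insert into history_y2011m06 values (new.*);
    else
        raise exception 'history_insert_trigger: clock out of range';
    end if;
    return null;  -- suppress the insert into the parent table
end;
$$ language plpgsql;

create trigger history_insert before insert on history
    for each row execute procedure history_insert_trigger();

With this in place, expiring a whole month of history becomes a single "drop table history_y2011m06;" instead of millions of row deletes.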

Sincerely,

Ildefonso Camargo

dimir added a comment -

Thanks for the input, Ildefonso!

To quickly answer your "why the 500-entry limit" question: this is only a matter of configuration. You are free to set the

MaxHousekeeperDelete

parameter in the server config to anything up to 1000000 (more info: http://www.zabbix.com/documentation/1.8/complete#individual_processes).
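For example, in zabbix_server.conf (the value is illustrative; 500 is the default discussed above):

# rows deleted per housekeeper task in one cycle; raise it for faster cleanup
MaxHousekeeperDelete=5000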

Ildefonso Camargo added a comment -

Duh... you are right. Because it was all uppercase in the code, I thought it was a hard-coded constant.

dimir added a comment -

One disadvantage of this solution: array constructors appeared in PostgreSQL 7.4, and officially zabbix-1.8 has to support 7.0.2 or later.

http://www.postgresql.org/docs/current/static/release-7-4.html (E.120.3.7. Data Type and Function Changes).
http://www.postgresql.org/docs/7.4/static/sql-expressions.html#SQL-SYNTAX-ARRAY-CONSTRUCTORS

We can raise the PostgreSQL version requirement for 2.0, but I am not sure about 1.8.

dimir added a comment -

We could check if array constructors are available like this:

select 1 from pg_proc where proname='string_to_array';

And then, if no rows are returned, use the old SQL to run the housekeeper jobs; otherwise use the new one.
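Sketched out (using the itemid from the benchmarks above as a placeholder), the branch would be:

-- capability probe: a row back means 7.4+, where array constructors appeared
select 1 from pg_proc where proname='string_to_array';

-- 7.4+: the fast ctid/array variant
delete from history where ctid = any (array(
    select ctid from history where itemid=18468 limit 500));

-- pre-7.4 fallback: the existing oid-based query
delete from history where itemid=18468 and oid in
    (select oid from history where itemid=18468 limit 500);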

Ildefonso Camargo added a comment -

Hi!

Well, maybe it shouldn't fall back to the old query; this one:

delete from history where itemid=22344 and clock in (select clock from history where itemid=22344 limit 500);

should work with the older 7.0 pgsql and is better than the original one (and doesn't require adding the oid index, which simplifies the upgrade process). Since we are deleting all rows with itemid=N from the table, it doesn't really matter that clock isn't unique for the same itemid.

And, of course, for newer versions we should use the newer one:

delete from history where ctid = any(array(select ctid from history where itemid=18468 limit 500));

I don't have a 7.0 pgsql lying around, so I can't verify the array-support check query you sent me; it should work (as far as I can remember). However, I would bet nobody is running 7.0 these days.

As for the supported version for 2.0, I would recommend supporting 8.3 and up (because of the many improvements, and the fact that it will be supported for at least a couple more years).

Sincerely,

Ildefonso.

Joshua D. Drake added a comment -

I think it makes the most sense to support only versions that PostgreSQL.Org supports, even for 1.8. As of January 1st, 2012, only 8.3+ will be supported, and 8.2 goes out the door in 12/2011. See here:

http://wiki.postgresql.org/wiki/PostgreSQL_Release_Support_Policy

Sincerely,

Joshua D. Drake
President
United States PostgreSQL

dimir added a comment -

I have checked it on PostgreSQL 7.2, and the parser complained about "array".

dimir added a comment -

Thank you, Joshua, we will definitely use that table for the upcoming 2.0 release.

dimir added a comment -

We have discussed it and decided to check the PostgreSQL version in 1.8 and execute the query accordingly. In 2.0 it will be executed with CTID and array, without checks.

Ildefonso Camargo added a comment -

Great!

Do you have a patch with your current changes that I can apply to 1.8.5 and test on our servers (and give feedback)?

Thanks!

Ildefonso.

dimir added a comment -

Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBX-3877

dimir added a comment -

Ildefonso, I will attach the patch as soon as it is reviewed.

richlv added a comment -

also note that you can always try out the development branch

dimir added a comment -

Fixed in 1.8 r20464, trunk r20465.

