[ZBX-15222] Housekeeping not working Created: 2018 Nov 27  Updated: 2024 Apr 10  Resolved: 2019 Oct 16

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Server (S)
Affects Version/s: 3.0.24
Fix Version/s: 4.0.14rc1, 4.2.8rc1, 4.4.1rc1, 5.0.0alpha1, 5.0 (plan)

Type: Problem report Priority: Major
Reporter: Sean Nienaber Assignee: Vladislavs Sokurenko
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File housekeeper.txt     Text File housekeeper.txt     PNG File image-2019-01-04-11-01-59-941.png     PNG File image-2019-01-04-11-02-46-503.png     PNG File image-2019-01-04-12-32-34-074.png    
Issue Links:
Duplicate
is duplicated by ZBX-17069 housekeeper deleted 0 hist/trends Closed
Sub-task
depends on ZBXNEXT-5176 Take in consideration history storage... Closed
Team: Team A
Sprint: Sprint 47, Dec 2018, Sprint 56 (Sep 2019), Sprint 55 (Aug 2019), Sprint 49 (Feb 2019), Sprint 50 (Mar 2019), Sprint 51 (Apr 2019), Sprint 52 (May 2019), Sprint 53 (Jun 2019), Sprint 54 (Jul 2019), Sprint 57 (Oct 2019)
Story Points: 0.5

 Description   

Hi,

Either there is a bug or something I don't understand, apologies if it's the latter!

We're planning an upgrade from v3.4.14 to v4 and in anticipation of a large and slow data upgrade (85GB DB) we've reduced the history period to 1d in Administration > Housekeeping.  Typical values are 1w to 2w.

As data sizes didn't reduce significantly, we looked in the DB and found very old data which hasn't been removed by the housekeeper (which we've confirmed is running via the log, and have also run manually).

Current settings: 

Events and alerts
Enable internal housekeeping YES

Trigger data storage period 90d
Internal data storage period 30d
Network discovery data storage period 7d
Auto-registration data storage period 7d

Services
Enable internal housekeeping YES
Data storage period 7d

Audit
Enable internal housekeeping YES
Data storage period 7d

User sessions
Enable internal housekeeping YES
Data storage period 7d

History
Enable internal housekeeping YES
Override item history period YES
Data storage period 1d

Trends
Enable internal housekeeping YES
Override item trend period YES
Data storage period 365d

Data which we see in the DB is as follows:

  • events: 2017-09-27
  • history: 2018-09-14
  • history_text: 2016-07-11
  • history_uint: 2014-10-14
  • trends: 2014-10-08
  • trends_uint: 2014-10-08

Data sizes and row counts:
 Table  Size in MB  Row count
 trends_uint 21,220 307,061,092
 history_uint 19,208 45,394,092
 event_recovery 13,271 62,080,921
 EVENTS 12,427 152,549,952
 trends 2,612 33,321,085
 history 1,802 3,772,682
 history_text 1,695 1,911,462
 history_str 567 3,605,497
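
(As a side note, such sizes can be pulled from MySQL with an information_schema query; a sketch below, assuming the schema is named 'zabbix'. InnoDB row counts there are estimates.)

select table_name,
       round((data_length + index_length) / 1024 / 1024) as size_mb,
       table_rows
from information_schema.tables
where table_schema = 'zabbix'
order by size_mb desc;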

 

Thanks!



 Comments   
Comment by Vladislavs Sokurenko [ 2018 Nov 27 ]

Can you please provide the output of the following command: ps -ax | grep housekeeper

Comment by Sean Nienaber [ 2018 Nov 28 ]

Here you go:

 

root@VIAMON01:~# ps -ax | grep housekeeper
8532 ? S 75:51 /usr/sbin/zabbix_server: housekeeper [deleted 1072744 hist/trends, 0 items/triggers, 485 events, 0 sessions, 0 alarms, 48 audit items in 326.620899 sec, idle for 1 h
9599 pts/0 S+ 0:00 grep --color=auto housekeeper
root@VIAMON01:~#

 

And this is in the current log:

8532:20181127:230214.167 executing housekeeper
8532:20181127:230757.736 housekeeper [deleted 1109245 hist/trends, 0 items/triggers, 582 events, 49160 problems, 0 sessions, 0 alarms, 48 audit items in 343.568039 sec, idle for 1 hour(s)]
8532:20181128:000758.490 executing housekeeper
8532:20181128:001339.866 housekeeper [deleted 1081922 hist/trends, 0 items/triggers, 632 events, 41895 problems, 0 sessions, 0 alarms, 48 audit items in 341.374635 sec, idle for 1 hour(s)]
8532:20181128:011340.629 executing housekeeper
8532:20181128:011923.259 housekeeper [deleted 1074623 hist/trends, 0 items/triggers, 616 events, 46619 problems, 0 sessions, 0 alarms, 48 audit items in 342.629022 sec, idle for 1 hour(s)]
8532:20181128:021924.059 executing housekeeper
8532:20181128:022440.779 housekeeper [deleted 1067853 hist/trends, 0 items/triggers, 645 events, 43583 problems, 0 sessions, 0 alarms, 48 audit items in 316.719123 sec, idle for 1 hour(s)]
8532:20181128:032443.143 executing housekeeper
8532:20181128:033006.734 housekeeper [deleted 1066194 hist/trends, 0 items/triggers, 4181 events, 44856 problems, 0 sessions, 0 alarms, 48 audit items in 323.589667 sec, idle for 1 hour(s)]
8532:20181128:043007.437 executing housekeeper
8532:20181128:043617.637 housekeeper [deleted 1067690 hist/trends, 0 items/triggers, 11451 events, 42967 problems, 0 sessions, 0 alarms, 48 audit items in 370.199575 sec, idle for 1 hour(s)]
8532:20181128:053618.360 executing housekeeper
8532:20181128:054140.004 housekeeper [deleted 1073042 hist/trends, 0 items/triggers, 475 events, 48047 problems, 0 sessions, 0 alarms, 48 audit items in 321.642726 sec, idle for 1 hour(s)]
8532:20181128:064140.767 executing housekeeper
8532:20181128:064707.390 housekeeper [deleted 1072744 hist/trends, 0 items/triggers, 485 events, 46757 problems, 0 sessions, 0 alarms, 48 audit items in 326.620899 sec, idle for 1 hour(s)]

 

Comment by Sean Nienaber [ 2018 Nov 28 ]

Hi Vladislavs,

Here's the housekeeper config:

 

### Option: HousekeepingFrequency
# How often Zabbix will perform housekeeping procedure (in hours).
# Housekeeping is removing unnecessary information from history, alert, and alarms tables.
#
# Mandatory: no
# Range: 1-24
# Default:
# HousekeepingFrequency=1
### Option: MaxHousekeeperDelete
# The table "housekeeper" contains "tasks" for housekeeping procedure in the format:
# [housekeeperid], [tablename], [field], [value].
# No more than 'MaxHousekeeperDelete' rows (corresponding to [tablename], [field], [value])
# will be deleted per one task in one housekeeping cycle.
# SQLite3 does not use this parameter, deletes all corresponding rows without a limit.
# If set to 0 then no limit is used at all. In this case you must know what you are doing!
#
# Mandatory: no
# Range: 0-1000000
# Default:
# MaxHousekeeperDelete=500

 

As you can see it's all defaults, which I thought could be the problem, but from the log it seems that far more than the default of 500 rows are being deleted in each cycle?

S

Comment by Sean Nienaber [ 2018 Nov 29 ]

Hi Vladislavs,

Any thoughts on this?

 

Thanks,

S

Comment by Vladislavs Sokurenko [ 2018 Nov 29 ]

It deletes about 1 million records per hour, and I believe this is what is happening:

How often Zabbix will perform housekeeping procedure (in hours).
Housekeeping is removing outdated information from the database.
Note: To prevent housekeeper from being overloaded (for example, when history and trend periods are greatly reduced), no more than 4 times HousekeepingFrequency hours of outdated information are deleted in one housekeeping cycle, for each item. Thus, if HousekeepingFrequency is 1, no more than 4 hours of outdated information (starting from the oldest entry) will be deleted per cycle.
Note: To lower load on server startup housekeeping is postponed for 30 minutes after server start. Thus, if HousekeepingFrequency is 1, the very first housekeeping procedure after server start will run after 30 minutes, and will repeat with one hour delay thereafter. This postponing behavior is in place since Zabbix 2.4.0.
Since Zabbix 3.0.0 it is possible to disable automatic housekeeping by setting HousekeepingFrequency to 0. In this case the housekeeping procedure can only be started by housekeeper_execute runtime control option and the period of outdated information deleted in one housekeeping cycle is 4 times the period since the last housekeeping cycle, but not less than 4 hours and not greater than 4 days.
See also runtime control options. 

Currently there is no indication of a bug. Did table sizes decrease since the issue was reported?
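
As a rough worked example of that limit: with HousekeepingFrequency=1, at most 4 hours of outdated data per item is removed in each hourly cycle, so an item whose oldest records go back about 4 years (roughly 35,000 hours) would need on the order of 35,000 / 4 ≈ 8,750 hourly cycles, i.e. about a year of uninterrupted housekeeping, before its backlog is gone.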

Comment by Sean Nienaber [ 2018 Nov 30 ]

Hi Vladislavs,

Unfortunately, it's actually increased.

Table  Size in MB  Row count
trends_uint 21,272 308,960,609
history_uint 19,207 47,716,390
event_recovery 13,756 69,603,359
EVENTS 12,862 168,186,559
trends 2,612 36,170,968
history_text 1,691 1,847,822
history 1,612 3,453,486
history_str 567 3,333,972

What I'm not clear on is why, if the housekeeper runs with our settings, we still have data going back to 2014?

Is it worth setting MaxHousekeeperDelete to 10,000,000 to get the old records cleared out?

Comment by Vladislavs Sokurenko [ 2018 Nov 30 ]

Sorry, but MaxHousekeeperDelete is unrelated.
Does this value change after you execute the housekeeper manually?

select min(clock) from history_uint;
Comment by Sean Nienaber [ 2018 Nov 30 ]

1412783475 - Wednesday, 8 October 2014 15:51:15

This is when we first deployed Zabbix.

 

 

Comment by Sean Nienaber [ 2018 Nov 30 ]

Vladislavs, is there any risk in me running a delete on all values in the tables where the clock is less than 1514764800 (2018-01-01)?

Maybe this is the quickest way to resolve the left over data?

 

S

Comment by Edgars Melveris [ 2018 Dec 03 ]

Sean, yes, you can do that on the history tables, but if housekeeper has not deleted those old records, this might be a bug.
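
A minimal sketch of that kind of manual cleanup (assuming the history_uint table and the 2018-01-01 cutoff discussed above; deleting in limited batches keeps each transaction small, so repeat until 0 rows are affected):

delete from history_uint where clock < 1514764800 limit 100000;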

Or maybe, some items have been configured with very long history?

Can you check the itemid in those old history records and take a look at the item table, to see how long the history should be for those items?

A more recommended solution for removing history would be to monitor the housekeeper process like this:

grep housekeeper /var/log/zabbix/zabbix_server.log

and manually execute the housekeeper when you see that it is idle:

zabbix_server -R housekeeper_execute

It would be really useful for us to see if there is a problem with the housekeeper, and whether those old items' history will be deleted by the housekeeper once it finishes deleting all the other history.
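
A sketch of the per-item check mentioned above, joining the oldest history rows to the items table so the configured history period is visible next to each old record (history_uint is used here as an example; the same works for the other history_* tables):

select h.itemid, from_unixtime(min(h.clock)) as oldest, i.history, i.trends
from history_uint h join items i on i.itemid = h.itemid
group by h.itemid, i.history, i.trends
order by oldest asc
limit 10;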

Comment by Glebs Ivanovskis [ 2018 Dec 03 ]

Item configuration should not matter because

we've reduced the history period to 1d in Administration > Housekeeping

Comment by Edgars Melveris [ 2018 Dec 03 ]

The item might have had that history setting before the override was introduced. I am not sure if the housekeeper deletes the oldest records first; if not, it might not have gotten to these oldest ones yet.

Comment by Sean Nienaber [ 2018 Dec 03 ]

Thank you both for your responses.

The override on the history (and events etc.) was a recent addition to try to reduce the DB size ahead of the v3 to v4 upgrade which we have planned.

Here are the last two runs of the housekeeper:

8532:20181203:101202.850 housekeeper [deleted 1056775 hist/trends, 124 items/triggers, 674 events, 53768 problems, 0 sessions, 0 alarms, 451 audit items in 404.063582 sec, idle for 1 hour(s)]

 8532:20181203:111921.769 housekeeper [deleted 1031998 hist/trends, 0 items/triggers, 636 events, 55726 problems, 0 sessions, 0 alarms, 224 audit items in 438.031691 sec, idle for 1 hour(s)]

Each time, just over 1m records are deleted and it runs for about 7min.  

Is there a way for it to run for longer?

And to Edgar's point, will it delete the oldest data first, or the newest?

Comment by Glebs Ivanovskis [ 2018 Dec 03 ]

It seems that housekeeper may not delete oldest records of deleted items if they are way older than the oldest records of other items.

 * Comments: If item is added to delete queue, its oldest record timestamp    *
 *           (min_clock) is updated to the calculated 'cutoff' value.         *

vso thanks a lot glebs ! you mean that no longer than a day can be deleted in that case ?

cyclone I mean that when you do select min(clock) from history; you may be looking at deleted items and maybe housekeeper gave up on them. I think it's worth checking oldest records by item:

select itemid,min(clock) as min_clock from history group by itemid order by min_clock asc limit 10;

Then check if these items still exist:

select * from items where itemid in (...);

vso that's interesting, possible cause is described in ZBX-13140

Note that a similar problem exists for history, since new history will be added while the item remains in cache, and this can happen after the housekeeper was executed. But history will still be cleared after the max storage period, while events will no longer be cleared.

Comment by Vladislavs Sokurenko [ 2018 Dec 03 ]

You could try something like this to determine the itemid with the oldest history:

select * from history_uint order by clock limit 1;

Then increase the log level for the housekeeper and see if there are any delete queries for that itemid:

zabbix_server -R log_level_increase=housekeeper
Comment by Sean Nienaber [ 2018 Dec 03 ]

Here you go:

 

8532:20181203:134431.452 forced execution of the housekeeper
 8532:20181203:134431.452 executing housekeeper
 8532:20181203:134431.452 __zbx_zbx_setproctitle() title:'housekeeper [connecting to the database]'
 8532:20181203:134431.453 __zbx_zbx_setproctitle() title:'housekeeper [removing old history and trends]'
 8532:20181203:134813.685 __zbx_zbx_setproctitle() title:'housekeeper [removing old problems]'
 8532:20181203:134814.574 __zbx_zbx_setproctitle() title:'housekeeper [removing old events]'
 8532:20181203:134814.674 __zbx_zbx_setproctitle() title:'housekeeper [removing old sessions]'
 8532:20181203:134814.675 __zbx_zbx_setproctitle() title:'housekeeper [removing old service alarms]'
 8532:20181203:134814.675 __zbx_zbx_setproctitle() title:'housekeeper [removing old audit log items]'
 8532:20181203:134814.683 __zbx_zbx_setproctitle() title:'housekeeper [removing deleted items data]'
 8532:20181203:134814.683 query [txnlev:0] [select housekeeperid,tablename,field,value from housekeeper where tablename in ('history','history_log','history_str','history_text','history_uint','trends','trends_uint','events') order by tablename]
 8532:20181203:134814.684 housekeeper [deleted 324064 hist/trends, 0 items/triggers, 539 events, 11532 problems, 0 sessions, 0 alarms, 48 audit items in 223.230491 sec, idle for 1 hour(s)]
 8532:20181203:134814.684 __zbx_zbx_setproctitle() title:'housekeeper [deleted 324064 hist/trends, 0 items/triggers, 539 events, 0 sessions, 0 alarms, 48 audit items in 223.230491 sec, idle for 1 hour(s)]'

 

select itemid,min(clock) as min_clock from history group by itemid order by min_clock asc limit 10;
26924 1538909324
26936 1538909336
158300 1538909393
27032 1538909440
27110 1538909518
27113 1538909521
27116 1538909524
27119 1538909527
27122 1538909529
31953 1538909554
select * from items where itemid in (26924,26936,158300,27032,27110,27113,27116,27119,27122,31953);
26924 0     10121 Free disk space on l: in % vfs.fs.size[l:,pfree] 10m 90d 365d 0 0   %   0         0   \N \N     0         0 0 3     0 30d 0 0 0   0   \N
26936 0 public interfaces.ifTable.ifEntry.ifInOctets.1 10121 Free disk space on m: in % vfs.fs.size[m:,pfree] 10m 90d 365d 0 0   %   0         0   \N \N last("vfs.fs.size[m:,free]")/last("vfs.fs.size[m:,total]")   0         0 0 3     0 30d 0 0 0   0   \N
27032 0 public interfaces.ifTable.ifEntry.ifInOctets.1 10194 Free disk space on e: in % vfs.fs.size[e:,pfree] 10m 90d 365d 0 0   %   0         0   \N \N last("vfs.fs.size[e:,free]")/last("vfs.fs.size[e:,total]")   0         0 0 59     0 30d 0 0 0   0   \N
27110 0 public interfaces.ifTable.ifEntry.ifInOctets.1 10199 Free disk space on e: in % vfs.fs.size[e:,pfree] 10m 90d 365d 0 0   %   0         0   \N \N last("vfs.fs.size[e:,free]")/last("vfs.fs.size[e:,total]")   0         0 0 64     0 30d 0 0 0   0   \N
27113 0     10199 Free disk space on f: in % vfs.fs.size[f:,pfree] 10m 90d 365d 0 0 localhost %   0         0   \N \N last("vfs.fs.size[f:,free]")/last("vfs.fs.size[f:,total]")   0         0 0 64     0 30d 0 0 0   0   \N
27116 0 public interfaces.ifTable.ifEntry.ifInOctets.1 10199 Free disk space on g: in % vfs.fs.size[g:,pfree] 10m 90d 365d 0 0   %   0         0   \N \N last("vfs.fs.size[g:,free]")/last("vfs.fs.size[g:,total]")   0         0 0 64     0 30d 0 0 0   0   \N
27119 0     10199 Free disk space on h: in % vfs.fs.size[h:,pfree] 10m 90d 365d 0 0 localhost %   0         0   \N \N last("vfs.fs.size[h:,free]")/last("vfs.fs.size[h:,total]")   0         0 0 64     0 30d 0 0 0   0   \N
27122 0     10199 Free disk space on i: in % vfs.fs.size[i:,pfree] 10m 90d 365d 0 0 localhost %   0         0   \N \N last("vfs.fs.size[i:,free]")/last("vfs.fs.size[i:,total]")   0         0 0 64     0 30d 0 0 0   0   \N
31953 0     10084 Text Magic Balance textmagic.balance 10m 90d 365d 0 0       0         0   \N \N     0         0 0 1     0 30d 0 0 0   0   \N
158300 10     10807 Enclosure[1] Power hp-msa.pl[\{$MSA1},\{$USER},\{$PASSWORD},"data", "enclosure", 1, "enclosure-power"] 180 90d 365d 0 0   W   0         0   \N \N     0         0 4 905     0 30d 0 0 0   0   \N
select * from hosts where hostid in (10121,10121,10194,10199,10199,10199,10199,10199,10084,10807);
10084 \N VIAMON01 0 0   1 0 0 -1 2     0 0 0 0 \N 0 0 0 0 0     0 0 0   VIAMON01 0 \N   1 1
10121 \N VIAEX01 0 0   1 0 0 -1 2     0 0 0 0 \N 0 0 0 0 0     0 0 0   VIAEX01 0 \N   1 1
10194 11241 KRTHAYDC02 0 0   1 0 0 0 2     0 0 0 0 \N 0 0 0 0 0     0 0 0   KRTHAYDC02 0 \N   1 1
10199 11241 KRTHAYEXCH01 1 0 Get value from agent failed: cannot connect to [[192.168.0.9]:10050]: [113] No route to host 2 0 0 0 2     0 0 0 0 \N 0 0 0 0 0     0 0 0   KRTHAYEXCH01 0 \N   1 1
10807 10786 SeligUK-SAN01 0 0   0 0 0 -1 2     0 0 0 1 \N 0 0 0 0 0     0 0 0   SeligUK-SAN01 0 \N   1 1
Comment by Sean Nienaber [ 2018 Dec 03 ]

Hi guys,

I was just thinking about how big the problem is and where exactly the issue lies. I.e. we know there is data older than the housekeeper override, but how much?

Here's my query to work this out:

select 
 (select count(*) from history where clock < 1543590391) as 'history count', -- 3 days
 (select count(*) from history_uint where clock < 1543590391) as 'history_uint count', -- 3 days
 (select count(*) from history_text where clock < 1543590391) as 'history_text count', -- 3 days
 (select count(*) from events where clock < 1533308791) as 'events count', -- 4 months
 (select count(*) from trends where clock < 1512227191) as 'trends count', -- 1 year and 1 day
 (select count(*) from trends_uint where clock < 1512227191) as 'trends_uint count' -- 1 year and 1 day
from dual

The cutoffs used are just a little longer than the periods configured in the overrides:

history count          156,109
history_uint count     19,509,603
history_text count     1,139,830
events count           24,750,112
trends count           1,466
trends_uint count      7,609

So if we could address the history_uint and events tables, that would make a big difference.

 

S

Comment by Vladislavs Sokurenko [ 2018 Dec 04 ]

I am sorry, but could you please run the query that cyclone provided for all the history tables?

select itemid,min(clock) as min_clock from history_uint group by itemid order by min_clock asc limit 10;
select itemid,min(clock) as min_clock from trends_uint group by itemid order by min_clock asc limit 10;

Thanks a lot!

Comment by Sean Nienaber [ 2018 Dec 04 ]

Here you go:

select itemid,min(clock) as min_clock from history_uint group by itemid order by min_clock asc limit 10;
27735 1412783475
27736 1412783476
27738 1412783478
27729 1412783529
27730 1412783770
27731 1412783771
27746 1412786786
27728 1412789168
27740 1412801780
27732 1412814071
select itemid,min(clock) as min_clock from trends_uint group by itemid order by min_clock asc limit 10;

 

23943 1412769600
23944 1412769600
23946 1412769600
27729 1412780400
27730 1412780400
27731 1412780400
27735 1412780400
27736 1412780400
27738 1412780400
27630 1412784000

 

Comment by Vladislavs Sokurenko [ 2018 Dec 04 ]

Do those items exist ?

Comment by Sean Nienaber [ 2018 Dec 04 ]

Here you go:

select * from items where itemid in (23943,23944,23946,27630,27728,27729,27730,27731,27732,27735,27736,27738,27740,27746)
27728 0     10197 Total % CPU perf_counter["\Processor(_Total)\% Processor Time"] 5m 1w 365d 0 0   %   0         0   27727 \N     0         0 0 62     0 30d 0 0 0   0   \N
27729 0     10196 Total % CPU perf_counter["\Processor(_Total)\% Processor Time"] 5m 1w 365d 0 0   %   0         0   27727 \N     0         0 0 61     0 30d 0 0 0   0   \N
27730 0     10194 Total % CPU perf_counter["\Processor(_Total)\% Processor Time"] 5m 1w 365d 0 0   %   0         0   27727 \N     0         0 0 59     0 30d 0 0 0   0   \N
27731 0     10192 Total % CPU perf_counter["\Processor(_Total)\% Processor Time"] 5m 1w 365d 0 0   %   0         0   27727 \N     0         0 0 57     0 30d 0 0 0   0   \N
27732 0     10199 Total % CPU perf_counter["\Processor(_Total)\% Processor Time"] 5m 1w 365d 0 0   %   0         0   27727 \N     0         0 0 64     0 30d 0 0 0   0   \N
27735 0     10185 Total % CPU perf_counter["\Processor(_Total)\% Processor Time"] 5m 1w 365d 0 0   %   0         0   27727 \N     0         0 0 50     0 30d 0 0 0   0   \N
27736 0     10200 Total % CPU perf_counter["\Processor(_Total)\% Processor Time"] 5m 1w 365d 0 0   %   0         0   27727 \N     0         0 0 65     0 30d 0 0 0   0   \N
27738 0     10198 Total % CPU perf_counter["\Processor(_Total)\% Processor Time"] 5m 1w 365d 0 0   %   0         0   27727 \N     0         0 0 63     0 30d 0 0 0   0   \N
27740 0     10201 Total % CPU perf_counter["\Processor(_Total)\% Processor Time"] 5m 1w 365d 0 0   %   0         0   27727 \N     0         0 0 66     0 30d 0 0 0   0   \N
27746 0     10189 Total % CPU perf_counter["\Processor(_Total)\% Processor Time"] 5m 1w 365d 0 0   %   0         0   27727 \N     0         0 0 54     0 30d 0 0 0   0   \N

vso I am sorry, it's very hard to read; could you please select just one value, for example:

select * from items where itemid=27728\G
Comment by Vladislavs Sokurenko [ 2018 Dec 05 ]

Could you please do:

show create table items;
select * from hosts where hostid=10197;
Comment by Sean Nienaber [ 2018 Dec 05 ]

Here you go:

show create table items;

 

"items" "CREATE TABLE `items` (
 `itemid` bigint(20) unsigned NOT NULL,
 `type` int(11) NOT NULL DEFAULT '0',
 `snmp_community` varchar(64) COLLATE utf8mb4_bin NOT NULL DEFAULT '',
 `snmp_oid` varchar(512) COLLATE utf8_bin NOT NULL DEFAULT '',
 `hostid` bigint(20) unsigned NOT NULL,
 `name` varchar(255) COLLATE utf8_bin NOT NULL DEFAULT '',
 `key_` varchar(255) COLLATE utf8_bin NOT NULL DEFAULT '',
 `delay` varchar(1024) COLLATE utf8_bin NOT NULL DEFAULT '0',
 `history` varchar(255) COLLATE utf8_bin NOT NULL DEFAULT '90d',
 `trends` varchar(255) COLLATE utf8_bin NOT NULL DEFAULT '365d',
 `status` int(11) NOT NULL DEFAULT '0',
 `value_type` int(11) NOT NULL DEFAULT '0',
 `trapper_hosts` varchar(255) COLLATE utf8_bin NOT NULL DEFAULT '',
 `units` varchar(255) COLLATE utf8_bin NOT NULL DEFAULT '',
 `snmpv3_securityname` varchar(64) COLLATE utf8_bin NOT NULL DEFAULT '',
 `snmpv3_securitylevel` int(11) NOT NULL DEFAULT '0',
 `snmpv3_authpassphrase` varchar(64) COLLATE utf8_bin NOT NULL DEFAULT '',
 `snmpv3_privpassphrase` varchar(64) COLLATE utf8_bin NOT NULL DEFAULT '',
 `formula` varchar(255) COLLATE utf8_bin NOT NULL DEFAULT '',
 `error` varchar(2048) COLLATE utf8_bin NOT NULL DEFAULT '',
 `lastlogsize` bigint(20) unsigned NOT NULL DEFAULT '0',
 `logtimefmt` varchar(64) COLLATE utf8_bin NOT NULL DEFAULT '',
 `templateid` bigint(20) unsigned DEFAULT NULL,
 `valuemapid` bigint(20) unsigned DEFAULT NULL,
 `params` text COLLATE utf8_bin NOT NULL,
 `ipmi_sensor` varchar(128) COLLATE utf8_bin NOT NULL DEFAULT '',
 `authtype` int(11) NOT NULL DEFAULT '0',
 `username` varchar(64) COLLATE utf8_bin NOT NULL DEFAULT '',
 `password` varchar(64) COLLATE utf8_bin NOT NULL DEFAULT '',
 `publickey` varchar(64) COLLATE utf8_bin NOT NULL DEFAULT '',
 `privatekey` varchar(64) COLLATE utf8_bin NOT NULL DEFAULT '',
 `mtime` int(11) NOT NULL DEFAULT '0',
 `flags` int(11) NOT NULL DEFAULT '0',
 `interfaceid` bigint(20) unsigned DEFAULT NULL,
 `port` varchar(64) COLLATE utf8_bin NOT NULL DEFAULT '',
 `description` text COLLATE utf8_bin NOT NULL,
 `inventory_link` int(11) NOT NULL DEFAULT '0',
 `lifetime` varchar(255) COLLATE utf8_bin NOT NULL DEFAULT '30d',
 `snmpv3_authprotocol` int(11) NOT NULL DEFAULT '0',
 `snmpv3_privprotocol` int(11) NOT NULL DEFAULT '0',
 `state` int(11) NOT NULL DEFAULT '0',
 `snmpv3_contextname` varchar(255) COLLATE utf8_bin NOT NULL DEFAULT '',
 `evaltype` int(11) NOT NULL DEFAULT '0',
 `jmx_endpoint` varchar(255) COLLATE utf8_bin NOT NULL DEFAULT '',
 `master_itemid` bigint(20) unsigned DEFAULT NULL,
 PRIMARY KEY (`itemid`),
 UNIQUE KEY `items_1` (`hostid`,`key_`),
 KEY `items_3` (`status`),
 KEY `items_4` (`templateid`),
 KEY `items_5` (`valuemapid`),
 KEY `items_6` (`interfaceid`),
 KEY `items_7` (`master_itemid`),
 CONSTRAINT `c_items_1` FOREIGN KEY (`hostid`) REFERENCES `hosts` (`hostid`) ON DELETE CASCADE,
 CONSTRAINT `c_items_2` FOREIGN KEY (`templateid`) REFERENCES `items` (`itemid`) ON DELETE CASCADE,
 CONSTRAINT `c_items_3` FOREIGN KEY (`valuemapid`) REFERENCES `valuemaps` (`valuemapid`),
 CONSTRAINT `c_items_4` FOREIGN KEY (`interfaceid`) REFERENCES `interface` (`interfaceid`),
 CONSTRAINT `c_items_5` FOREIGN KEY (`master_itemid`) REFERENCES `items` (`itemid`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_bin"

 

select * from hosts where hostid=10197;
10197 11241 KRTHAYDC01 0 0   1 0 0 0 2     0 0 0 0 \N 0 0 0 0 0     0 0 0   KRTHAYDC01 0 \N   1 1        

 

Comment by Sean Nienaber [ 2018 Dec 05 ]

I'm sorry, I don't understand?

Comment by Vladislavs Sokurenko [ 2018 Dec 05 ]

Please execute following query
select * from items where itemid=27728\G
It would also be nice to have the log level increased for the housekeeper and to attach the log.

Comment by Sean Nienaber [ 2018 Dec 05 ]

The logging was increased previously, unless you mean increasing it further? Here's the grep'd log: housekeeper.txt

select * from items where itemid=27728
27728 0     10197 Total % CPU perf_counter["\Processor(_Total)\% Processor Time"] 5m 1w 365d 0 0   %   0         0   27727 \N     0         0 0 62     0 30d 0 0 0   0   \N
Comment by Vladislavs Sokurenko [ 2018 Dec 05 ]

I mean exactly

select * from items where itemid=27728\G

with \G at the end, otherwise it's not readable

Comment by Vladislavs Sokurenko [ 2018 Dec 05 ]

Also, you can try grepping the log for itemid 27728.

Comment by Sean Nienaber [ 2018 Dec 05 ]

Right, I'm running these from HeidiSQL, here you go:

mysql> select * from items where itemid=27728\G;
*************************** 1. row ***************************
 itemid: 27728
 type: 0
 snmp_community:
 snmp_oid:
 hostid: 10197
 name: Total % CPU
 key_: perf_counter["\Processor(_Total)\% Processor Time"]
 delay: 5m
 history: 1w
 trends: 365d
 status: 0
 value_type: 0
 trapper_hosts:
 units: %
 snmpv3_securityname:
 snmpv3_securitylevel: 0
snmpv3_authpassphrase:
snmpv3_privpassphrase:
 formula:
 error:
 lastlogsize: 0
 logtimefmt:
 templateid: 27727
 valuemapid: NULL
 params:
 ipmi_sensor:
 authtype: 0
 username:
 password:
 publickey:
 privatekey:
 mtime: 0
 flags: 0
 interfaceid: 62
 port:
 description:
 inventory_link: 0
 lifetime: 30d
 snmpv3_authprotocol: 0
 snmpv3_privprotocol: 0
 state: 0
 snmpv3_contextname:
 evaltype: 0
 jmx_endpoint:
 master_itemid: NULL
1 row in set (0.00 sec)

27728 wasn't found in the log, but I had debug at 3 and a file size of 100MB, so the log file was truncating too quickly. I've now set debug to 2, increased the log file to 500MB and increased debug on the housekeeper only.

I'll check it in the morning again.

 

Comment by Vladislavs Sokurenko [ 2018 Dec 05 ]

I see that history for itemid 27728 has a timestamp of 10/08/2014 while the item is still present, as is the host; both item and host status is 0, which is OK.

Can you please grep log to see if there are such entries at all:

delete from history_uint
Comment by Sean Nienaber [ 2018 Dec 06 ]

I've increased the housekeeper log to 4 and manually started the housekeeper and yes I see many:

14157:20181206:090209.429 query without transaction detected
 14157:20181206:090209.429 query [txnlev:0] [delete from history_uint where itemid=57205 and clock<1544000463]
 14157:20181206:090209.430 query without transaction detected
 14157:20181206:090209.430 query [txnlev:0] [delete from history_uint where itemid=57206 and clock<1544000463]
 14157:20181206:090209.430 query without transaction detected
 14157:20181206:090209.430 query [txnlev:0] [delete from history_uint where itemid=57342 and clock<1544000463]
 14157:20181206:090209.434 query without transaction detected
 14157:20181206:090209.434 query [txnlev:0] [delete from history_uint where itemid=57343 and clock<1544000463]
 14157:20181206:090209.436 query without transaction detected
 14157:20181206:090209.436 query [txnlev:0] [delete from history_uint where itemid=57344 and clock<1544000463]
 14157:20181206:090209.486 query without transaction detected
Comment by Sean Nienaber [ 2018 Dec 06 ]

I just noticed this at the end of the housekeeping cycle:

After kicking off the housekeeper again without waiting for the next cycle in an hour, here's the result:

 

14157:20181206:090744.631 query [txnlev:0] [select eventid from events where clock<1516092250 and events.source=0 and events.object=0 and not exists (select null from problem where events.eventid=problem.eventid) and not exists (select null from problem where events.eventid=problem.r_eventid) order by eventid limit 5000]
...
14157:20181206:090744.844 housekeeper [deleted 731569 hist/trends, 0 items/triggers, 536 events, 35955 problems, 0 sessions, 0 alarms, 48 audit items in 401.789149 sec, idle for 1 hour(s)]
14157:20181206:090744.844 __zbx_zbx_setproctitle() title:'housekeeper [deleted 731569 hist/trends, 0 items/triggers, 536 events, 0 sessions, 0 alarms, 48 audit items in 401.789149 sec, idle for 1 hour(s)]'

Looking at the query, it limits to 5000; does this mean that a maximum of 5000 events will be deleted in each housekeeping cycle?

 

We have 24m event records older than the override setting of 90 days; could this be why they aren't being cleared out? Here's the tail end of the log for your reference: housekeeper.txt

Though I note only 536 events were deleted.  Why aren't far more events being deleted?

Comment by Vladislavs Sokurenko [ 2018 Dec 06 ]

I am sorry, please also run the following; this would give a clearer picture of the item and host that for some reason don't get their history deleted:

select * from hosts where hostid=10197\G
select count(*),min(clock),max(clock) from history_uint where itemid=27728\G
Comment by Sean Nienaber [ 2018 Dec 06 ]

 

mysql> select * from hosts where hostid=10197\G;
*************************** 1. row ***************************
 hostid: 10197
 proxy_hostid: 11241
 host: KRTHAYDC01
 status: 0
 disable_until: 0
 error:
 available: 1
 errors_from: 0
 lastaccess: 0
 ipmi_authtype: 0
 ipmi_privilege: 2
 ipmi_username:
 ipmi_password:
ipmi_disable_until: 0
 ipmi_available: 0
snmp_disable_until: 0
 snmp_available: 0
 maintenanceid: NULL
maintenance_status: 0
 maintenance_type: 0
 maintenance_from: 0
 ipmi_errors_from: 0
 snmp_errors_from: 0
 ipmi_error:
 snmp_error:
 jmx_disable_until: 0
 jmx_available: 0
 jmx_errors_from: 0
 jmx_error:
 name: KRTHAYDC01
 flags: 0
 templateid: NULL
 description:
 tls_connect: 1
 tls_accept: 1
 tls_issuer:
 tls_subject:
 tls_psk_identity:
 tls_psk:
1 row in set (0.00 sec)
mysql> select count( * ),min(clock),max(clock) from history_uint where itemid=27728\G;
*************************** 1. row ***************************
count( * ): 2
min(clock): 1412789168
max(clock): 1412810168
1 row in set (0.00 sec)

 

Comment by Vladislavs Sokurenko [ 2018 Dec 06 ]

I am a little bit confused because there are only 2 values; this could mean that the housekeeper cleans them up every hour but for some reason they keep arriving from the agent with such an old timestamp.
Is that possible?

Comment by Sean Nienaber [ 2018 Dec 07 ]

Hi Vladislavs, what shall I check/do next to troubleshoot this?

Comment by Vladislavs Sokurenko [ 2018 Dec 07 ]

You could try the query below to determine the proxy and then check if there is anything wrong with the time on that proxy.

select * from hosts where hostid=11241\G;
Comment by Sean Nienaber [ 2018 Dec 07 ]

Looks fine to me:

 

 

mysql> select from_unixtime(lastaccess) as 'DateTime' from hosts where hostid=11241\G;
*************************** 1. row ***************************
DateTime: 2018-12-07 11:27:52
1 row in set (0.00 sec)
 
mysql> select * from hosts where hostid=11241\G;
*************************** 1. row ***************************
 hostid: 11241
 proxy_hostid: NULL
 host: KRT-PROXY01
 status: 5
 disable_until: 0
 error:
 available: 0
 errors_from: 0
 lastaccess: 1544182084
 ipmi_authtype: 0
 ipmi_privilege: 2
 ipmi_username:
 ipmi_password:
ipmi_disable_until: 0
 ipmi_available: 0
snmp_disable_until: 0
 snmp_available: 0
 maintenanceid: NULL
maintenance_status: 0
 maintenance_type: 0
 maintenance_from: 0
 ipmi_errors_from: 0
 snmp_errors_from: 0
 ipmi_error:
 snmp_error:
 jmx_disable_until: 0
 jmx_available: 0
 jmx_errors_from: 0
 jmx_error:
 name:
 flags: 0
 templateid: NULL
 description:
 tls_connect: 1
 tls_accept: 1
 tls_issuer:
 tls_subject:
 tls_psk_identity:
 tls_psk:
1 row in set (0.00 sec)
mysql> select now();
+---------------------+
| now() |
+---------------------+
| 2018-12-07 11:28:23 |
+---------------------+
1 row in set (0.00 sec)
mysql>

 

Comment by Vladislavs Sokurenko [ 2018 Dec 07 ]

I mean with a fuzzytime item, as an example.

Comment by Sean Nienaber [ 2018 Dec 07 ]

I don't understand, please explain.

Comment by Sean Nienaber [ 2018 Dec 10 ]

Vladislavs, can you explain what you mean by fuzzytime in the context of the query?

Comment by Vladislavs Sokurenko [ 2018 Dec 10 ]

I mean the fuzzytime Zabbix item, or system.localtime. Do you have access to that proxy to see if it has the correct time?

Comment by Sean Nienaber [ 2018 Dec 11 ]

Right, I understand.

The time on KRTHAYPROXY01 is about 20s out: 

Name/Key Last Check Last Value
Host local time 2018-12-11 08:28:58 2018-12-11 08:28:36
Comment by Vladislavs Sokurenko [ 2018 Dec 11 ]

I believe it should have been this proxy:
KRT-PROXY01

It is strange that there are only 2 values, and with such old timestamps, in the database for the item with itemid 27728. I think they were received with an incorrect timestamp; can you confirm this?

Comment by Sean Nienaber [ 2018 Dec 11 ]

Hmmm, looks OK to me:

select count(*) from history where itemid = 27728;
301

 

 

select from_unixtime(min(clock)), from_unixtime(max(clock)) from history where itemid = 27728;
"2018-12-10 07:52:26" "2018-12-11 08:52:29"

 

Comment by Vladislavs Sokurenko [ 2018 Dec 11 ]

Previously it was not fine, here is what previously was reported:

mysql> select count( * ),min(clock),max(clock) from history_uint where itemid=27728\G;
*************************** 1. row ***************************
count( * ): 2
min(clock): 1412789168
max(clock): 1412810168
1 row in set (0.00 sec)

Which is 10/08/2014 @ 5:26pm (UTC)

Do you still have old history entries in database or issue no longer occurs ?

Comment by Sean Nienaber [ 2018 Dec 11 ]

Apologies, I was checking history, opposed to history_uint.

The values above do still exist in history_uint.

The proxy KRT-PROXY01 is actually host KRTHAYPROXY01.

Comment by Vladislavs Sokurenko [ 2018 Dec 11 ]

There are only 2 history values for this item that is supposed to be polled every 5 minutes.
Those values are for type history_uint while new values come as float.

It means that at some point of 2014 you have changed item from history_uint to history and zabbix now only clear history, while history_uint contains old data.

I will try to reproduce it locally but issues seems clear now.
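
A sketch of a query to spot such leftovers more generally (assuming value_type 0 = float and 3 = unsigned, as in the items rows shown earlier): it lists items currently stored as float that still have rows in history_uint.

select i.itemid, i.name, i.value_type, from_unixtime(min(h.clock)) as oldest_uint
from items i join history_uint h on h.itemid = i.itemid
where i.value_type = 0
group by i.itemid, i.name, i.value_type
order by oldest_uint asc
limit 20;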

Comment by Sean Nienaber [ 2018 Dec 11 ]

Correct, the table with issues appears to be history_uint and not history.

The events table is also an issue, could these two be related?

Comment by Vladislavs Sokurenko [ 2018 Dec 11 ]

It should not be related to the events table at all; events are only cleared if the corresponding problem is resolved. Is it possible that you have lots of unresolved problems?
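
A quick way to count those (r_eventid in the problem table is set once a problem is resolved):

select count(*) from problem where r_eventid is null;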

Comment by Sean Nienaber [ 2018 Dec 11 ]

We currently have 69 triggers in a problem state.

And we have 24 million events older than the override set in housekeeping.

vso in this case please increase MaxHousekeeperDelete or set it to 0 and see how it goes.

Comment by Sean Nienaber [ 2018 Dec 11 ]

What is a reasonable / safe setting to start out?

vso this question could be addressed to our support; unfortunately I cannot answer it. I would try 10000 though, see how long the housekeeper takes to delete those (watch it with ps -ax | grep housekeeper), then increase accordingly.

Comment by Sean Nienaber [ 2018 Dec 11 ]

I set MaxHousekeeperDelete to 10000 as suggested and manually ran the housekeeper with the following result:

22630:20181211:113640.100 housekeeper [deleted 792312 hist/trends, 17113 items/triggers, 506 events, 33383 problems, 0 sessions, 0 alarms, 48 audit items in 2570.851989 sec, idle for 1 hour(s)]
 22630:20181211:113640.100 __zbx_zbx_setproctitle() title:'housekeeper [deleted 792312 hist/trends, 17113 items/triggers, 506 events, 0 sessions, 0 alarms, 48 audit items in 2570.851989 sec, idle for 1 hour(s)]'

Nope, only 506 events were deleted. I'm not sure what the setting does, as far more than 10,000 values are being deleted: 792,312 history/trends!

Is this setting being ignored?

Comment by Sean Nienaber [ 2018 Dec 17 ]

Hi guys,

I've done another test: I set MaxHousekeeperDelete to a value of 10, and the number of records deleted didn't change, leading me to believe this setting is being ignored. Perhaps another bug?

 

MaxHousekeeperDelete=10000
22630:20181217:085739.469 housekeeper [deleted 1080346 hist/trends, 0 items/triggers, 694 events, 41746 problems, 0 sessions, 0 alarms, 48 audit items in 364.681259 sec, idle for 1 hour(s)]

 

 

MaxHousekeeperDelete=10 
20258:20181217:101436.909 housekeeper [deleted 810618 hist/trends, 0 items/triggers, 540 events, 35957 problems, 0 sessions, 0 alarms, 48 audit items in 2121.095324 sec, idle for 1 hour(s)]

 

Comment by Edgars Melveris [ 2018 Dec 17 ]

MaxHousekeeperDelete is only relevant to items that have been removed: for those items all history must be removed, and no more rows than the value in this setting will be removed per task in one cycle. (This is for safety: if you remove a large number of items on a big instance, the housekeeper could otherwise run for a very long time.) It does not affect records which should be deleted because of history settings.
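
If you want to see what that limit actually applies to, the pending per-item delete tasks sit in the housekeeper table described in the config comment above, e.g.:

select tablename, field, count(*) as tasks from housekeeper group by tablename, field;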

Comment by Sean Nienaber [ 2018 Dec 17 ]

Understood, thank you.

So we're still not sure why the history and events aren't being deleted?

Comment by Edgars Melveris [ 2018 Dec 17 ]

I would suggest running the housekeeper manually for some time; there is another, hard-coded limit to how much actual history it will delete in one attempt.

Comment by Sean Nienaber [ 2018 Dec 17 ]

We have 24 million events records and 19 million history_uint records which should no longer be there.

Each run it only deleted around 500 events and 800,000 history/trends, so that's not a workable solution.

Do we need to get to the bottom of why the data is there and not being deleted, or could we simply delete it directly in MySQL?

Comment by Sean Nienaber [ 2018 Dec 19 ]

Thoughts on if we can simply delete the old data which isn't being cleared out?

Comment by Sean Nienaber [ 2019 Jan 03 ]

Hi all,

Where are we on this issue?

In the meantime, could you provide advice on clearing out the old data so that we can upgrade our Zabbix instance?

 

Thanks,

S

Comment by Vladislavs Sokurenko [ 2019 Jan 03 ]

You had 150 million events, so is this number slowly decreasing?

Comment by Sean Nienaber [ 2019 Jan 04 ]

I ran the following query to show which tables aren't being cleared out by the housekeeper in line with the housekeeper settings:

 

select
 (select count(*) from history where clock < UNIX_TIMESTAMP(now() - interval 3 day)) as 'history count', -- 3 days
 (select count(*) from history_uint where clock < UNIX_TIMESTAMP(now() - interval 3 day)) as 'history_uint count', -- 3 days
 (select count(*) from history_text where clock < UNIX_TIMESTAMP(now() - interval 3 day)) as 'history_text count', -- 3 days
 (select count(*) from events where clock < UNIX_TIMESTAMP(now() - interval 4 month)) as 'events count', -- 4 months
 (select count(*) from trends where clock < UNIX_TIMESTAMP(now() - interval 366 day)) as 'trends count', -- 1 year and 1 day
 (select count(*) from trends_uint where clock < UNIX_TIMESTAMP(now() - interval 366 day)) as 'trends_uint count' -- 1 year and 1 day
from dual\G;

 

The results from yesterday:

 

history count: 0
history_uint count: 1376
history_text count: 8833
 events count: 49135864
 trends count: 1470
 trends_uint count: 7611

 

And the results this morning:

 

history count: 0
history_uint count: 1376
history_text count: 8833
 events count: 50031213
 trends count: 1470
 trends_uint count: 7611

 

As you can see, there are 50 million records in the event table which aren't being cleared out.

The total count on the events table is 247 million; is this normal? It seems really high.

The event_recovery table is also massive at 123 million records.

Comment by Vladislavs Sokurenko [ 2019 Jan 04 ]

Are those events open problems ? Only closed problems are cleared.
select count(*) from problem where r_eventid is null;

Comment by Sean Nienaber [ 2019 Jan 04 ]

If I browse to Monitoring > Problems, and leave all configuration blank, there are 63 problems reported, not 247 million:

mysql> select count(*) from problem where r_eventid is null\G;
*************************** 1. row ***************************
count(*): 2947
1 row in set (0.00 sec)
Comment by Vladislavs Sokurenko [ 2019 Jan 04 ]

Does min clock in events slowly increase?
select min(clock) from events;

Also what happened to history tables ? Did they eventually get cleaned ?

Comment by Sean Nienaber [ 2019 Jan 04 ]
mysql> select count(*) from problem where r_eventid is null\G;
*************************** 1. row ***************************
count(*): 2947
1 row in set (0.00 sec)
Comment by Sean Nienaber [ 2019 Jan 04 ]

Here are the last two runs of the housekeeper with the count run before and after; the min clock isn't increasing:

20258:20190104:102713.745 executing housekeeper
20258:20190104:103150.262 housekeeper [deleted 964550 hist/trends, 20 items/triggers, 657 events, 772 problems, 0 sessions, 0 alarms, 48 audit items in 276.516237 sec, idle for 1 hour(s)]

 

mysql> select min(clock) from events;
+------------+
| min(clock) |
+------------+
| 1506511555 |
+------------+
1 row in set (2 min 41.31 sec)

 

20258:20190104:113150.978 executing housekeeper
20258:20190104:113629.978 housekeeper [deleted 963810 hist/trends, 20 items/triggers, 824 events, 2704 problems, 0 sessions, 0 alarms, 48 audit items in 278.998533 sec, idle for 1 hour(s)]

mysql> select min(clock) from events;
+------------+
| min(clock) |
+------------+
| 1506511555 |
+------------+
1 row in set (1 min 44.04 sec)

The history tables are looking far better now. Here are the counts outside the housekeeper settings from yesterday; note this excludes the event_recovery table:

history count: 0
history_uint count: 1376
history_text count: 8833
events count: 50031213
trends count: 1470
trends_uint count: 7611
Comment by Vladislavs Sokurenko [ 2019 Jan 04 ]

Please execute this:
select e.eventid from events e left join problem p on p.eventid=e.eventid where e.clock=1506511555;

Comment by Sean Nienaber [ 2019 Jan 04 ]

Here you go:

mysql> select e.eventid from events e left join problem p on p.eventid=e.eventid where e.clock=1506511555;
+----------+
| eventid |
+----------+
| 35986830 |
+----------+
1 row in set (2 min 3.63 sec)

 

Comment by Vladislavs Sokurenko [ 2019 Jan 04 ]

select * from problem where eventid=35986830;

Comment by Sean Nienaber [ 2019 Jan 04 ]

Here you go:

mysql> select * from problem where eventid=35986830\G;
*************************** 1. row ***************************
eventid: 35986830
source: 0
object: 0
objectid: 47130
clock: 1506511555
ns: 159548872
r_eventid: NULL
r_clock: 0
r_ns: 0
correlationid: NULL
userid: NULL
1 row in set (0.04 sec)
Comment by Vladislavs Sokurenko [ 2019 Jan 04 ]

I see no reason why it wouldn't get deleted on 3.0.24; could you please provide a screenshot of the housekeeper settings?

Comment by Sean Nienaber [ 2019 Jan 04 ]

Note as stated originally, we're running v3.4.14.

 

vadmin@VIAMON01:~$ cat /etc/zabbix/zabbix_server.conf | grep -i house
### Option: HousekeepingFrequency
# How often Zabbix will perform housekeeping procedure (in hours).
# Housekeeping is removing unnecessary information from history, alert, and alarms tables.
# HousekeepingFrequency=1
### Option: MaxHousekeeperDelete
# The table "housekeeper" contains "tasks" for housekeeping procedure in the format:
# [housekeeperid], [tablename], [field], [value].
# No more than 'MaxHousekeeperDelete' rows (corresponding to [tablename], [field], [value])
# will be deleted per one task in one housekeeping cycle.
MaxHousekeeperDelete=10

The MaxHousekeeperDelete value hasn't made any difference at any point.

 

Comment by Vladislavs Sokurenko [ 2019 Jan 04 ]

Did you restart the server after changing MaxHousekeeperDelete? I would check the log and look for "select from events" queries or the "In housekeeping_process_rule() table:" keyword.

From "In housekeeping_process_rule() table" keyword you can see the progress, for example in your log

14157:20181206:090744.631 In housekeeping_process_rule() table:'events' field_name:'eventid' filter:'events.source=0 and events.object=0 and not exists (select null from problem where events.eventid=problem.eventid) and not exists (select null from problem where events.eventid=problem.r_eventid)' min_clock:1516077850 now:1544086863

means that it is currently deleting up to Tue Jan 16 06:44:10 EET 2018, while the time was Thu Dec 6 11:01:03 EET 2018, so you need to wait until it catches up to the present.
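
The min_clock values can be converted in MySQL to follow that progress, for example:

select from_unixtime(1516077850);  -- the min_clock from the log line above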

Comment by Sean Nienaber [ 2019 Jan 04 ]

I've increased debug to 4 and started the housekeeper.

In case it's relevant, I see loads of these:

 

20258:20190104:132441.402 query without transaction detected
 20258:20190104:132441.402 query [txnlev:0] [delete from history_uint where itemid=31747 and clock<1546521849]
 20258:20190104:132441.405 query without transaction detected
 20258:20190104:132441.405 query [txnlev:0] [delete from history_uint where itemid=31748 and clock<1546521849]
 20258:20190104:132441.409 query without transaction detected
 20258:20190104:132441.409 query [txnlev:0] [delete from history_uint where itemid=31749 and clock<1546521849]

And here are the events logs, nowhere near the min clock that actually exists:

20258:20190104:132846.131 __zbx_zbx_setproctitle() title:'housekeeper [removing old events]'
 20258:20190104:132846.131 In housekeeping_process_rule() table:'events' field_name:'eventid' filter:'events.source=0 and events.object=0 and not exists (select null from problem where events.eventid=problem.eventid) and not exists (select null from problem where events.eventid=problem.r_eventid)' min_clock:1525337644 now:1546608249
 20258:20190104:132846.766 End of housekeeping_process_rule():566
 20258:20190104:132846.767 In housekeeping_process_rule() table:'events' field_name:'eventid' filter:'events.source=3 and events.object=0 and not exists (select null from problem where events.eventid=problem.eventid) and not exists (select null from problem where events.eventid=problem.r_eventid)' min_clock:1525341931 now:1546608249
 20258:20190104:132846.922 End of housekeeping_process_rule():61
 20258:20190104:132846.922 In housekeeping_process_rule() table:'events' field_name:'eventid' filter:'events.source=3 and events.object=4 and not exists (select null from problem where events.eventid=problem.eventid) and not exists (select null from problem where events.eventid=problem.r_eventid)' min_clock:1525340440 now:1546608249
 20258:20190104:132846.948 End of housekeeping_process_rule():26
 20258:20190104:132846.949 In housekeeping_process_rule() table:'events' field_name:'eventid' filter:'events.source=3 and events.object=5 and not exists (select null from problem where events.eventid=problem.eventid) and not exists (select null from problem where events.eventid=problem.r_eventid)' min_clock:1537132309 now:1546608249
 20258:20190104:132846.952 End of housekeeping_process_rule():0
 20258:20190104:132846.952 In housekeeping_process_rule() table:'events' field_name:'eventid' filter:'events.source=1 and events.object=1' min_clock:1546000590 now:1546608249
 20258:20190104:132846.953 End of housekeeping_process_rule():0
 20258:20190104:132846.953 In housekeeping_process_rule() table:'events' field_name:'eventid' filter:'events.source=1 and events.object=2' min_clock:1546000590 now:1546608249
 20258:20190104:132846.957 End of housekeeping_process_rule():0
 20258:20190104:132846.957 In housekeeping_process_rule() table:'events' field_name:'eventid' filter:'events.source=2 and events.object=3' min_clock:1530554413 now:1546608249
 20258:20190104:132846.958 End of housekeeping_process_rule():0

End result:

20258:20190104:132847.052 housekeeper [deleted 730382 hist/trends, 20 items/triggers, 653 events, 779 problems, 0 sessions, 0 alarms, 54 audit items in 277.774080 sec, idle for 1 hour(s)]
 20258:20190104:132847.052 __zbx_zbx_setproctitle() title:'housekeeper [deleted 730382 hist/trends, 20 items/triggers, 653 events, 0 sessions, 0 alarms, 54 audit items in 277.774080 sec, idle for 1 hour(s)]'

 

Comment by Vladislavs Sokurenko [ 2019 Jan 04 ]

Now min clock is Thu May 3 11:54:04 EEST 2018 while it was Tue Jan 16 06:44:10 EET 2018 so it's slowly moving forward.

Comment by Sean Nienaber [ 2019 Jan 04 ]

So with 247 million records in the table, at an average of 747 per 1 hour cycle over the past 50 cycles, it's going to take 38 years to clear that down assuming new data doesn't arrive.

Surely this can't be correct?

247,157,348 Records
748 Records per hour deleted
330,477 Hours
13,770 Days
459 Months
38 Years
Comment by Vladislavs Sokurenko [ 2019 Jan 04 ]

It deletes 4 hours' worth of events per hour, so it can take long, but not years; otherwise the database would be overloaded.
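
One way to sanity-check the rate (a sketch, using the min(clock) of 1506511555 reported earlier; the per-rule starting point may differ slightly): count how many events fall into the next 4-hour window, which is roughly what one cycle should remove for that rule.

select count(*) from events where clock >= 1506511555 and clock < 1506511555 + 4*3600;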

Comment by Sean Nienaber [ 2019 Jan 04 ]

I don't think there's an error in my maths: if 4h of events = 747 records, that's what is being deleted. That will take 38 years at the current rate.

Is there any way to speed this up?  Is there a way to run the housekeeper manually but just for the events piece?

Remembering the motivation is to reduce the DB size as far as possible to allow us to upgrade to v4 with minimal downtime.

Comment by Edgars Melveris [ 2019 Jan 04 ]

Actually there is an error, I'll explain:

The housekeeper deletes 4 hours of events at each iteration, but there is no limit on the amount of events here. So, the actual amount of deleted events depends on how many events were generated in a given historic 4-hour window.

As I stated before, you can manually execute housekeeper to run more frequently with command:

zabbix_server -R housekeeper_execute

You should watch how much load this creates on the DB and execute the command as frequently as the DB load allows.

Comment by Sean Nienaber [ 2019 Jan 04 ]

This isn't useful in reducing the DB size. The housekeeper also isn't able to run continually for a period and has to be started manually, which isn't useful either.

There is no way we've had 247 million+ events in the 4.5 years we've run Zabbix; something along the way has gone crazy and massively increased the DB size, particularly the events table, and the housekeeper isn't built to deal with this many records.

What would the impact be of truncating the events and event_recovery tables and starting the Server?

The upgrade process is way too inefficient to deal with tables this size.

 

Comment by Vladislavs Sokurenko [ 2019 Jan 04 ]

I think the most important thing is to find out why there are so many events and whether they keep being generated in large amounts. Your setup is to store events for 90 days; all events older than those 90 days should be cleared within a month, if I am not mistaken. But I don't guarantee that there are not millions more in the last 90 days.
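
One way to narrow that down (a sketch; for trigger events source=0 and object=0, so objectid is then the triggerid, while source=3 is internal events such as items becoming unsupported):

select source, object, objectid, count(*) as cnt
from events
where clock > UNIX_TIMESTAMP(now() - interval 1 day)
group by source, object, objectid
order by cnt desc
limit 20;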

Comment by Sean Nienaber [ 2019 Jan 04 ]

That's a good point. I just checked how many events there are in the last 90 days: 163 million!

mysql> select count(*) from events where clock > UNIX_TIMESTAMP(now() - interval 3 month);
+-----------+
| count(*) |
+-----------+
| 163709331 |
+-----------+
1 row in set (1 min 53.74 sec)

How do we work out what is generating this activity?

Comment by Sean Nienaber [ 2019 Jan 04 ]

Here are the events grouped by day, something is very busy in there:

mysql> select date(from_unixtime(clock)), count(*) from events group by date(from_unixtime(clock));
+----------------------------+----------+
| date(from_unixtime(clock)) | count(*) |
+----------------------------+----------+
| 2017-09-27 | 2 |
| 2017-09-28 | 2 |
| 2017-10-01 | 8 |
| 2017-10-03 | 53 |
| 2017-10-10 | 9 |
| 2017-10-11 | 6 |
| 2017-10-12 | 1 |
| 2017-10-24 | 1 |
| 2017-10-25 | 1 |
| 2017-10-27 | 14 |
| 2017-10-28 | 1 |
| 2017-10-31 | 1 |
| 2017-11-12 | 4 |
| 2017-11-29 | 17 |
| 2017-12-05 | 9 |
| 2017-12-09 | 2 |
| 2017-12-10 | 2 |
| 2017-12-14 | 1 |
| 2017-12-19 | 45 |
| 2017-12-20 | 15 |
| 2017-12-25 | 8 |
| 2017-12-28 | 2 |
| 2017-12-29 | 1 |
| 2018-01-03 | 2 |
| 2018-01-09 | 1 |
| 2018-01-16 | 1 |
| 2018-01-19 | 4 |
| 2018-01-22 | 22 |
| 2018-01-26 | 4 |
| 2018-01-27 | 1 |
| 2018-02-05 | 1 |
| 2018-02-06 | 44 |
| 2018-02-07 | 1 |
| 2018-02-08 | 22 |
| 2018-02-09 | 4 |
| 2018-02-10 | 20 |
| 2018-02-12 | 23 |
| 2018-02-13 | 5 |
| 2018-02-14 | 6 |
| 2018-02-16 | 1 |
| 2018-02-17 | 3 |
| 2018-02-19 | 3 |
| 2018-02-21 | 2 |
| 2018-02-22 | 1 |
| 2018-02-26 | 5 |
| 2018-02-27 | 2 |
| 2018-03-02 | 1 |
| 2018-03-06 | 4 |
| 2018-03-14 | 4 |
| 2018-03-16 | 4 |
| 2018-03-23 | 1 |
| 2018-03-26 | 2 |
| 2018-03-27 | 2 |
| 2018-03-29 | 1 |
| 2018-04-10 | 2 |
| 2018-04-12 | 4 |
| 2018-04-20 | 1 |
| 2018-04-26 | 26 |
| 2018-04-30 | 104 |
| 2018-05-01 | 7 |
| 2018-05-02 | 4 |
| 2018-05-04 | 3752 |
| 2018-05-05 | 2504 |
| 2018-05-06 | 2750 |
| 2018-05-07 | 3357 |
| 2018-05-08 | 4110 |
| 2018-05-09 | 3671 |
| 2018-05-10 | 4467 |
| 2018-05-11 | 3830 |
| 2018-05-12 | 2483 |
| 2018-05-13 | 2492 |
| 2018-05-14 | 5124 |
| 2018-05-15 | 3855 |
| 2018-05-16 | 3618 |
| 2018-05-17 | 3691 |
| 2018-05-18 | 3960 |
| 2018-05-19 | 2810 |
| 2018-05-20 | 2467 |
| 2018-05-21 | 3705 |
| 2018-05-22 | 3880 |
| 2018-05-23 | 3403 |
| 2018-05-24 | 3284 |
| 2018-05-25 | 3338 |
| 2018-05-26 | 2426 |
| 2018-05-27 | 2291 |
| 2018-05-28 | 2453 |
| 2018-05-29 | 3289 |
| 2018-05-30 | 3646 |
| 2018-05-31 | 4192 |
| 2018-06-01 | 3725 |
| 2018-06-02 | 2933 |
| 2018-06-03 | 2454 |
| 2018-06-04 | 3515 |
| 2018-06-05 | 3859 |
| 2018-06-06 | 4838 |
| 2018-06-07 | 4842 |
| 2018-06-08 | 4892 |
| 2018-06-09 | 3485 |
| 2018-06-10 | 2420 |
| 2018-06-11 | 4198 |
| 2018-06-12 | 3737 |
| 2018-06-13 | 3838 |
| 2018-06-14 | 4065 |
| 2018-06-15 | 3648 |
| 2018-06-16 | 2116 |
| 2018-06-17 | 3017 |
| 2018-06-18 | 3519 |
| 2018-06-19 | 3220 |
| 2018-06-20 | 3584 |
| 2018-06-21 | 3884 |
| 2018-06-22 | 4353 |
| 2018-06-23 | 3570 |
| 2018-06-24 | 3093 |
| 2018-06-25 | 4809 |
| 2018-06-26 | 4935 |
| 2018-06-27 | 6659 |
| 2018-06-28 | 5543 |
| 2018-06-29 | 5984 |
| 2018-06-30 | 4900 |
| 2018-07-01 | 3043 |
| 2018-07-02 | 3607 |
| 2018-07-03 | 3758 |
| 2018-07-04 | 3571 |
| 2018-07-05 | 3563 |
| 2018-07-06 | 4297 |
| 2018-07-07 | 3085 |
| 2018-07-08 | 2202 |
| 2018-07-09 | 3192 |
| 2018-07-10 | 3389 |
| 2018-07-11 | 487938 |
| 2018-07-12 | 1153452 |
| 2018-07-13 | 957928 |
| 2018-07-14 | 1001811 |
| 2018-07-15 | 1008287 |
| 2018-07-16 | 1190356 |
| 2018-07-17 | 1082223 |
| 2018-07-18 | 1020283 |
| 2018-07-19 | 1025083 |
| 2018-07-20 | 1036210 |
| 2018-07-21 | 1009226 |
| 2018-07-22 | 1072314 |
| 2018-07-23 | 1128812 |
| 2018-07-24 | 1169995 |
| 2018-07-25 | 1190095 |
| 2018-07-26 | 1041166 |
| 2018-07-27 | 931244 |
| 2018-07-28 | 902702 |
| 2018-07-29 | 880302 |
| 2018-07-30 | 1083716 |
| 2018-07-31 | 1088828 |
| 2018-08-01 | 1001228 |
| 2018-08-02 | 950545 |
| 2018-08-03 | 949987 |
| 2018-08-04 | 937462 |
| 2018-08-05 | 904693 |
| 2018-08-06 | 835854 |
| 2018-08-07 | 838779 |
| 2018-08-08 | 858493 |
| 2018-08-09 | 896455 |
| 2018-08-10 | 833370 |
| 2018-08-11 | 853533 |
| 2018-08-12 | 894298 |
| 2018-08-13 | 925750 |
| 2018-08-14 | 901827 |
| 2018-08-15 | 808533 |
| 2018-08-16 | 711985 |
| 2018-08-17 | 678890 |
| 2018-08-18 | 608645 |
| 2018-08-19 | 613815 |
| 2018-08-20 | 793765 |
| 2018-08-21 | 804952 |
| 2018-08-22 | 836446 |
| 2018-08-23 | 868668 |
| 2018-08-24 | 929520 |
| 2018-08-25 | 874175 |
| 2018-08-26 | 642505 |
| 2018-08-27 | 755713 |
| 2018-08-28 | 789863 |
| 2018-08-29 | 613979 |
| 2018-08-30 | 566618 |
| 2018-08-31 | 736609 |
| 2018-09-01 | 803620 |
| 2018-09-02 | 814272 |
| 2018-09-03 | 1039462 |
| 2018-09-04 | 1029696 |
| 2018-09-05 | 977755 |
| 2018-09-06 | 832228 |
| 2018-09-07 | 987922 |
| 2018-09-08 | 1001128 |
| 2018-09-09 | 1032990 |
| 2018-09-10 | 1128069 |
| 2018-09-11 | 1062687 |
| 2018-09-12 | 1136769 |
| 2018-09-13 | 1144050 |
| 2018-09-14 | 1178691 |
| 2018-09-15 | 1123971 |
| 2018-09-16 | 818386 |
| 2018-09-17 | 853357 |
| 2018-09-18 | 874009 |
| 2018-09-19 | 1269912 |
| 2018-09-20 | 1874766 |
| 2018-09-21 | 1284660 |
| 2018-09-22 | 516284 |
| 2018-09-23 | 540359 |
| 2018-09-24 | 561346 |
| 2018-09-25 | 549441 |
| 2018-09-26 | 577190 |
| 2018-09-27 | 842282 |
| 2018-09-28 | 1493939 |
| 2018-09-29 | 1575426 |
| 2018-09-30 | 1564471 |
| 2018-10-01 | 1643681 |
| 2018-10-02 | 1679066 |
| 2018-10-03 | 1591423 |
| 2018-10-04 | 1622598 |
| 2018-10-05 | 1558648 |
| 2018-10-06 | 1638859 |
| 2018-10-07 | 1578255 |
| 2018-10-08 | 1534373 |
| 2018-10-09 | 1642342 |
| 2018-10-10 | 1597501 |
| 2018-10-11 | 1540224 |
| 2018-10-12 | 1557042 |
| 2018-10-13 | 1631430 |
| 2018-10-14 | 1508332 |
| 2018-10-15 | 1435330 |
| 2018-10-16 | 1556881 |
| 2018-10-17 | 1676588 |
| 2018-10-18 | 1709027 |
| 2018-10-19 | 1739604 |
| 2018-10-20 | 1667117 |
| 2018-10-21 | 1731081 |
| 2018-10-22 | 1765196 |
| 2018-10-23 | 1751049 |
| 2018-10-24 | 1831986 |
| 2018-10-25 | 1571768 |
| 2018-10-26 | 1723642 |
| 2018-10-27 | 1707701 |
| 2018-10-28 | 1893483 |
| 2018-10-29 | 1749670 |
| 2018-10-30 | 1773824 |
| 2018-10-31 | 1842670 |
| 2018-11-01 | 1775681 |
| 2018-11-02 | 1833726 |
| 2018-11-03 | 1915773 |
| 2018-11-04 | 1938331 |
| 2018-11-05 | 1930657 |
| 2018-11-06 | 2040900 |
| 2018-11-07 | 1994736 |
| 2018-11-08 | 2013475 |
| 2018-11-09 | 2112113 |
| 2018-11-10 | 2060544 |
| 2018-11-11 | 2012455 |
| 2018-11-12 | 2033378 |
| 2018-11-13 | 2016064 |
| 2018-11-14 | 1919199 |
| 2018-11-15 | 2022221 |
| 2018-11-16 | 1892420 |
| 2018-11-17 | 1621454 |
| 2018-11-18 | 1560197 |
| 2018-11-19 | 1602718 |
| 2018-11-20 | 1850977 |
| 2018-11-21 | 2136276 |
| 2018-11-22 | 2171698 |
| 2018-11-23 | 2177409 |
| 2018-11-24 | 2146121 |
| 2018-11-25 | 2199483 |
| 2018-11-26 | 2148626 |
| 2018-11-27 | 2125678 |
| 2018-11-28 | 2101219 |
| 2018-11-29 | 2106051 |
| 2018-11-30 | 1969603 |
| 2018-12-01 | 2130754 |
| 2018-12-02 | 2128221 |
| 2018-12-03 | 2084312 |
| 2018-12-04 | 2012997 |
| 2018-12-05 | 2102121 |
| 2018-12-06 | 2032193 |
| 2018-12-07 | 2129107 |
| 2018-12-08 | 2031661 |
| 2018-12-09 | 2050246 |
| 2018-12-10 | 2116187 |
| 2018-12-11 | 2173964 |
| 2018-12-12 | 2169501 |
| 2018-12-13 | 2169832 |
| 2018-12-14 | 2154175 |
| 2018-12-15 | 2065026 |
| 2018-12-16 | 2046611 |
| 2018-12-17 | 2217607 |
| 2018-12-18 | 2205513 |
| 2018-12-19 | 2171430 |
| 2018-12-20 | 2285761 |
| 2018-12-21 | 2328550 |
| 2018-12-22 | 2261486 |
| 2018-12-23 | 2316720 |
| 2018-12-24 | 2294054 |
| 2018-12-25 | 2323742 |
| 2018-12-26 | 2382900 |
| 2018-12-27 | 1621009 |
| 2018-12-28 | 132751 |
| 2018-12-29 | 132703 |
| 2018-12-30 | 104556 |
| 2018-12-31 | 86428 |
| 2019-01-01 | 121000 |
| 2019-01-02 | 102241 |
| 2019-01-03 | 110730 |
| 2019-01-04 | 68341 |
+----------------------------+----------+
307 rows in set (3 min 59.60 sec)
Comment by Sean Nienaber [ 2019 Jan 07 ]

Hi all,

I've just run the query above again and confirm the events are slowly being cleared:

 

| 2018-04-30 | 104 |
| 2018-05-01 | 7 |
| 2018-05-02 | 4 |
| 2018-05-04 | 7 |
| 2018-05-09 | 4 |
| 2018-05-10 | 3 |
| 2018-05-11 | 2 |
| 2018-05-14 | 2513 |
| 2018-05-15 | 3855 |
| 2018-05-16 | 3618 |
| 2018-05-17 | 3691 |
| 2018-05-18 | 3960 |
| 2018-05-19 | 2810 |
| 2018-05-20 | 2467 |

So I think that housekeeping is working as expected.

I'm thinking about disabling the housekeeper in Zabbix and then running it from a cron job every 1 min to speed up the clear-out. Could I have your thoughts on this approach, please? This would start the housekeeper again every minute after the previous run completed.

 

Lastly, how could I work out what is creating the massive number of events we're seeing?

Looking at the results above, there is a clear point at which they start, so I guess something changed, either in the config or perhaps in the Zabbix build, as we have been upgrading continually:

| 2018-07-06 | 4296 |
| 2018-07-07 | 3085 |
| 2018-07-08 | 2202 |
| 2018-07-09 | 3192 |
| 2018-07-10 | 3389 |
| 2018-07-11 | 487937 |
| 2018-07-12 | 1153452 |
| 2018-07-13 | 957928 |
| 2018-07-14 | 1001811 |
| 2018-07-15 | 1008287 |
| 2018-07-16 | 1190356 |

Looking at the event types, the source is internal (source = 3) for the vast majority (99.7%):

mysql> select source,count(source) from events group by source;
+--------+---------------+
| source | count(source) |
+--------+---------------+
| 0 | 752360 |
| 1 | 3503 |
| 2 | 65 |
| 3 | 246725235 |
+--------+---------------+
4 rows in set (1 min 59.46 sec)

And for those where the source is internal, the object type is mostly 4 (item, 85%) and 0 (trigger, 15%):

mysql> select object,count(object) from events where source = 3 group by object;
+--------+---------------+
| object | count(object) |
+--------+---------------+
| 0 | 37281667 |
| 4 | 209439771 |
| 5 | 6055 |
+--------+---------------+
3 rows in set (2 min 23.61 sec)

Triggers make sense, I guess, but how do I work out the cause of the item events?
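For example, something along these lines should show which items are generating them (a sketch only, assuming the standard Zabbix schema where events.objectid references items.itemid when object = 4):

-- Top 20 items by number of internal (source = 3) item (object = 4) events.
-- Sketch assuming events.objectid = items.itemid for object = 4.
SELECT e.objectid, i.key_, COUNT(*) AS event_count
FROM events e
JOIN items i ON i.itemid = e.objectid
WHERE e.source = 3
  AND e.object = 4
GROUP BY e.objectid, i.key_
ORDER BY event_count DESC
LIMIT 20;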


Comment by Vladislavs Sokurenko [ 2019 Jan 07 ]

Do you have any warnings in the log? How many unsupported items do you have?

Comment by Sean Nienaber [ 2019 Jan 07 ]

Yes, this sort of thing:

20486:20190107:151354.816 cannot send list of active checks to "172.30.10.219": host [SBNVM05] not monitored
20481:20190107:151401.636 cannot send list of active checks to "172.30.10.203": host [VIASAGEAPP01] not monitored
20452:20190107:151401.921 cannot send list of active checks to "65.52.66.75": host [UPBRIDGE3-ANE] not monitored
20465:20190107:151421.630 cannot send list of active checks to "172.30.10.220": host [SBNVM04] not monitored
20451:20190107:151423.250 cannot send list of active checks to "40.112.72.159": host [UPPRINT3-ANE] not monitored

But nowhere near millions per day.

Comment by Sean Nienaber [ 2019 Jan 07 ]

Hmm, we do have 2,560 unsupported items, and I see we refresh unsupported items every 1 min.

1,440 (minutes per day) x 2,560 (unsupported items) = 3,686,400

That could be it?

I've changed the refresh interval to 15 min now.

Comment by Vladislavs Sokurenko [ 2019 Jan 07 ]

Could you check whether this number grows after one minute? Could you also group by state, please?
And in the end, please set the internal data storage period to one day.

Comment by Sean Nienaber [ 2019 Jan 07 ]

Here's the event count per second for the last half hour:

mysql> select clock,count(clock) from events where clock > 1546873200 group by clock;
+------------+--------------+
| clock | count(clock) |
+------------+--------------+
| 1546873222 | 1 |
| 1546873249 | 1 |
| 1546873260 | 18 |
| 1546873280 | 1 |
| 1546873287 | 1 |
| 1546873324 | 1 |
| 1546873328 | 1 |
| 1546873336 | 1 |
| 1546873344 | 1 |
| 1546873376 | 1 |
| 1546873416 | 1 |
| 1546873456 | 1 |
| 1546873499 | 1 |
| 1546873522 | 1 |
| 1546873583 | 1 |
| 1546873610 | 1 |
| 1546873611 | 1 |
| 1546873627 | 1 |
| 1546873635 | 1 |
| 1546873664 | 1 |
| 1546873676 | 2 |
| 1546873708 | 1 |
| 1546873712 | 1 |
| 1546873716 | 1 |
| 1546873723 | 1 |
| 1546873740 | 8 |
| 1546873741 | 1 |
| 1546873760 | 1 |
| 1546873774 | 1 |
| 1546873799 | 1 |
| 1546873815 | 647 |
| 1546873824 | 1 |
| 1546873850 | 1 |
| 1546873859 | 153 |
| 1546873860 | 502 |
| 1546873871 | 2 |
| 1546873883 | 1 |
| 1546873923 | 1 |
| 1546873928 | 2 |
| 1546873936 | 1 |
| 1546873964 | 1 |
| 1546873975 | 1 |
| 1546874005 | 1 |
| 1546874008 | 1 |
| 1546874012 | 1 |
| 1546874016 | 1 |
| 1546874041 | 1 |
| 1546874042 | 1 |
| 1546874056 | 1 |
| 1546874060 | 1 |
| 1546874074 | 1 |
| 1546874124 | 1 |
| 1546874148 | 1 |
| 1546874172 | 2 |
| 1546874209 | 1 |
| 1546874211 | 1 |
| 1546874228 | 1 |
| 1546874230 | 2 |
| 1546874236 | 1 |
| 1546874278 | 1 |
| 1546874280 | 8 |
| 1546874302 | 1 |
| 1546874316 | 1 |
| 1546874353 | 1 |
| 1546874399 | 1 |
| 1546874424 | 1 |
| 1546874455 | 1 |
| 1546874456 | 1 |
| 1546874460 | 8 |
| 1546874472 | 2 |
| 1546874528 | 1 |
| 1546874529 | 2 |
| 1546874536 | 1 |
| 1546874568 | 1 |
| 1546874578 | 2 |
| 1546874639 | 1 |
| 1546874656 | 1 |
| 1546874699 | 2 |
| 1546874760 | 8 |
| 1546874782 | 1 |
| 1546874784 | 1 |
| 1546874792 | 1 |
| 1546874811 | 1 |
| 1546874832 | 1 |
| 1546874836 | 1 |
+------------+--------------+
85 rows in set (2 min 12.39 sec)

I don't see a state column?

mysql> describe events;
+--------------+---------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------+---------------------+------+-----+---------+-------+
| eventid | bigint(20) unsigned | NO | PRI | NULL | |
| source | int(11) | NO | MUL | 0 | |
| object | int(11) | NO | | 0 | |
| objectid | bigint(20) unsigned | NO | | 0 | |
| clock | int(11) | NO | | 0 | |
| value | int(11) | NO | | 0 | |
| acknowledged | int(11) | NO | | 0 | |
| ns | int(11) | NO | | 0 | |
+--------------+---------------------+------+-----+---------+-------+
8 rows in set (0.01 sec)
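Assuming the state column being referred to is the one on the items table (not events), the grouping would presumably look something like this sketch, where state = 1 marks an unsupported item:

-- Breakdown of items by state in the standard Zabbix schema
-- (0 = normal, 1 = not supported).
SELECT state, COUNT(*) AS item_count
FROM items
GROUP BY state;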

The housekeeper is now running every minute via a cron job and this is helping a lot; we're up to July 2018 already:

| 2018-07-05 | 3562 |
| 2018-07-06 | 4296 |
| 2018-07-07 | 3085 |
| 2018-07-08 | 2202 |
| 2018-07-09 | 3192 |
| 2018-07-10 | 3389 |
| 2018-07-11 | 487937 |
| 2018-07-12 | 1153452 |
| 2018-07-13 | 957928 |
| 2018-07-14 | 1001811 |
| 2018-07-15 | 1008287 |
| 2018-07-16 | 1190356 |
| 2018-07-17 | 1082223 |
Comment by Sean Nienaber [ 2019 Oct 14 ]

Hi all,

Is there anything we can do to push this ahead as it's almost a year in the making? 

With v4.4 now released, I was hoping this would be part of that release; since it isn't, we're going to have to hold back on upgrading.

 

Thanks,

Sean

Comment by Vladislavs Sokurenko [ 2019 Oct 15 ]

Fixed in:

  • pre-4.0.14rc1 a10595f5919
  • pre-4.2.8rc1 382d2b17552
  • pre-4.4.1rc1 701997e66d2
  • pre-5.0.0alpha1 (master) a583bd96d85
Comment by Sean Nienaber [ 2019 Oct 15 ]

Thanks for all the updates. This may be obvious, but when is this change likely to go GA in v4.2 or v4.4?

Comment by Vladislavs Sokurenko [ 2019 Oct 15 ]

It is already merged in the mentioned versions and will be available in the next release; this shouldn't take too long.

Comment by Sean Nienaber [ 2019 Oct 15 ]

Brilliant, thanks.
