[ZBX-16486] Housekeeper locking database on events cleanup Created: 2019 Aug 09 Updated: 2020 Jun 18 Resolved: 2020 Jun 18 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Server (S) |
Affects Version/s: | 4.2.5 |
Fix Version/s: | None |
Type: | Problem report | Priority: | Trivial |
Reporter: | Daniel | Assignee: | Renats Valiahmetovs (Inactive) |
Resolution: | Workaround proposed | Votes: | 0 |
Labels: | pending | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified | ||
Environment: |
Zabbix Server 4.2.5 |
Description |
Notes on setup: Due to an issue in a trigger expression (percentile, "not enough data"), the events table has grown over time with a large number of internal events. Once housekeeper retention started to clean up these events, approx. 385000 records per hour (= per run in this setup) had to be deleted.
Steps to reproduce: The housekeeper starts automatically; it may also be started using a runtime command. All data is collected through proxies - no direct data collection at all (except for the zabbix server itself).
Result: Zabbix is no longer able to insert data into the database because the housekeeper is locking tables. Some runs are fast and do not cause any issues, but on some cleanups the housekeeper takes very long. This causes Zabbix to stall on data insertion, and as a result Zabbix sends alerts on missed values (= nodata triggers firing). From a user's perspective the whole system stalls and generates huge amounts of alerts (which are resolved again once the housekeeper completes). Here's an example from the log:
70425:20190807:184948.012 housekeeper [deleted 0 hist/trends, 43 items/triggers, 385281 events, 371351 problems, 0 sessions, 0 alarms, 44 audit, 0 records in 276.169684 sec, idle for 1 hour(s)]
70425:20190807:195114.227 housekeeper [deleted 0 hist/trends, 0 items/triggers, 384215 events, 55320 problems, 0 sessions, 0 alarms, 0 audit, 0 records in 85.693726 sec, idle for 1 hour(s)]
70425:20190807:205246.543 housekeeper [deleted 0 hist/trends, 0 items/triggers, 384691 events, 61691 problems, 0 sessions, 0 alarms, 0 audit, 0 records in 91.810570 sec, idle for 1 hour(s)]
70425:20190807:215417.068 housekeeper [deleted 0 hist/trends, 0 items/triggers, 385197 events, 61755 problems, 0 sessions, 0 alarms, 0 audit, 0 records in 90.007166 sec, idle for 1 hour(s)]
70425:20190808:035033.912 housekeeper [deleted 0 hist/trends, 0 items/triggers, 384004 events, 60042 problems, 0 sessions, 0 alarms, 0 audit, 0 records in 17776.345255 sec, idle for 1 hour(s)]
70425:20190808:045743.645 housekeeper [deleted 0 hist/trends, 0 items/triggers, 384523 events, 308902 problems, 0 sessions, 0 alarms, 0 audit, 0 records in 428.900984 sec, idle for 1 hour(s)]
70425:20190808:055915.332 housekeeper [deleted 0 hist/trends, 0 items/triggers, 383333 events, 56282 problems, 0 sessions, 0 alarms, 0 audit, 0 records in 91.171846 sec, idle for 1 hour(s)]
70425:20190808:070045.155 housekeeper [deleted 0 hist/trends, 0 items/triggers, 384757 events, 50566 problems, 0 sessions, 0 alarms, 0 audit, 0 records in 89.316387 sec, idle for 1 hour(s)]
70425:20190808:080226.627 housekeeper [deleted 0 hist/trends, 0 items/triggers, 386847 events, 50104 problems, 0 sessions, 0 alarms, 0 audit, 0 records in 100.970520 sec, idle for 1 hour(s)]
70425:20190808:082323.115 housekeeper [deleted 0 hist/trends, 0 items/triggers, 386676 events, 19420 problems, 0 sessions, 0 alarms, 0 audit, 0 records in 81.158017 sec, idle for 1 hour(s)]
The run on 2019/08/08 03:50:33 took 17776 seconds, which is by far longer than the others.
While this issue occurs there are no other tasks running on the system or environment (like backups) - the machine is dedicated to the Zabbix server. Checking the MySQL server, I've been able to extract one of the queries that block the server while the issue occurs.
Expected: The housekeeper should be less intrusive. Splitting the events query into smaller chunks may help to prevent a single delete operation from locking the database for too long. |
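The chunked approach suggested above can be sketched as follows. This is a minimal illustration using SQLite in place of the MySQL server from this setup, with a simplified, hypothetical `events` table (only `eventid` and `clock`): a loop of small limited deletes, each committed separately, keeps every transaction - and therefore every lock - short.

```python
import sqlite3

BATCH = 1000  # rows per transaction; small batches keep lock time short


def delete_in_batches(conn, cutoff_clock):
    """Delete old events one batch at a time instead of one huge DELETE.

    Each iteration is its own short transaction, so concurrent inserts
    are only blocked briefly between batches.
    """
    total = 0
    while True:
        # Stock SQLite has no DELETE ... LIMIT, so select the ids first;
        # on MySQL a plain "DELETE ... WHERE clock < ? LIMIT ?" would do.
        cur = conn.execute(
            "DELETE FROM events WHERE eventid IN ("
            "  SELECT eventid FROM events WHERE clock < ? LIMIT ?)",
            (cutoff_clock, BATCH),
        )
        conn.commit()  # release locks before the next batch
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


# Demo on an in-memory database with the simplified events table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (eventid INTEGER PRIMARY KEY, clock INTEGER)")
conn.executemany(
    "INSERT INTO events (clock) VALUES (?)",
    [(t,) for t in range(5000)],
)
conn.commit()
deleted = delete_in_batches(conn, cutoff_clock=3000)  # keep clock >= 3000
```

The table layout and function name are illustrative only; the real events table has more columns, and a production run would also pause between batches.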
Comments |
Comment by Andrei Gushchin (Inactive) [ 2019 Aug 09 ] |
This can probably be divided into chunks with the MaxHousekeeperDelete server parameter. |
Comment by Daniel [ 2019 Aug 09 ] |
Currently: MaxHousekeeperDelete=50000 |
Comment by Andrei Gushchin (Inactive) [ 2019 Aug 09 ] |
The default of 5000 will probably have less impact on the DB. |
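For reference, the parameter discussed here lives in zabbix_server.conf; a sketch of the suggested change (5000 is the value mentioned above, not a universal recommendation):

```
### zabbix_server.conf (excerpt)
# Upper limit of rows deleted per housekeeping task in one cycle;
# smaller values mean shorter individual DELETE statements.
MaxHousekeeperDelete=5000
```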
Comment by Daniel [ 2019 Aug 09 ] |
To clean up manually, we've run similar queries with limits, deleting 150000 records over a longer timeframe (and inserted sleeps to slow things down). Each query then took approx. 20 seconds. While the queries ran, the Zabbix server was able to insert pending data and process its backlog. Using this approach we've cleaned 100+ million records while Zabbix itself kept running fine. |
Comment by Andrei Gushchin (Inactive) [ 2019 Aug 30 ] |
Was that setup created a long time ago (upgraded from an older version)? I see that you have quite a lot of records in the events and problem tables. |
Comment by Daniel [ 2019 Aug 30 ] |
The setup itself has been running since Zabbix 2.x - so quite some time. But that isn't the reason for the table size. We've been able to identify the issue: we had a problem in a trigger expression that raised around 25-28 errors per second (because many instances of this trigger exist). So we had a high ingress rate on the table, and the regular cleanup process couldn't handle it; most of the data was therefore of type internal. We've fixed the expression and cleaned the table manually, so the issue is resolved on our installation. The locking issue technically still exists. |
Comment by Andrei Gushchin (Inactive) [ 2019 Aug 30 ] |
Thank you for the update. |
Comment by Daniel [ 2019 Aug 30 ] |
I'm not sure which queries were blocked on insertion - insertion simply stopped when data from the proxies was received and written to the database. It is plausible that triggers were evaluated during insertion, so Zabbix would have tried to insert events into the events table (as a trigger had an issue in its expression). Perhaps the lock blocked this process, which in turn blocked the regular item insertion process. |
Comment by Renats Valiahmetovs (Inactive) [ 2020 Jun 16 ] |
Hello Daniel,
Since there has been no activity within this report, I will be closing it in 7 days. However, I would like to share my thoughts about this issue: it is obvious that your DB holds a huge amount of events, so what I would suggest is to delete the entries in the Zabbix events table where source is 1, 2 or 3. Please note that this should be done in batches in order not to affect performance. Run the following query on your DB to remove just the discovery, auto-registration and internal events:
delete from events where source=1 or source=2 or source=3 limit 50000;
You can safely remove these entries in batches; this will free up resources for the housekeeper so it can run without slowdowns.
Best Regards,
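The suggestion above amounts to repeating the limited delete until no rows are affected, with a pause between batches (as Daniel did with manual sleeps). A minimal sketch, again using SQLite with a simplified, hypothetical `events` table - on the actual MySQL server the quoted `DELETE ... LIMIT 50000` can be re-run directly until it reports 0 rows affected:

```python
import sqlite3
import time


def purge_sources(conn, batch=50000, pause=0.0):
    """Repeat the source=1/2/3 delete in batches until no rows remain.

    `pause` seconds between batches gives concurrent writers room to
    insert data while the cleanup is running.
    """
    total = 0
    while True:
        # SQLite lacks DELETE ... LIMIT, hence the id subquery.
        cur = conn.execute(
            "DELETE FROM events WHERE eventid IN ("
            "  SELECT eventid FROM events WHERE source IN (1, 2, 3) LIMIT ?)",
            (batch,),
        )
        conn.commit()
        if cur.rowcount == 0:
            return total
        total += cur.rowcount
        time.sleep(pause)


# Demo: 8000 events evenly spread over sources 0..3; only source 0
# (trigger events) should survive the purge.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (eventid INTEGER PRIMARY KEY, source INTEGER)")
conn.executemany(
    "INSERT INTO events (source) VALUES (?)",
    [(i % 4,) for i in range(8000)],
)
conn.commit()
removed = purge_sources(conn, batch=500)
```

The function name and table layout are illustrative; the point is the loop shape - many short transactions instead of one long lock-holding delete.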
|
Comment by Daniel [ 2020 Jun 16 ] |
Thanks - that's exactly what we've done so far. |
Comment by Renats Valiahmetovs (Inactive) [ 2020 Jun 18 ] |
Hi Daniel, Glad to hear that you've managed to get around this issue. I will be closing the report, but feel free to reopen it, if there's anything else. Best Regards, |