[ZBX-11426] Events removed by housekeeper can cause trigger to be stuck in problem state Created: 2016 Nov 04  Updated: 2024 Apr 10  Resolved: 2017 Nov 28

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Server (S)
Affects Version/s: 3.2.1
Fix Version/s: 3.2.9rc1, 3.2.11rc1, 3.4.3rc1, 3.4.5rc1, 4.0.0alpha1, 4.0 (plan)

Type: Problem report Priority: Critical
Reporter: Andris Zeila Assignee: Andrea Biscuola (Inactive)
Resolution: Fixed Votes: 6
Labels: events, housekeeper
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Causes
causes ZBX-13140 Potential leak of problem and events Open
causes ZBX-13275 Slow Housekeeping of events Closed
causes ZBX-14312 Proxy->Agent communication drops inte... Closed
causes ZBX-13277 Housekeeper does not delete old event... Closed
Duplicate
Sub-task
depends on ZBX-12758 Postgresql problem table missing inde... Closed
Team: Team A
Sprint: Sprint 14, Sprint 15, Sprint 16, Sprint 17, Sprint 19, Sprint 21, Sprint 22
Story Points: 3.5

 Description   

When the housekeeper removes an open problem event, the trigger value/problem count is not updated. If this was the last open problem event, the trigger will be stuck in the problem state and keep generating recovery events.

To fix the current situation, recovery events must update the trigger value/problem count even if there were no open problems.

To prevent this from happening in the future, the housekeeper must not remove open problem events.
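
The failure mode described above can be illustrated with a toy model. This is only a sketch and not Zabbix server code: the class TriggerModel, its methods and the always_update flag are hypothetical names invented for the illustration. It assumes that a recovery normally resets the trigger only when it finds an open problem to close.

```python
class TriggerModel:
    """Toy model of one trigger and its open problem events (hypothetical)."""

    def __init__(self):
        self.value = "OK"
        self.open_problems = set()  # ids of open problem events
        self._next_eventid = 1

    def problem(self):
        """Register a new problem event and put the trigger in PROBLEM state."""
        eventid = self._next_eventid
        self._next_eventid += 1
        self.open_problems.add(eventid)
        self.value = "PROBLEM"
        return eventid

    def housekeeper_delete(self, eventid):
        """Remove the event without touching the trigger value (the bug)."""
        self.open_problems.discard(eventid)

    def recover(self, always_update):
        """Process a recovery event.

        With always_update=False the trigger is reset only if an open
        problem was found; with always_update=True it is reset regardless.
        """
        had_open = bool(self.open_problems)
        self.open_problems.clear()
        if had_open or always_update:
            self.value = "OK"
        return self.value


# Without the fix: the only open problem was deleted, so recovery changes nothing.
stuck = TriggerModel()
stuck.housekeeper_delete(stuck.problem())
stuck_value = stuck.recover(always_update=False)

# With the fix: recovery updates the trigger even with no open problems left.
fixed = TriggerModel()
fixed.housekeeper_delete(fixed.problem())
fixed_value = fixed.recover(always_update=True)
```

In this model stuck_value stays "PROBLEM" while fixed_value becomes "OK", which mirrors the two halves of the proposal: make recovery robust, and stop the housekeeper from deleting open problem events in the first place.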



 Comments   
Comment by Oleksii Zagorskyi [ 2016 Nov 08 ]

ZBX-11439 is similar/related.

Comment by Aleksandrs Saveljevs [ 2016 Nov 08 ]

ZBX-11412 may be related, too.

Comment by Andris Zeila [ 2016 Nov 10 ]

ZBX-11454 was created to deal with the fallout while this issue will be kept open to fix the housekeeper.

Comment by Alexander Vladishev [ 2017 Aug 10 ]

ZBX-11768 also may be related.

Comment by Andrea Biscuola (Inactive) [ 2017 Sep 20 ]

Fixed in svn://svn.zabbix.com/branches/dev/ZBX-11426

Modified the filters in the housekeeping_events() function to check, through a subquery, whether an event has an associated problem in the problem table. Only events without a corresponding record (open or closed) are removed.
Also reordered the deletion query to make adding the filter easier.
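
A rough sketch of what such a subquery filter could look like, using Python's sqlite3 module and a heavily simplified schema. The table layout, the cutoff parameter and the function name housekeeping_events_sketch are assumptions made for the illustration; the actual query in the server source may differ.

```python
import sqlite3


def housekeeping_events_sketch(db, cutoff):
    """Delete events older than cutoff that have no row in the problem table.

    Returns the number of deleted events.
    """
    cur = db.execute(
        "DELETE FROM events"
        " WHERE clock < ?"
        " AND NOT EXISTS ("
        "SELECT 1 FROM problem WHERE problem.eventid = events.eventid)",
        (cutoff,),
    )
    return cur.rowcount


db = sqlite3.connect(":memory:")
db.executescript("""
    -- Simplified, assumed schema: only the columns needed for the filter.
    CREATE TABLE events (eventid INTEGER PRIMARY KEY, clock INTEGER);
    CREATE TABLE problem (eventid INTEGER PRIMARY KEY, r_eventid INTEGER);
    INSERT INTO events VALUES (1, 100), (2, 100), (3, 900);
    INSERT INTO problem VALUES (2, NULL);  -- event 2 has a problem record
""")

deleted = housekeeping_events_sketch(db, cutoff=500)
remaining = [row[0] for row in db.execute("SELECT eventid FROM events ORDER BY eventid")]
```

Here event 1 is old and has no problem record, so it is deleted. Event 2 is equally old but is kept because a problem row references it, and event 3 is too recent to be considered.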

Comment by Andris Zeila [ 2017 Sep 20 ]

Successfully tested, please review minor changes in r72783

abs Looks good. CLOSED

Comment by Andrea Biscuola (Inactive) [ 2017 Sep 26 ]

Released in:

  • pre-3.2.9rc1 r72945-r72945
  • pre-3.4.3rc1 r72947
  • pre-4.0.0alpha1 (trunk) r72948
Comment by richlv [ 2017 Sep 28 ]

this might be worth documenting in the housekeeper section (and maybe also in the upgrade notes for 3.2.9 and 3.4.3)

Comment by Andrea Biscuola (Inactive) [ 2017 Sep 29 ]

richlv

Maybe a good idea, as now the housekeeper behaviour is explicit regarding how some types of events are kept or deleted. The issue itself was already mitigated in the past through another task and this is just the completion of that work.

Comment by richlv [ 2017 Oct 06 ]

indeed, currently the behaviour seems to be completely undocumented

Comment by Andris Zeila [ 2017 Oct 13 ]

With the event housekeeping period set to 1d (or close to it) there is a danger of recovery events being removed while the recovered events are still in the problem table.

I'm not sure if it's worth adding more complexity to the event deletion queries (although it would be the safest way). I think an acceptable workaround would be to call housekeeping_problems() before the housekeeping_events() function. As the event housekeeping period cannot be less than the problem cleanup period (24h), this would ensure that problems are removed from the problem table before the corresponding events are removed from the events table.

wiper So it was decided to implement a proper fix. To do that we need to add an index on problem.r_eventid and also check r_eventid when removing events.
Still, it's better to swap the housekeeping_problems() and housekeeping_events() calls so that the problem table potentially has fewer records when housekeeping_events() is called.

Comment by Andrea Biscuola (Inactive) [ 2017 Nov 14 ]

Fixed in svn://svn.zabbix.com/branches/dev/ZBX-11426

Swap the calls to housekeeping_problems() and housekeeping_events().
Logically, it's safer to remove old problems first and after that the
related events if necessary.

Also added a filter to the event deletion queries that checks the
problem.r_eventid field.
This ensures that any event associated with a problem record in any
way (be it a problem or a recovery event) is not deleted before the
problem record itself, but only after.
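
The combined behaviour (problems first, then events, with both eventid and r_eventid checked) can be sketched as follows. This is an illustration on a simplified schema using Python's sqlite3 module, not the server implementation. The *_sketch function names, the index name, the single shared cutoff and the rule used to pick problems for deletion (resolved, with r_clock older than the cutoff) are assumptions.

```python
import sqlite3


def housekeeping_problems_sketch(db, cutoff):
    # Assumed rule: drop problems that were resolved before the cutoff.
    db.execute("DELETE FROM problem WHERE r_clock <> 0 AND r_clock < ?", (cutoff,))


def housekeeping_events_sketch(db, cutoff):
    # Keep any event that a problem row references, either as the
    # problem event (eventid) or as the recovery event (r_eventid).
    db.execute(
        "DELETE FROM events WHERE clock < ?"
        " AND NOT EXISTS (SELECT 1 FROM problem WHERE problem.eventid = events.eventid)"
        " AND NOT EXISTS (SELECT 1 FROM problem WHERE problem.r_eventid = events.eventid)",
        (cutoff,),
    )


def run_housekeeper_sketch(db, cutoff):
    # Problems are cleaned up first, so their events become deletable
    # in the same pass.
    housekeeping_problems_sketch(db, cutoff)
    housekeeping_events_sketch(db, cutoff)


def event_ids(db):
    return [row[0] for row in db.execute("SELECT eventid FROM events ORDER BY eventid")]


db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE events (eventid INTEGER PRIMARY KEY, clock INTEGER);
    CREATE TABLE problem (eventid INTEGER PRIMARY KEY, r_eventid INTEGER, r_clock INTEGER);
    CREATE INDEX problem_r_eventid ON problem (r_eventid);  -- hypothetical index name
    INSERT INTO events VALUES (1, 100), (2, 150), (3, 100), (4, 100);
    INSERT INTO problem VALUES (1, 2, 150);   -- resolved: problem event 1, recovery event 2
    INSERT INTO problem VALUES (3, NULL, 0);  -- still open: problem event 3
""")

# Events only: everything tied to a problem row survives, event 4 does not.
housekeeping_events_sketch(db, cutoff=500)
after_events_only = event_ids(db)

# Full pass: the resolved problem goes first, then its two events.
run_housekeeper_sketch(db, cutoff=500)
after_full_pass = event_ids(db)
open_problems = [row[0] for row in db.execute("SELECT eventid FROM problem")]
```

After the events-only step, events 1, 2 and 3 remain because each is still referenced by a problem row. After the full pass only event 3 is left, together with its open problem record.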

Comment by Andris Zeila [ 2017 Nov 16 ]

Successfully tested, please review coding style fixes in r74679

abs style fix ok. CLOSED

Comment by Andrea Biscuola (Inactive) [ 2017 Nov 17 ]

Released in:

  • pre-3.2.11rc1 r74716
  • pre-3.4.5rc1 r74717
  • pre-4.0.0alpha1 (trunk) r74718
Comment by Andrea Biscuola (Inactive) [ 2017 Nov 17 ]

martins-v

The final housekeeper behaviour after this change is that an event
will be deleted ONLY if it is not associated with a problem in any way.
This means that if an event is either a PROBLEM or RECOVERY event,
it will not be deleted until the related problem record is removed.

Also, the housekeeper now deletes problems first and events
afterwards, to avoid potential problems with stale events or problem
records.
