[ZBXNEXT-2355] provide an ability to later understand why alerts were skipped during maintenance, why duplicated event created
Created: 2014 Jun 25  Updated: 2024 Mar 27  Resolved: 2016 Dec 20

Status: Closed
Project: ZABBIX FEATURE REQUESTS
Component/s: Server (S)
Affects Version/s: 2.2.3, 2.3.2
Fix Version/s: None

Type: Change Request
Priority: Critical
Reporter: Oleksii Zagorskyi
Assignee: Unassigned
Resolution: Duplicate
Votes: 7
Labels: maintenance, troubleshooting, usability
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates ZBXNEXT-1150 Event should store and show the maint... Open

 Description   

(1)
Previously (before v2.2.4, see ZBX-8230) it was at least possible to find some indication in zabbix_server.log (in production users usually run with DebugLevel=3) that hosts were in maintenance.
And I know of cases where users do check the log file when they need to verify maintenance activity.

Currently there is no way at all to find out why particular events did not generate alerts.
Also, we know that when the timer takes a host out of maintenance it creates the same event as the last event created during maintenance (this point was requested to be documented in ZBX-8390).
This behavior is implemented in the "generate_events" function (timer.c).

I have a question: suppose that a week after a host's maintenance (note: my memory is limited to 3 days) I investigate the host's events. How am I supposed to understand why zabbix didn't send an alert for a particular problem event?
I have completely forgotten about the maintenance a week ago, or I never knew about it at all because someone else configured it and it was later modified or deleted.

(2)
I also have another puzzling question: why are there two Ok|Problem events in a row, and why don't I see any values in the item's history at the timestamp of the 2nd such event?

To resolve all this and give zabbix users clear information, I suggest implementing the following feature:
for events created during maintenance, the "Actions" column on the Monitoring -> Events page shows something like "Maintenance, skipped", or an icon with a tooltip, which could also include the full maintenance name.

It should help to understand why alerts were missing AND!!! will provide a hint that the next identical Ok|Problem event above it was generated by the timer according to internal zabbix routines ("generate_events" function).
Then it will be much clearer for end users what happened.
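
A minimal sketch of the labelling suggested above, in Python rather than in the server's own code. Everything here is hypothetical: the `Window` and `Event` types, the `annotate` function and the label texts are illustrations, not existing zabbix structures. It also assumes that the event repeated by the timer is simply the first event after a maintenance window that has the same value as the last event inside that window.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Window:
    """Hypothetical record of one maintenance period for a host."""
    name: str
    start: int  # unix time, inclusive
    end: int    # unix time, exclusive


@dataclass
class Event:
    """Hypothetical, simplified trigger event."""
    clock: int  # unix time of the event
    value: str  # "OK" or "PROBLEM"


def find_window(clock: int, windows: List[Window]) -> Optional[Window]:
    """Return the maintenance window covering `clock`, if any."""
    for w in windows:
        if w.start <= clock < w.end:
            return w
    return None


def annotate(events: List[Event], windows: List[Window]) -> List[Tuple[Event, str]]:
    """Pair each event (in chronological order) with an 'Actions' label."""
    result = []
    last_in_maint = None  # (event, window) of the latest in-maintenance event
    for ev in sorted(events, key=lambda e: e.clock):
        w = find_window(ev.clock, windows)
        if w is not None:
            # Event was created while the host was in maintenance.
            result.append((ev, f"Maintenance, skipped ({w.name})"))
            last_in_maint = (ev, w)
            continue
        label = ""
        if last_in_maint is not None and ev.value == last_in_maint[0].value:
            # Assumption: a same-value event right after maintenance is the
            # one repeated by the timer.
            label = f"Repeated by timer after maintenance ({last_in_maint[1].name})"
        last_in_maint = None
        result.append((ev, label))
    return result
```

With one window from 200 to 300 and events PROBLEM at 250, PROBLEM at 300 and OK at 400, the first event gets the "Maintenance, skipped" label, the second is marked as repeated by the timer, and the third gets no label.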

Where it could be stored, I don't know. The "events" table doesn't have suitable columns.
I guess the numbers currently shown in the "Actions" column (if there were alerts) are taken from the "alerts" table.
Maybe we could store something in the "alerts" table and then display it in a special way?



 Comments   
Comment by Oleksii Zagorskyi [ 2015 Dec 08 ]

This issue additionally drives zabbix people crazy in ZBX-9432

Comment by Chris Christensen [ 2015 Dec 08 ]

Agree ^ see the comment thread ~Dec 8th in ZBX-9432 for a case that shows duplicate OK events being generated around maintenance (and also no way to see in/out of maint status in Zabbix - all lookups were done from other system logs calling the maint API. Logging and/or UI improvements would definitely be helpful.)

Comment by Oleksii Zagorskyi [ 2015 Dec 09 ]

A case with 3 OK events (find it in ZBX-9432) should be additionally tested after development.
It looks like the 3rd OK was created after maintenance, based on the 2nd OK event.

Comment by richlv [ 2016 Jan 20 ]

host going in/out of maintenance could be registered in internal events (but there's no way to nicely view those, asked in ZBXNEXT-2170), or maybe in the audit log (although audit log mostly deals with changes in the config, not runtime status like maintenance)

Comment by Oleksii Zagorskyi [ 2016 Jan 20 ]

Rich's idea is very good!
During the discussion of ZBX-10265 with neogan I also shared the same idea.
An additional idea is to not generate the duplicated event after maintenance at all.
Alerts generated after maintenance could be "linked" to the original events created during maintenance.
To do that we should, for example, start the escalation during maintenance but generate the alerts after maintenance, if required.

As for the timestamp of the last host PROBLEM event (on top of the events list): which is more usable for a NOC - the timestamp when a server went down (and didn't come back up) or the timestamp of the maintenance end (duplicated event)?
With neogan we concluded that the first one is more useful for a NOC to see.

Another related discussion ZBXNEXT-2141.

I think it should be widely discussed.

Comment by Oleksii Zagorskyi [ 2016 Jan 27 ]

Another issue where the duplicated event after maintenance is misleading - ZBX-8558

Comment by richlv [ 2016 Apr 15 ]

ZBXNEXT-3196 proposes to pause escalations during maintenance and suggests that extra events wouldn't be generated anymore either

zalex_ua And it was indeed implemented that way. No more duplicated event after maintenance, starting from 3.2

Comment by Oleksii Zagorskyi [ 2016 Nov 09 ]

Another use case - someone wants to prepare a report (a sort of "Triggers top 100" report) based on the "events" table and would want to take maintenance periods (active in the past, with data collection) into account ...
There is no way to do it currently.
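
A rough sketch of what such a report could look like if past maintenance periods were recorded somewhere. That is an assumption: the `windows` input, the `top_triggers` function and the `(triggerid, clock)` pairs are hypothetical, and for simplicity all events are taken to belong to one host.

```python
from collections import Counter
from typing import List, Tuple


def top_triggers(
    events: List[Tuple[int, int]],   # hypothetical (triggerid, clock) pairs
    windows: List[Tuple[int, int]],  # hypothetical (start, end) unix times
    n: int = 100,
) -> List[Tuple[int, int]]:
    """Count events per trigger, ignoring those inside maintenance windows."""
    counts = Counter()
    for triggerid, clock in events:
        # An event is skipped if any window covers its timestamp
        # (start inclusive, end exclusive).
        in_maintenance = any(start <= clock < end for start, end in windows)
        if not in_maintenance:
            counts[triggerid] += 1
    # Most frequent triggers first, at most n of them.
    return counts.most_common(n)
```

The same filter could be inverted to report only the events that happened during maintenance.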

Comment by Oleksii Zagorskyi [ 2016 Dec 20 ]

We found an older similar request - ZBXNEXT-1150
Closing this one as a duplicate.

Generated at Fri Apr 26 23:54:32 EEST 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.