[ZBX-10265] Generating duplicated events after maintenance even when trigger didn't change state during real maintenance period Created: 2016 Jan 14  Updated: 2017 Oct 11  Resolved: 2017 Oct 11

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Documentation (D), Server (S)
Affects Version/s: 2.2.11, 2.4.7, 3.0.0alpha6
Fix Version/s: None

Type: Incident report Priority: Major
Reporter: Andrei Gushchin (Inactive) Assignee: Unassigned
Resolution: Duplicate Votes: 4
Labels: duplicates, events, maintenance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates ZBXNEXT-3196 New option for actions: delay escalat... Closed

 Description   

Steps for reproduce:

  • make trigger in PROBLEM state
  • acknowledge it
  • put host in active maintenance
  • make sure that timer mark host under maintenance
  • put out host from maintenance

You see that was generated new event and trigger become unacknowledged, it is unexpected behavior for setups where acknowledging used as indication that problem is known and somebody work on it.



 Comments   
Comment by Oleksii Zagorskyi [ 2016 Jan 18 ]

It's known and already documented (in ZBX-8390) behavior.
ZBXNEXT-2355, ZBXNEXT-2141 are related.

Comment by Andrei Gushchin (Inactive) [ 2016 Jan 18 ]

Hmm In ZBX-8390 described that if trigger go into PROBLEM during maintenance there will be each one generated event.
But in my case trigger already was in PROBLEM state and I just put host to active maintenance period then get make it out, it also generate additional event, it more looks like bug then not documented thing.

For example if we have 10000+ servers few was down or broke, NOC team is ACK their trigger then create maintenance, and put bunch of hosts(with hostgroup) included that two to maintenance, after awhile get them out from maintenance manually, It will generate new events -> notifications,escalations etc. , at least NOC team should ACK them again.

Comment by Oleksii Zagorskyi [ 2016 Jan 18 ]

Hard to believe ... but I tested and it appeared to be a true !!!

A function mentioned in the ZBX-8390 says:

/******************************************************************************
 *                                                                            *
 * Function: generate_events                                                  *
 *                                                                            *
 * Purpose: generate events for triggers after maintenance period             *
 *          The events will be generated if trigger changed its state during  *
 *          the maintenance                                                   *

But for PROBLEM/OK event duplicated created even if original event was created before host was switched "in maintenance" state.
A key point here is - already configured maintenance, i.e. a when "maintenance from" is much earlier than when a host added to an already configured active maintenance.

And it's a roulette what trigger status was before the already configured active maintenance.
In the ZBX-8390 I probably didn't try to do any test or if I even did it, I created maintenance with real/current start and stop timestamps.

Problem is that we do not store real timestamps when hosts status being changed to maintenance instead we store maximal clock from maintenance ("Active since" from global or "Data", "At (hour:minute)" from periods) in hosts.maintenance_from.

We could use and store timestamp of current time, i.e. now(). It would be much more correct, IMO.
I discussed this idea with sasha a bit, but he was concerned about for example data coming from proxies with delays.
Trying to thinking on his concerns ... I cannot find a good use case that it would be serious problem.
The only edge case, for example, would be if zabbix server was not running before existing configured maintenance start, so a host would be really switched to maintenance only after zabbix server would be again started (and accepted delayed data from proxies).
But that's not so critical because for current implementation there still are delays to "applying" host maintenance status - up to 1 minute to process maintenances and some time to update configuration cache (for "no data collection" mode).
So I'd still consider now() as a more correct value for hosts.maintenance_from.

This issue is critical enough as we are aware of many zabbix users/etc who, using zabbix API, puts many hosts/groups to already existing very "wide time" maintenance.

Even more, this bug may create multiple (more than 2) consecutive same type events, which drives us crazy long time already in ZBX-9432.
I easily reproduced 3+ OK events creations, putting adding/removing a host to/from maintenance several times.

Debug log (2 parts - for PROBLEM and OK events):

 13687:20160118:190738.147 Starting Zabbix Server. Zabbix 2.4.7rc1 (revision 54982).
 13701:20160118:190738.178 server #13 started [timer #1]
...
--FOR DUPLICATED OK:
 13701:20160118:191900.351 In process_maintenance()
 13701:20160118:191900.351 query [txnlev:0] [select m.maintenanceid,m.maintenance_type,m.active_since,tp.timeperiod_type,tp.every,tp.month,tp.dayofweek,tp.day,tp.start_time,tp.period,tp.start_date from maintenances m,maintenances_windows mw,timeperiods tp where m.maintenanceid=mw.maintenanceid and mw.timeperiodid=tp.timeperiodid and m.active_since<=1453137540 and m.active_till>1453137540]
 13701:20160118:191900.352 In process_maintenance_hosts()
 13701:20160118:191900.352 query [txnlev:0] [select h.hostid,h.host,h.maintenanceid,h.maintenance_status,h.maintenance_type,h.maintenance_from from maintenances_hosts mh,hosts h where mh.hostid=h.hostid and h.status=0 and mh.maintenanceid=1]
 13701:20160118:191900.353 query [txnlev:0] [select h.hostid,h.host,h.maintenanceid,h.maintenance_status,h.maintenance_type,h.maintenance_from from maintenances_groups mg,hosts_groups hg,hosts h where mg.groupid=hg.groupid and hg.hostid=h.hostid and h.status=0 and mg.maintenanceid=1]
 13701:20160118:191900.354 End of process_maintenance_hosts()
 13701:20160118:191900.354 In update_maintenance_hosts()
 13701:20160118:191900.354 query [txnlev:1] [begin;]
 13701:20160118:191900.354 query [txnlev:1] [select hostid,host,maintenance_type,maintenance_from from hosts where status=0 and flags<>2 and maintenance_status=1 and not hostid=10124]
 13701:20160118:191900.355 taking host 'it0' out of maintenance
 13701:20160118:191900.355 query [txnlev:1] [update hosts set maintenanceid=null,maintenance_status=0,maintenance_type=0,maintenance_from=0 where hostid=10105]
 13701:20160118:191900.355 query [txnlev:1] [commit;]
 13701:20160118:191900.361 In generate_events()
 13701:20160118:191900.361 query [txnlev:0] [select distinct t.triggerid,t.description,t.expression,t.priority,t.type,t.lastchange,t.value from triggers t,functions f,items i where t.triggerid=f.triggerid and f.itemid=i.itemid and t.status=0 and i.status=0 and i.state=0 and i.hostid=10105]
 13701:20160118:191900.363 In get_trigger_values()
 13701:20160118:191900.363 get_trigger_values() from:'2016.01.01 00:00:00'
 13701:20160118:191900.363 query [txnlev:0] [select value from events where source=0 and object=0 and objectid=14239 and clock<1451599200 order by eventid desc limit 1]
 13701:20160118:191900.364 query [txnlev:0] [select value from events where source=0 and object=0 and objectid=14239 and clock>=1451599200 and value=1 limit 1]
 13701:20160118:191900.364 End of get_trigger_values() before:0 inside:1 after:0
 13701:20160118:191900.364 In process_events() events_num:1
 13701:20160118:191900.364 In DCget_nextid() table:'events' num:1
 13701:20160118:191900.364 End of DCget_nextid() table:'events' [246:246]
 13701:20160118:191900.365 query without transaction detected
 13701:20160118:191900.365 query [txnlev:0] [insert into events (eventid,source,object,objectid,clock,ns,value) values (246,0,0,14239,1453137540,0,0);
]
 13701:20160118:191900.370 In process_actions() events_num:1
 13701:20160118:191900.370 query [txnlev:0] [select actionid,evaltype,formula from actions where status=0 and eventsource=0]
 13701:20160118:191900.371 In check_action_conditions() actionid:15
 13701:20160118:191900.371 query [txnlev:0] [select conditionid,conditiontype,operator,value from conditions where actionid=15 order by conditiontype]
 13701:20160118:191900.371 In check_action_condition() actionid:15 conditionid:33 cond.value:''
 13701:20160118:191900.371 In check_trigger_condition()
 13701:20160118:191900.371 query [txnlev:0] [select count(*) from hosts h,items i,functions f,triggers t where h.hostid=i.hostid and h.maintenance_status=0 and i.itemid=f.itemid and f.triggerid=t.triggerid and t.triggerid=14239]
 13701:20160118:191900.372 End of check_trigger_condition():SUCCEED
 13701:20160118:191900.372 End of check_action_condition():SUCCEED
 13701:20160118:191900.372 End of check_action_conditions():SUCCEED
 13701:20160118:191900.372 In DCget_nextid() table:'escalations' num:1
 13701:20160118:191900.372 End of DCget_nextid() table:'escalations' [6:6]
 13701:20160118:191900.372 query without transaction detected
 13701:20160118:191900.372 query [txnlev:0] [insert into escalations (escalationid,actionid,status,triggerid,itemid,eventid,r_eventid) values (6,15,0,14239,null,246,null);
]
 13701:20160118:191900.377 End of process_actions()
 13701:20160118:191900.377 In DBupdate_itservices()
 13701:20160118:191900.377 In its_flush_updates()
 13701:20160118:191900.377 In its_load_services_by_triggerids()
 13701:20160118:191900.377 query [txnlev:0] [select serviceid,triggerid,status,algorithm from services where triggerid=14239]
 13701:20160118:191900.378 End of its_load_services_by_triggerids()
 13701:20160118:191900.378 End of its_flush_updates():SUCCEED
 13701:20160118:191900.378 End of DBupdate_itservices():SUCCEED
 13701:20160118:191900.378 End of process_events()
 13701:20160118:191900.378 End of generate_events()
 13701:20160118:191900.378 End of update_maintenance_hosts()
 13701:20160118:191900.378 End of process_maintenance()
 13701:20160118:191900.378 timer #1 [processed 1 triggers, 0 events in 0.002521 sec, 1 maintenances in 0.026423 sec, idle 30 sec]

...
--FOR DUPLICATED PROBLEM: 
 13701:20160118:192200.413 In process_maintenance()
 13701:20160118:192200.413 query [txnlev:0] [select m.maintenanceid,m.maintenance_type,m.active_since,tp.timeperiod_type,tp.every,tp.month,tp.dayofweek,tp.day,tp.start_time,tp.period,tp.start_date from maintenances m,maintenances_windows mw,timeperiods tp where m.maintenanceid=mw.maintenanceid and mw.timeperiodid=tp.timeperiodid and m.active_since<=1453137720 and m.active_till>1453137720]
 13701:20160118:192200.414 In process_maintenance_hosts()
 13701:20160118:192200.414 query [txnlev:0] [select h.hostid,h.host,h.maintenanceid,h.maintenance_status,h.maintenance_type,h.maintenance_from from maintenances_hosts mh,hosts h where mh.hostid=h.hostid and h.status=0 and mh.maintenanceid=1]
 13701:20160118:192200.415 query [txnlev:0] [select h.hostid,h.host,h.maintenanceid,h.maintenance_status,h.maintenance_type,h.maintenance_from from maintenances_groups mg,hosts_groups hg,hosts h where mg.groupid=hg.groupid and hg.hostid=h.hostid and h.status=0 and mg.maintenanceid=1]
 13701:20160118:192200.415 End of process_maintenance_hosts()
 13701:20160118:192200.415 In update_maintenance_hosts()
 13701:20160118:192200.415 query [txnlev:1] [begin;]
 13701:20160118:192200.416 query [txnlev:1] [select hostid,host,maintenance_type,maintenance_from from hosts where status=0 and flags<>2 and maintenance_status=1 and not hostid=10124]
 13701:20160118:192200.416 taking host 'it0' out of maintenance
 13701:20160118:192200.416 query [txnlev:1] [update hosts set maintenanceid=null,maintenance_status=0,maintenance_type=0,maintenance_from=0 where hostid=10105]
 13701:20160118:192200.417 query [txnlev:1] [commit;]
 13701:20160118:192200.422 In generate_events()
 13701:20160118:192200.422 query [txnlev:0] [select distinct t.triggerid,t.description,t.expression,t.priority,t.type,t.lastchange,t.value from triggers t,functions f,items i where t.triggerid=f.triggerid and f.itemid=i.itemid and t.status=0 and i.status=0 and i.state=0 and i.hostid=10105]
 13701:20160118:192200.423 In get_trigger_values()
 13701:20160118:192200.423 get_trigger_values() from:'2016.01.01 00:00:00'
 13701:20160118:192200.423 query [txnlev:0] [select value from events where source=0 and object=0 and objectid=14239 and clock<1451599200 order by eventid desc limit 1]
 13701:20160118:192200.424 End of get_trigger_values() before:0 inside:2 after:1
 13701:20160118:192200.424 In process_events() events_num:1
 13701:20160118:192200.424 In DCget_nextid() table:'events' num:1
 13701:20160118:192200.424 End of DCget_nextid() table:'events' [248:248]
 13701:20160118:192200.424 query without transaction detected
 13701:20160118:192200.424 query [txnlev:0] [insert into events (eventid,source,object,objectid,clock,ns,value) values (248,0,0,14239,1453137720,0,1);
]
 13701:20160118:192200.440 In process_actions() events_num:1
 13701:20160118:192200.440 query [txnlev:0] [select actionid,evaltype,formula from actions where status=0 and eventsource=0]
 13701:20160118:192200.441 In check_action_conditions() actionid:15
 13701:20160118:192200.441 query [txnlev:0] [select conditionid,conditiontype,operator,value from conditions where actionid=15 order by conditiontype]
 13701:20160118:192200.441 In check_action_condition() actionid:15 conditionid:33 cond.value:''
 13701:20160118:192200.441 In check_trigger_condition()
 13701:20160118:192200.441 query [txnlev:0] [select count(*) from hosts h,items i,functions f,triggers t where h.hostid=i.hostid and h.maintenance_status=0 and i.itemid=f.itemid and f.triggerid=t.triggerid and t.triggerid=14239]
 13701:20160118:192200.442 End of check_trigger_condition():SUCCEED
 13701:20160118:192200.442 End of check_action_condition():SUCCEED
 13701:20160118:192200.442 End of check_action_conditions():SUCCEED
 13701:20160118:192200.442 In DCget_nextid() table:'escalations' num:1
 13701:20160118:192200.442 End of DCget_nextid() table:'escalations' [8:8]
 13701:20160118:192200.442 query without transaction detected
 13701:20160118:192200.442 query [txnlev:0] [insert into escalations (escalationid,actionid,status,triggerid,itemid,eventid,r_eventid) values (8,15,0,14239,null,248,null);
]
 13701:20160118:192200.454 End of process_actions()
 13701:20160118:192200.454 In DBupdate_itservices()
 13701:20160118:192200.454 In its_flush_updates()
 13701:20160118:192200.454 In its_load_services_by_triggerids()
 13701:20160118:192200.454 query [txnlev:0] [select serviceid,triggerid,status,algorithm from services where triggerid=14239]
 13701:20160118:192200.454 End of its_load_services_by_triggerids()
 13701:20160118:192200.455 End of its_flush_updates():SUCCEED
 13701:20160118:192200.455 End of DBupdate_itservices():SUCCEED
 13701:20160118:192200.455 End of process_events()
 13701:20160118:192200.455 End of generate_events()
 13701:20160118:192200.455 End of update_maintenance_hosts()
 13701:20160118:192200.455 End of process_maintenance()
 13701:20160118:192200.455 timer #1 [processed 1 triggers, 0 events in 0.001975 sec, 1 maintenances in 0.041823 sec, idle 30 sec]

CONFIRMED

Comment by Oleksii Zagorskyi [ 2016 Jan 20 ]

(1) Documentation.
Until this issue is resolved, we need to add more details to current documentation to reflect important points.
Added text: "(according to actual maintenance configuration)" and some other fixes agreed with martins-v, please review and replicate to versions 2.4-2.2

martins-v Thanks, I slightly changed the added text, trying to keep the same meaning, but perhaps more precisely. Please have a look.

zalex_ua very good, thanks! Replicated to 2.4 and 2.2. 2.0 is skipped as the page is not maintained anymore.
CLOSED for now. Could be changed in future, but I believe for major version only.

Comment by Pascal Uhlmann [ 2016 Sep 30 ]

This issue also exists in 2.0.12. I'd say this behaviour should be configurable as a global setting.

Comment by Vladislavs Sokurenko [ 2017 Oct 11 ]

This should have been fixed in ZBXNEXT-3196, please let us know if issue persists, closing as duplicate

Generated at Thu Mar 28 17:28:10 EET 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.