[ZBX-11454] Force OK events to update trigger value even if there are no open problems Created: 2016 Nov 10 Updated: 2017 May 30 Resolved: 2016 Nov 17 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Server (S) |
Affects Version/s: | 3.2.1 |
Fix Version/s: | 3.2.2rc1, 3.4.0alpha1 |
Type: | Incident report | Priority: | Blocker |
Reporter: | Andris Zeila | Assignee: | Unassigned |
Resolution: | Fixed | Votes: | 0 |
Labels: | events, problems | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified |
Issue Links: |
|
Description |
This issue must deal with the situation that occurs due to housekeeper ( There can be situation when trigger is in problem state, but there are no corresponding records in problem table. In this situation no event - recovery event links are generated and trigger value is not updated. 1) when processing normal recovery event (not created by correlation and coming from trigger with correlation_mode none) the trigger value must be updated, even if no event - recovery event links were created |
Comments |
Comment by Andris Zeila [ 2016 Nov 10 ] |
Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBX-11454 |
Comment by Sandis Neilands (Inactive) [ 2016 Nov 11 ] |
(1) [S] A nasty coding error - it compiles just fine. if (0 != events_num) DBbegin(); { flush_events(); update_trigger_changes(&trigger_diff); DBupdate_itservices(&trigger_diff); DCconfig_triggers_apply_changes(&trigger_diff); zbx_save_trigger_changes(&trigger_diff); DBcommit(); clean_events(); } RESOLVED in r63736. wiper Thanks, CLOSED |
Comment by Sandis Neilands (Inactive) [ 2016 Nov 16 ] |
Run trough the trigger scenarios (with and without correlation). Didn't notice anything suspicious. Valgrind: OK. |
Comment by Andris Zeila [ 2016 Nov 16 ] |
Released in:
|
Comment by Oleksii Zagorskyi [ 2016 Nov 24 ] |
A nightly build was applied on a production: # egrep "cannot find open problem|Starting" zabbix_server.log 2655:20161124:140043.608 Starting Zabbix Server. Zabbix 3.2.2rc1 (revision 63865). 4337:20161124:140132.379 cannot find open problem events for triggerid:972701, lastchange:1480003260 4301:20161124:140134.650 cannot find open problem events for triggerid:871032, lastchange:1480003276 4261:20161124:140206.925 cannot find open problem events for triggerid:671777, lastchange:1480003324 4248:20161124:140207.028 cannot find open problem events for triggerid:671906, lastchange:1480003326 4329:20161124:140324.585 cannot find open problem events for triggerid:812532, lastchange:1480003403 4329:20161124:140324.587 cannot find open problem events for triggerid:812533, lastchange:1480003403 4329:20161124:140324.587 cannot find open problem events for triggerid:812921, lastchange:1480003403 4329:20161124:140324.587 cannot find open problem events for triggerid:812922, lastchange:1480003403 4307:20161124:140356.794 cannot find open problem events for triggerid:920221, lastchange:1480003434 4256:20161124:140356.915 cannot find open problem events for triggerid:920222, lastchange:1480003434 4256:20161124:140357.482 cannot find open problem events for triggerid:920537, lastchange:1480003435 4256:20161124:140357.482 cannot find open problem events for triggerid:920538, lastchange:1480003435 4243:20161124:150330.949 cannot find open problem events for triggerid:985384, lastchange:1480007010 4243:20161124:150330.949 cannot find open problem events for triggerid:985398, lastchange:1480007010 4306:20161124:153039.766 cannot find open problem events for triggerid:915009, lastchange:1480008638 4346:20161124:153040.773 cannot find open problem events for triggerid:834416, lastchange:1480008640 4346:20161124:153040.773 cannot find open problem events for triggerid:915021, lastchange:1480008639 4278:20161124:153043.688 cannot find open problem events for triggerid:1026144, lastchange:1480008641 4345:20161124:153044.829 cannot find open problem events for triggerid:834429, lastchange:1480008642 4337:20161124:153045.809 cannot find open problem events for triggerid:915107, lastchange:1480008644 4283:20161124:153046.786 cannot find open problem events for triggerid:915119, lastchange:1480008645 4285:20161124:153047.788 cannot find open problem events for triggerid:834455, lastchange:1480008646 4243:20161124:153100.501 cannot find open problem events for triggerid:778145, lastchange:1480008660 4243:20161124:153100.597 cannot find open problem events for triggerid:874614, lastchange:1480008660 4243:20161124:153100.598 cannot find open problem events for triggerid:874640, lastchange:1480008660 4243:20161124:153100.598 cannot find open problem events for triggerid:876273, lastchange:1480008660 4243:20161124:153100.598 cannot find open problem events for triggerid:876285, lastchange:1480008660 4243:20161124:153100.598 cannot find open problem events for triggerid:897745, lastchange:1480008660 4243:20161124:153100.598 cannot find open problem events for triggerid:897759, lastchange:1480008660 4243:20161124:153100.598 cannot find open problem events for triggerid:897786, lastchange:1480008660 4243:20161124:153100.598 cannot find open problem events for triggerid:897800, lastchange:1480008660 4243:20161124:153100.598 cannot find open problem events for triggerid:897827, lastchange:1480008660 4243:20161124:153100.598 cannot find open problem events for triggerid:897841, lastchange:1480008660 4243:20161124:153100.598 cannot find open problem events for triggerid:897868, lastchange:1480008660 4243:20161124:153100.598 cannot find open problem events for triggerid:897882, lastchange:1480008660 4243:20161124:153100.598 cannot find open problem events for triggerid:899382, lastchange:1480008660 4243:20161124:153100.598 cannot find open problem events for triggerid:907612, lastchange:1480008660 4243:20161124:153100.598 cannot find open problem events for triggerid:911774, lastchange:1480008660 4243:20161124:153100.598 cannot find open problem events for triggerid:925282, lastchange:1480008660 4243:20161124:153100.598 cannot find open problem events for triggerid:925294, lastchange:1480008660 4243:20161124:153100.598 cannot find open problem events for triggerid:925306, lastchange:1480008660 4243:20161124:153100.598 cannot find open problem events for triggerid:925318, lastchange:1480008660 4243:20161124:153100.598 cannot find open problem events for triggerid:958031, lastchange:1480008660 4243:20161124:153100.598 cannot find open problem events for triggerid:962547, lastchange:1480008660 4243:20161124:153100.598 cannot find open problem events for triggerid:962561, lastchange:1480008660 4243:20161124:153100.598 cannot find open problem events for triggerid:972715, lastchange:1480008660 4243:20161124:153100.598 cannot find open problem events for triggerid:972729, lastchange:1480008660 4243:20161124:153100.696 cannot find open problem events for triggerid:1088007, lastchange:1480008660 4243:20161124:153331.427 cannot find open problem events for triggerid:915009, lastchange:1480008810 4243:20161124:153331.427 cannot find open problem events for triggerid:915021, lastchange:1480008810 4243:20161124:153400.380 cannot find open problem events for triggerid:897745, lastchange:1480008840 4243:20161124:155707.242 cannot find open problem events for triggerid:834403, lastchange:1480010220 # grep "cannot find open problem" zabbix_server.log | cut -d " " -f10 | sort | uniq -dc 2 triggerid:897745, 2 triggerid:915009, 2 triggerid:915021, first and last from the filtering: Housekeeper is configured to keep 180 days of trigger events. trying to investigate it, if possible .... |
Comment by Ram Amar [ 2016 Nov 26 ] |
Hi, --rami |
Comment by Oleksii Zagorskyi [ 2016 Nov 27 ] |
@Ram, you can try "Pre-3.2.2rc1 (stable)" taken here http://www.zabbix.com/developers and compile it to workaround, or eliminate negative effect of this issue. |
Comment by Oleksii Zagorskyi [ 2016 Dec 07 ] |
Will post in this issue because it's probable most suitable place for now. For now I see only one pattern where it's still happening - 100% if those hosts are monitored by zabbix proxy, snmp item, 60 seconds interval triggers with nodata(120) (only!) function. grep "cannot find open problem" /opt/logs/zabbix/zabbix_server.log 4243:20161201:174330.684 cannot find open problem events for triggerid:834429, lastchange: 1480621410 4243:20161206:082330.268 cannot find open problem events for triggerid:834429, lastchange: 1481019810 mysql> select *, from_unixtime(clock) from events where objectid=834429 and clock > unix_timestamp('2016-11-22') order by eventid; +-----------+--------+--------+----------+------------+-------+--------------+-----------+----------------------+ | eventid | source | object | objectid | clock | value | acknowledged | ns | from_unixtime(clock) | +-----------+--------+--------+----------+------------+-------+--------------+-----------+----------------------+ | 209905586 | 0 | 0 | 834429 | 1480621380 | 1 | 0 | 524576258 | 2016-12-01 17:43:00 | | 209905638 | 0 | 0 | 834429 | 1480621374 | 0 | 0 | 319742588 | 2016-12-01 17:42:54 | | 209905712 | 0 | 0 | 834429 | 1480621410 | 0 | 0 | 670604958 | 2016-12-01 17:43:30 | -- | 210154215 | 0 | 0 | 834429 | 1481019780 | 1 | 0 | 802815052 | 2016-12-06 08:23:00 | | 210154272 | 0 | 0 | 834429 | 1481019779 | 0 | 0 | 493398486 | 2016-12-06 08:22:59 | | 210154376 | 0 | 0 | 834429 | 1481019810 | 0 | 0 | 72114657 | 2016-12-06 08:23:30 | (pattern is 100% similar for all warning message cases ~600 times for 2 weeks, 100 different triggers, delay between problem and ok - 0-5 seconds)
mysql> select *, from_unixtime(clock) from problem where objectid=834429 and clock > unix_timestamp('2016-12-01') order by eventid;
+-----------+--------+--------+----------+------------+-----------+-----------+------------+-----------+---------------+--------+----------------------+
| eventid | source | object | objectid | clock | ns | r_eventid | r_clock | r_ns | correlationid | userid | from_unixtime(clock) |
+-----------+--------+--------+----------+------------+-----------+-----------+------------+-----------+---------------+--------+----------------------+
| 210154215 | 0 | 0 | 834429 | 1481019780 | 802815052 | 210154272 | 1481019779 | 493398486 | NULL | 0 | 2016-12-06 08:23:00 |
mysql> SELECT er.* FROM event_recovery er JOIN events e ON er.eventid=e.eventid WHERE e.objectid=834429 AND e.clock > unix_timestamp('2016-12-01') ORDER BY er.eventid;
+-----------+-----------+-----------+---------------+--------+
| eventid | r_eventid | c_eventid | correlationid | userid |
+-----------+-----------+-----------+---------------+--------+
| 210154215 | 210154272 | NULL | NULL | NULL |
(problem recovered 24h ago - already deleted here)
I see it this way: I have much more gathered info but don't want to spam it here all. (the case which was defected before applying current ZBX fix is different - zabbix agent, monitored by server directly, nodata() function. I'm not sure it worth to take it into account. I did my best to try to notice something wrong there, but I could not. Here is debug collected info as is, with notes as is): mysql> select * from problem where objectid = '13678'; <- taken 18.11.2016, now returns empty result +-----------+--------+--------+----------+------------+-----------+-----------+------------+-----------+---------------+--------+ | eventid | source | object | objectid | clock | ns | r_eventid | r_clock | r_ns | correlationid | userid | +-----------+--------+--------+----------+------------+-----------+-----------+------------+-----------+---------------+--------+ | 208812111 | 3 | 0 | 13678 | 1479429660 | 229016680 | 208824752 | 1479429960 | 257035687 | NULL | 0 | 2016-11-17 22:41:00.000000 | 208824751 | 0 | 0 | 13678 | 1479429960 | 257035687 | 208827113 | 1479429974 | 850335001 | NULL | 0 | 2016-11-17 22:46:00.000000 +-----------+--------+--------+----------+------------+-----------+-----------+------------+-----------+---------------+--------+ mysql> select * from event_recovery where eventid in (208812111, 208824751, 208824752, 208827113); +-----------+-----------+-----------+---------------+--------+ | eventid | r_eventid | c_eventid | correlationid | userid | +-----------+-----------+-----------+---------------+--------+ | 208812111 | 208824752 | NULL | NULL | NULL | | 208824751 | 208827113 | NULL | NULL | NULL | +-----------+-----------+-----------+---------------+--------+ mysql> select from_unixtime(e.clock),e.* from events e WHERE objectid = '13678' and clock > unix_timestamp('2016-11-14') and clock < unix_timestamp('2016-11-18') order by eventid ASC LIMIT 20; +------------------------+-----------+--------+--------+----------+------------+-------+--------------+-----------+ | from_unixtime(e.clock) | eventid | source | object | objectid | clock | value | acknowledged | ns | +------------------------+-----------+--------+--------+----------+------------+-------+--------------+-----------+ | 2016-11-14 11:22:00 | 208526994 | 3 | 0 | 13678 | 1479129720 | 1 | 0 | 831832815 | unk | 2016-11-14 11:27:00 | 208539500 | 0 | 0 | 13678 | 1479130020 | 1 | 0 | 873724159 | problem | 2016-11-14 11:27:00 | 208539501 | 3 | 0 | 13678 | 1479130020 | 0 | 0 | 873724159 | +normal | 2016-11-14 11:33:30 | 208564220 | 0 | 0 | 13678 | 1479130410 | 0 | 0 | 399919561 | OK - NOT an issue this time | 2016-11-16 22:07:30 | 208712922 | 3 | 0 | 13678 | 1479341250 | 1 | 0 | 365547597 | unk | 2016-11-16 22:09:31 | 208716253 | 3 | 0 | 13678 | 1479341371 | 0 | 0 | 190077982 | normal | 2016-11-17 01:31:00 | 208754642 | 3 | 0 | 13678 | 1479353460 | 1 | 0 | 803560978 | unk | 2016-11-17 01:31:38 | 208758886 | 3 | 0 | 13678 | 1479353498 | 0 | 0 | 765698080 | normal | 2016-11-17 22:41:00 | 208812111 | 3 | 0 | 13678 | 1479429660 | 1 | 0 | 229016680 | unk | 2016-11-17 22:46:00 | 208824751 | 0 | 0 | 13678 | 1479429960 | 1 | 0 | 257035687 | problem | 2016-11-17 22:46:00 | 208824752 | 3 | 0 | 13678 | 1479429960 | 0 | 0 | 257035687 | +normal | 2016-11-17 22:46:14 | 208827113 | 0 | 0 | 13678 | 1479429974 | 0 | 0 | 850335001 | OK - IS issue. why this time it's so ???? .... | 2016-11-17 22:47:00 | 208829078 | 0 | 0 | 13678 | 1479430020 | 0 | 0 | 335871815 | | 2016-11-17 22:47:30 | 208829577 | 0 | 0 | 13678 | 1479430050 | 0 | 0 | 651707751 | | 2016-11-17 22:48:00 | 208830175 | 0 | 0 | 13678 | 1479430080 | 0 | 0 | 869772675 | | 2016-11-17 22:48:30 | 208830621 | 0 | 0 | 13678 | 1479430110 | 0 | 0 | 509568386 | | 2016-11-17 22:48:57 | 208831014 | 0 | 0 | 13678 | 1479430137 | 0 | 0 | 195234921 | | 2016-11-17 22:49:00 | 208831030 | 0 | 0 | 13678 | 1479430140 | 0 | 0 | 154315250 | | 2016-11-17 22:49:30 | 208831275 | 0 | 0 | 13678 | 1479430170 | 0 | 0 | 477458468 | | 2016-11-17 22:49:49 | 208831610 | 0 | 0 | 13678 | 1479430189 | 0 | 0 | 181722451 | +------------------------+-----------+--------+--------+----------+------------+-------+--------------+-----------+ mysql> select count(*) from events where eventid between 208812000 and 208829000; +----------+ | count(*) | +----------+ | 17001 | +----------+ |
Comment by Oleksii Zagorskyi [ 2016 Dec 07 ] |
Trying to reproduce it on my test lab ... mysql> select *, from_unixtime(clock) from events; +---------+--------+--------+----------+------------+-------+--------------+-----------+----------------------+ | eventid | source | object | objectid | clock | value | acknowledged | ns | from_unixtime(clock) | +---------+--------+--------+----------+------------+-------+--------------+-----------+----------------------+ | 14 | 0 | 0 | 13570 | 1481107740 | 1 | 0 | 752350548 | 2016-12-07 12:49:00 | | 15 | 0 | 0 | 13570 | 1481107696 | 0 | 0 | 675957485 | 2016-12-07 12:48:16 | | 16 | 0 | 0 | 13570 | 1481111040 | 1 | 0 | 914455965 | 2016-12-07 13:44:00 | | 17 | 0 | 0 | 13570 | 1481110996 | 0 | 0 | 43085330 | 2016-12-07 13:43:16 | | 18 | 0 | 0 | 13570 | 1481111220 | 1 | 0 | 940446042 | 2016-12-07 13:47:00 | | 19 | 0 | 0 | 13570 | 1481111176 | 0 | 0 | 65400030 | 2016-12-07 13:46:16 | mysql> select *, from_unixtime(clock) from problem; +---------+--------+--------+----------+------------+-----------+-----------+------------+-----------+---------------+--------+----------------------+ | eventid | source | object | objectid | clock | ns | r_eventid | r_clock | r_ns | correlationid | userid | from_unixtime(clock) | +---------+--------+--------+----------+------------+-----------+-----------+------------+-----------+---------------+--------+----------------------+ | 14 | 0 | 0 | 13570 | 1481107740 | 752350548 | 15 | 1481107696 | 675957485 | NULL | 0 | 2016-12-07 12:49:00 | | 16 | 0 | 0 | 13570 | 1481111040 | 914455965 | 17 | 1481110996 | 43085330 | NULL | 0 | 2016-12-07 13:44:00 | | 18 | 0 | 0 | 13570 | 1481111220 | 940446042 | 19 | 1481111176 | 65400030 | NULL | 0 | 2016-12-07 13:47:00 | so I cannot reproduce it yet. More details about production: on the active proxy pollers spike to 100% every 10 minutes (big snmp devices are monitored there) and these spikes are exactly related to values gathered sometimes not at scheduled(calculated) time. Sometimes the delay is more than 60 seconds so single value is even skipped. Any idea how to troubleshoot this issue further ? |
Comment by Andris Zeila [ 2016 Dec 07 ] |
No, currently I have no ideas. I have been reviewing/testing the code, but cannot see how it could happen (except the obvious case when events are removed). Just to clarify - there are no event correlation used or custom patches (disabling trigger locking in configuration cache) used ? |
Comment by Oleksii Zagorskyi [ 2016 Dec 07 ] |
No, "events correlation" feature is not used at all yet. Basically 3.2 is running as is after upgrade from 3.0. |
Comment by Oleksii Zagorskyi [ 2016 Dec 12 ] |
Update: decreasing syncers to 8, NVPS is 3,5K, syncers are ~6% busy. |
Comment by Andris Zeila [ 2016 Dec 13 ] |
Just a note - it would be better to continue discussion and posting new findings in |
Comment by Oleksii Zagorskyi [ 2017 Feb 02 ] |
|