[ZBX-11454] Force OK events to update trigger value even if there are no open problems Created: 2016 Nov 10  Updated: 2017 May 30  Resolved: 2016 Nov 17

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Server (S)
Affects Version/s: 3.2.1
Fix Version/s: 3.2.2rc1, 3.4.0alpha1

Type: Incident report Priority: Blocker
Reporter: Andris Zeila Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: events, problems
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by ZBX-11439 3.2 server could not close PROBLEM ev... Closed

 Description   

This issue must deal with the situation that occurs due to housekeeper (ZBX-11426) and currently unknown (ZBX-11439) bugs.

There can be situation when trigger is in problem state, but there are no corresponding records in problem table. In this situation no event - recovery event links are generated and trigger value is not updated.

1) when processing normal recovery event (not created by correlation and coming from trigger with correlation_mode none) the trigger value must be updated, even if no event - recovery event links were created
2) when processing OK event a warning must be logged if trigger in problem state doesn't have problem records.



 Comments   
Comment by Andris Zeila [ 2016 Nov 10 ]

Fixed in development branch svn://svn.zabbix.com/branches/dev/ZBX-11454

Comment by Sandis Neilands (Inactive) [ 2016 Nov 11 ]

(1) [S] A nasty coding error - it compiles just fine.

        if (0 != events_num)

                DBbegin();
{
                flush_events();
                update_trigger_changes(&trigger_diff);
                DBupdate_itservices(&trigger_diff);

                DCconfig_triggers_apply_changes(&trigger_diff);
                zbx_save_trigger_changes(&trigger_diff);

                DBcommit();

                clean_events();
        }

RESOLVED in r63736.

wiper Thanks, CLOSED

Comment by Sandis Neilands (Inactive) [ 2016 Nov 16 ]

Run trough the trigger scenarios (with and without correlation). Didn't notice anything suspicious.

Valgrind: OK.
Code review: OK.
Manual testing (MySQL): OK.

Comment by Andris Zeila [ 2016 Nov 16 ]

Released in:

  • pre-3.2.2rc1 r63809
  • pre-3.3.0 63810
Comment by Oleksii Zagorskyi [ 2016 Nov 24 ]

A nightly build was applied on a production:

# egrep "cannot find open problem|Starting" zabbix_server.log
  2655:20161124:140043.608 Starting Zabbix Server. Zabbix 3.2.2rc1 (revision 63865).
  4337:20161124:140132.379 cannot find open problem events for triggerid:972701, lastchange:1480003260
  4301:20161124:140134.650 cannot find open problem events for triggerid:871032, lastchange:1480003276
  4261:20161124:140206.925 cannot find open problem events for triggerid:671777, lastchange:1480003324
  4248:20161124:140207.028 cannot find open problem events for triggerid:671906, lastchange:1480003326
  4329:20161124:140324.585 cannot find open problem events for triggerid:812532, lastchange:1480003403
  4329:20161124:140324.587 cannot find open problem events for triggerid:812533, lastchange:1480003403
  4329:20161124:140324.587 cannot find open problem events for triggerid:812921, lastchange:1480003403
  4329:20161124:140324.587 cannot find open problem events for triggerid:812922, lastchange:1480003403
  4307:20161124:140356.794 cannot find open problem events for triggerid:920221, lastchange:1480003434
  4256:20161124:140356.915 cannot find open problem events for triggerid:920222, lastchange:1480003434
  4256:20161124:140357.482 cannot find open problem events for triggerid:920537, lastchange:1480003435
  4256:20161124:140357.482 cannot find open problem events for triggerid:920538, lastchange:1480003435
  4243:20161124:150330.949 cannot find open problem events for triggerid:985384, lastchange:1480007010
  4243:20161124:150330.949 cannot find open problem events for triggerid:985398, lastchange:1480007010
  4306:20161124:153039.766 cannot find open problem events for triggerid:915009, lastchange:1480008638
  4346:20161124:153040.773 cannot find open problem events for triggerid:834416, lastchange:1480008640
  4346:20161124:153040.773 cannot find open problem events for triggerid:915021, lastchange:1480008639
  4278:20161124:153043.688 cannot find open problem events for triggerid:1026144, lastchange:1480008641
  4345:20161124:153044.829 cannot find open problem events for triggerid:834429, lastchange:1480008642
  4337:20161124:153045.809 cannot find open problem events for triggerid:915107, lastchange:1480008644
  4283:20161124:153046.786 cannot find open problem events for triggerid:915119, lastchange:1480008645
  4285:20161124:153047.788 cannot find open problem events for triggerid:834455, lastchange:1480008646
  4243:20161124:153100.501 cannot find open problem events for triggerid:778145, lastchange:1480008660
  4243:20161124:153100.597 cannot find open problem events for triggerid:874614, lastchange:1480008660
  4243:20161124:153100.598 cannot find open problem events for triggerid:874640, lastchange:1480008660
  4243:20161124:153100.598 cannot find open problem events for triggerid:876273, lastchange:1480008660
  4243:20161124:153100.598 cannot find open problem events for triggerid:876285, lastchange:1480008660
  4243:20161124:153100.598 cannot find open problem events for triggerid:897745, lastchange:1480008660
  4243:20161124:153100.598 cannot find open problem events for triggerid:897759, lastchange:1480008660
  4243:20161124:153100.598 cannot find open problem events for triggerid:897786, lastchange:1480008660
  4243:20161124:153100.598 cannot find open problem events for triggerid:897800, lastchange:1480008660
  4243:20161124:153100.598 cannot find open problem events for triggerid:897827, lastchange:1480008660
  4243:20161124:153100.598 cannot find open problem events for triggerid:897841, lastchange:1480008660
  4243:20161124:153100.598 cannot find open problem events for triggerid:897868, lastchange:1480008660
  4243:20161124:153100.598 cannot find open problem events for triggerid:897882, lastchange:1480008660
  4243:20161124:153100.598 cannot find open problem events for triggerid:899382, lastchange:1480008660
  4243:20161124:153100.598 cannot find open problem events for triggerid:907612, lastchange:1480008660
  4243:20161124:153100.598 cannot find open problem events for triggerid:911774, lastchange:1480008660
  4243:20161124:153100.598 cannot find open problem events for triggerid:925282, lastchange:1480008660
  4243:20161124:153100.598 cannot find open problem events for triggerid:925294, lastchange:1480008660
  4243:20161124:153100.598 cannot find open problem events for triggerid:925306, lastchange:1480008660
  4243:20161124:153100.598 cannot find open problem events for triggerid:925318, lastchange:1480008660
  4243:20161124:153100.598 cannot find open problem events for triggerid:958031, lastchange:1480008660
  4243:20161124:153100.598 cannot find open problem events for triggerid:962547, lastchange:1480008660
  4243:20161124:153100.598 cannot find open problem events for triggerid:962561, lastchange:1480008660
  4243:20161124:153100.598 cannot find open problem events for triggerid:972715, lastchange:1480008660
  4243:20161124:153100.598 cannot find open problem events for triggerid:972729, lastchange:1480008660
  4243:20161124:153100.696 cannot find open problem events for triggerid:1088007, lastchange:1480008660
  4243:20161124:153331.427 cannot find open problem events for triggerid:915009, lastchange:1480008810
  4243:20161124:153331.427 cannot find open problem events for triggerid:915021, lastchange:1480008810
  4243:20161124:153400.380 cannot find open problem events for triggerid:897745, lastchange:1480008840
  4243:20161124:155707.242 cannot find open problem events for triggerid:834403, lastchange:1480010220
  
# grep "cannot find open problem" zabbix_server.log | cut -d " " -f10 | sort | uniq -dc
      2 triggerid:897745,
      2 triggerid:915009,
      2 triggerid:915021,

first and last from the filtering:
1480003260 - Thu, 24 Nov 2016 16:01:00 GMT
1480010220 - Thu, 24 Nov 2016 17:57:00 GMT
i.e. very fresh time stamps during a few hours from now.

Housekeeper is configured to keep 180 days of trigger events.

trying to investigate it, if possible ....

Comment by Ram Amar [ 2016 Nov 26 ]

Hi,
Amazing to see how fast you fixed this bug
Is there a work around, for the people suffering from it in the mean time ?

--rami

Comment by Oleksii Zagorskyi [ 2016 Nov 27 ]

@Ram, you can try "Pre-3.2.2rc1 (stable)" taken here http://www.zabbix.com/developers and compile it to workaround, or eliminate negative effect of this issue.
If your system is affected by the unknown issue, you should start to see those "cannot find open problem events for triggerid" warnings in server log.
If it still happens for you - would be nice if you try to deep into the issue to try to figure out the original reason.

Comment by Oleksii Zagorskyi [ 2016 Dec 07 ]

Will post in this issue because it's probable most suitable place for now.

For now I see only one pattern where it's still happening - 100% if those hosts are monitored by zabbix proxy, snmp item, 60 seconds interval triggers with nodata(120) (only!) function.
Warning messages generated by timer in 90% of cases, rest - by syncers.

grep "cannot find open problem" /opt/logs/zabbix/zabbix_server.log
  4243:20161201:174330.684 cannot find open problem events for triggerid:834429, lastchange: 1480621410
  4243:20161206:082330.268 cannot find open problem events for triggerid:834429, lastchange: 1481019810
  
mysql> select *, from_unixtime(clock) from events where objectid=834429 and clock > unix_timestamp('2016-11-22') order by eventid;
+-----------+--------+--------+----------+------------+-------+--------------+-----------+----------------------+
| eventid   | source | object | objectid | clock      | value | acknowledged | ns        | from_unixtime(clock) |
+-----------+--------+--------+----------+------------+-------+--------------+-----------+----------------------+
| 209905586 |      0 |      0 |   834429 | 1480621380 |     1 |            0 | 524576258 | 2016-12-01 17:43:00  |
| 209905638 |      0 |      0 |   834429 | 1480621374 |     0 |            0 | 319742588 | 2016-12-01 17:42:54  |
| 209905712 |      0 |      0 |   834429 | 1480621410 |     0 |            0 | 670604958 | 2016-12-01 17:43:30  |
--
| 210154215 |      0 |      0 |   834429 | 1481019780 |     1 |            0 | 802815052 | 2016-12-06 08:23:00  |
| 210154272 |      0 |      0 |   834429 | 1481019779 |     0 |            0 | 493398486 | 2016-12-06 08:22:59  |
| 210154376 |      0 |      0 |   834429 | 1481019810 |     0 |            0 |  72114657 | 2016-12-06 08:23:30  |
(pattern is 100% similar for all warning message cases ~600 times for 2 weeks, 100 different triggers, delay between problem and ok - 0-5 seconds)
mysql> select *, from_unixtime(clock) from problem where objectid=834429 and clock > unix_timestamp('2016-12-01') order by eventid;
+-----------+--------+--------+----------+------------+-----------+-----------+------------+-----------+---------------+--------+----------------------+
| eventid   | source | object | objectid | clock      | ns        | r_eventid | r_clock    | r_ns      | correlationid | userid | from_unixtime(clock) |
+-----------+--------+--------+----------+------------+-----------+-----------+------------+-----------+---------------+--------+----------------------+
| 210154215 |      0 |      0 |   834429 | 1481019780 | 802815052 | 210154272 | 1481019779 | 493398486 |          NULL |      0 | 2016-12-06 08:23:00  |
mysql> SELECT er.* FROM event_recovery er JOIN events e ON er.eventid=e.eventid WHERE e.objectid=834429 AND e.clock > unix_timestamp('2016-12-01') ORDER BY er.eventid;
+-----------+-----------+-----------+---------------+--------+
| eventid   | r_eventid | c_eventid | correlationid | userid |
+-----------+-----------+-----------+---------------+--------+
| 210154215 | 210154272 |      NULL |          NULL |   NULL |
(problem recovered 24h ago - already deleted here)

I see it this way:
timer generates a PROBLEM (as it should), a second+ later syncer processes data from proxy (delayed by 1-5 seconds) and closes the PROBLEM but does NOT switch trigger to OK.
After 30 seconds timer sees the trigger still in PROBLEM and tries to close it, but could not find opened problem.

I have much more gathered info but don't want to spam it here all.
Let me know if you need more details.

(the case which was defected before applying current ZBX fix is different - zabbix agent, monitored by server directly, nodata() function. I'm not sure it worth to take it into account. I did my best to try to notice something wrong there, but I could not. Here is debug collected info as is, with notes as is):

mysql> select * from problem where objectid = '13678';  <- taken 18.11.2016, now returns empty result 
+-----------+--------+--------+----------+------------+-----------+-----------+------------+-----------+---------------+--------+
| eventid   | source | object | objectid | clock      | ns        | r_eventid | r_clock    | r_ns      | correlationid | userid |
+-----------+--------+--------+----------+------------+-----------+-----------+------------+-----------+---------------+--------+
| 208812111 |      3 |      0 |    13678 | 1479429660 | 229016680 | 208824752 | 1479429960 | 257035687 |          NULL |      0 |  2016-11-17 22:41:00.000000
| 208824751 |      0 |      0 |    13678 | 1479429960 | 257035687 | 208827113 | 1479429974 | 850335001 |          NULL |      0 |  2016-11-17 22:46:00.000000
+-----------+--------+--------+----------+------------+-----------+-----------+------------+-----------+---------------+--------+

mysql> select * from event_recovery where eventid in (208812111, 208824751, 208824752, 208827113);
+-----------+-----------+-----------+---------------+--------+
| eventid   | r_eventid | c_eventid | correlationid | userid |
+-----------+-----------+-----------+---------------+--------+
| 208812111 | 208824752 |      NULL |          NULL |   NULL |
| 208824751 | 208827113 |      NULL |          NULL |   NULL |
+-----------+-----------+-----------+---------------+--------+


mysql> select from_unixtime(e.clock),e.* from events e WHERE objectid = '13678' and clock > unix_timestamp('2016-11-14') and clock < unix_timestamp('2016-11-18') order by eventid ASC LIMIT 20;
+------------------------+-----------+--------+--------+----------+------------+-------+--------------+-----------+
| from_unixtime(e.clock) | eventid   | source | object | objectid | clock      | value | acknowledged | ns        |
+------------------------+-----------+--------+--------+----------+------------+-------+--------------+-----------+
| 2016-11-14 11:22:00    | 208526994 |      3 |      0 |    13678 | 1479129720 |     1 |            0 | 831832815 | unk
| 2016-11-14 11:27:00    | 208539500 |      0 |      0 |    13678 | 1479130020 |     1 |            0 | 873724159 | problem
| 2016-11-14 11:27:00    | 208539501 |      3 |      0 |    13678 | 1479130020 |     0 |            0 | 873724159 | +normal
| 2016-11-14 11:33:30    | 208564220 |      0 |      0 |    13678 | 1479130410 |     0 |            0 | 399919561 | OK  - NOT an issue this time
| 2016-11-16 22:07:30    | 208712922 |      3 |      0 |    13678 | 1479341250 |     1 |            0 | 365547597 | unk
| 2016-11-16 22:09:31    | 208716253 |      3 |      0 |    13678 | 1479341371 |     0 |            0 | 190077982 | normal
| 2016-11-17 01:31:00    | 208754642 |      3 |      0 |    13678 | 1479353460 |     1 |            0 | 803560978 | unk
| 2016-11-17 01:31:38    | 208758886 |      3 |      0 |    13678 | 1479353498 |     0 |            0 | 765698080 | normal
| 2016-11-17 22:41:00    | 208812111 |      3 |      0 |    13678 | 1479429660 |     1 |            0 | 229016680 | unk
| 2016-11-17 22:46:00    | 208824751 |      0 |      0 |    13678 | 1479429960 |     1 |            0 | 257035687 | problem
| 2016-11-17 22:46:00    | 208824752 |      3 |      0 |    13678 | 1479429960 |     0 |            0 | 257035687 | +normal 
| 2016-11-17 22:46:14    | 208827113 |      0 |      0 |    13678 | 1479429974 |     0 |            0 | 850335001 | OK  - IS issue. why this time it's so ???? ....
| 2016-11-17 22:47:00    | 208829078 |      0 |      0 |    13678 | 1479430020 |     0 |            0 | 335871815 |
| 2016-11-17 22:47:30    | 208829577 |      0 |      0 |    13678 | 1479430050 |     0 |            0 | 651707751 |
| 2016-11-17 22:48:00    | 208830175 |      0 |      0 |    13678 | 1479430080 |     0 |            0 | 869772675 |
| 2016-11-17 22:48:30    | 208830621 |      0 |      0 |    13678 | 1479430110 |     0 |            0 | 509568386 |
| 2016-11-17 22:48:57    | 208831014 |      0 |      0 |    13678 | 1479430137 |     0 |            0 | 195234921 |
| 2016-11-17 22:49:00    | 208831030 |      0 |      0 |    13678 | 1479430140 |     0 |            0 | 154315250 |
| 2016-11-17 22:49:30    | 208831275 |      0 |      0 |    13678 | 1479430170 |     0 |            0 | 477458468 |
| 2016-11-17 22:49:49    | 208831610 |      0 |      0 |    13678 | 1479430189 |     0 |            0 | 181722451 |
+------------------------+-----------+--------+--------+----------+------------+-------+--------------+-----------+

mysql> select count(*) from events where eventid between  208812000 and 208829000;
+----------+
| count(*) |
+----------+
|    17001 |
+----------+
Comment by Oleksii Zagorskyi [ 2016 Dec 07 ]

Trying to reproduce it on my test lab ...
I was able to emulate those events generation order by manual breaking communication between active proxy and server, but trigger was returned to OK by syncer when communication is restored, so timer in next activity did not generate the warning.

mysql> select *, from_unixtime(clock) from events;
+---------+--------+--------+----------+------------+-------+--------------+-----------+----------------------+
| eventid | source | object | objectid | clock      | value | acknowledged | ns        | from_unixtime(clock) |
+---------+--------+--------+----------+------------+-------+--------------+-----------+----------------------+
|      14 |      0 |      0 |    13570 | 1481107740 |     1 |            0 | 752350548 | 2016-12-07 12:49:00  |
|      15 |      0 |      0 |    13570 | 1481107696 |     0 |            0 | 675957485 | 2016-12-07 12:48:16  |
|      16 |      0 |      0 |    13570 | 1481111040 |     1 |            0 | 914455965 | 2016-12-07 13:44:00  |
|      17 |      0 |      0 |    13570 | 1481110996 |     0 |            0 |  43085330 | 2016-12-07 13:43:16  |
|      18 |      0 |      0 |    13570 | 1481111220 |     1 |            0 | 940446042 | 2016-12-07 13:47:00  |
|      19 |      0 |      0 |    13570 | 1481111176 |     0 |            0 |  65400030 | 2016-12-07 13:46:16  |

mysql> select *, from_unixtime(clock) from problem;
+---------+--------+--------+----------+------------+-----------+-----------+------------+-----------+---------------+--------+----------------------+
| eventid | source | object | objectid | clock      | ns        | r_eventid | r_clock    | r_ns      | correlationid | userid | from_unixtime(clock) |
+---------+--------+--------+----------+------------+-----------+-----------+------------+-----------+---------------+--------+----------------------+
|      14 |      0 |      0 |    13570 | 1481107740 | 752350548 |        15 | 1481107696 | 675957485 |          NULL |      0 | 2016-12-07 12:49:00  |
|      16 |      0 |      0 |    13570 | 1481111040 | 914455965 |        17 | 1481110996 |  43085330 |          NULL |      0 | 2016-12-07 13:44:00  |
|      18 |      0 |      0 |    13570 | 1481111220 | 940446042 |        19 | 1481111176 |  65400030 |          NULL |      0 | 2016-12-07 13:47:00  |

so I cannot reproduce it yet.

More details about production: on the active proxy pollers spike to 100% every 10 minutes (big snmp devices are monitored there) and these spikes are exactly related to values gathered sometimes not at scheduled(calculated) time. Sometimes the delay is more than 60 seconds so single value is even skipped.
Data sender spikes to 100% every 30 minutes (but it's not related to config sync).
One important point - the zabbix server has StartDBSyncers=100 (yes - 100!) history syncers started, there is/was some reason for that, but it will be reviewed.

Any idea how to troubleshoot this issue further ?

Comment by Andris Zeila [ 2016 Dec 07 ]

No, currently I have no ideas. I have been reviewing/testing the code, but cannot see how it could happen (except the obvious case when events are removed). Just to clarify - there are no event correlation used or custom patches (disabling trigger locking in configuration cache) used ?

Comment by Oleksii Zagorskyi [ 2016 Dec 07 ]

No, "events correlation" feature is not used at all yet. Basically 3.2 is running as is after upgrade from 3.0.
"correlation" table is empty (I suppose it's correct point to check).
I guess no any specific patches used.

Comment by Oleksii Zagorskyi [ 2016 Dec 12 ]

Update: decreasing syncers to 8, NVPS is 3,5K, syncers are ~6% busy.
The issue is still repeating. In 100% cases - hosts monitored by one proxy, nodata() function used, data came to server with a delay, i.e. eventids and their clock are reversed.
There are a looooot of cases when the data is delayed (pollers on the proxy spike to 100% periodically because of big snmp devices) and those events order happen, but only some such cases cause the warning appearing - 29 ones for 4 days server uptime.

Comment by Andris Zeila [ 2016 Dec 13 ]

Just a note - it would be better to continue discussion and posting new findings in ZBX-11439. This issue was created only to deal with the situation (which partly can happen 'officially' because of housekeeper). If there are event warnings reported in log, but no trigger is left stuck in problem state - then the fix is working fine. Now we have only to discover the possible causes - and ZBX-11439 is left open for that.

Comment by Oleksii Zagorskyi [ 2017 Feb 02 ]

ZBX-11768 has been created to continue discussion of the original issue.

Generated at Thu May 09 02:41:30 EEST 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.