[ZBX-6942] Unreliable trigger based on log items Created: 2013 Aug 30 Updated: 2019 Dec 10 |
|
Status: | Reopened |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Server (S) |
Affects Version/s: | 2.0.7 |
Fix Version/s: | None |
Type: | Incident report | Priority: | Trivial |
Reporter: | Marc | Assignee: | Unassigned |
Resolution: | Unresolved | Votes: | 0 |
Labels: | events, logmonitoring, time, triggers | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified | ||
Environment: |
Linux, CentOS-6.4, PostgreSQL 9.1.9, Apache HTTPD-2.2.15, PHP-5.3.3 |
Description |
It took a long time to bring myself to create this report, because I'm not able to provide a process for reproducing this issue. I got several triggers based on log items that seem now and then to ignore item values. The triggers change into PROBLEM state if string "foo: not available" appears in last value. But sometimes they stay in PROBLEM state, even though the item received a value including "foo: available". Trigger expression example: {TRIGGER.VALUE}=0 & {Template:logrt["\{$LOG_FILE}",foo :( not| partially)? available,ISO8859-1,21,skip].regexp("foo : not available")}=1 | {TRIGGER.VALUE}=1 & {Template:logrt["\{$LOG_FILE}",foo :( not| partially)? available,ISO8859-1,21,skip].regexp("foo : available")}#1 Event example (no OK event): 30 Aug 2013 18:10:40 foo is not available. PROBLEM History example (Timestamp/Local time/Value): 2013.Aug.30 18:10:42 2013.Aug.30 18:10:29 bar: available 2013.Aug.30 18:10:42 2013.Aug.30 18:10:29 bar: available 2013.Aug.30 18:10:42 2013.Aug.30 18:10:29 foo: available 2013.Aug.30 18:10:42 2013.Aug.30 18:10:29 bar: available 2013.Aug.30 18:10:41 2013.Aug.30 18:09:55 bar: not available 2013.Aug.30 18:10:40 2013.Aug.30 18:09:55 bar: not available 2013.Aug.30 18:10:40 2013.Aug.30 18:09:49 foo: not available 2013.Aug.30 18:10:40 2013.Aug.30 18:09:49 bar: partially available 2013.Aug.30 18:09:38 2013.Aug.30 18:09:26 bar: not available 2013.Aug.30 18:09:38 2013.Aug.30 18:09:26 bar: not available 2013.Aug.30 18:08:32 2013.Aug.30 18:08:02 bar: available The only common thing I noticed (for this particular example) is that it happened if between corresponding "not available" / "available" were 30 secs but values were processed in very quick succession. |
Comments |
Comment by Marc [ 2013 Aug 30 ] |
Replace: by: glebs.ivanovskis Done. Also fiddled with formatting if you don't mind... |
Comment by Oleksii Zagorskyi [ 2013 Aug 31 ] |
What is an update interval for this item ? Show also please created events for that time frame. Why you don't use brackets to divide the expression to two logical parts ? Also interesting point for these two lines: (Timestamp/ Local time/ Value) 2013.Aug.30 18:10:41 2013.Aug.30 18:09:55 bar: not available 2013.Aug.30 18:10:40 2013.Aug.30 18:09:55 bar: not available We see the Local time (time from the actual log) the same, so supposedly the lines were grabbed by agent in one go and supposedly sent also in single packet (and two lines below mentioned ones as well). Note: you could update issue description yourself instead of the comment above. |
Comment by Marc [ 2013 Aug 31 ] |
Yeah, we briefly discussed the 'Edit' issue before. Do you remember Backt ot the actually issue: The update interval is set to 60 seconds. Since log entries in question came in a bulk within 8 sec, I didn't thought it could be interval related. Aynhow, I believe I played with interval in the past as well - not sure. There is only the PROBLEM event for that time frame - that's the problem. There exist events for the past, but they are days older. The triggers have 'Multiple PROBLEM events generation' not set. I omitted the brackets because I expected them to be not necessary due to operator precedence. Local time is actually not the same. Date formatting just doesn't allow a greater detail. Some additional information about the example in this issue. Edit: Edit: |
Comment by Marc [ 2013 Aug 31 ] |
This expression is another example:
{Template:log["{$LOG_FILE}",FOO BAR,ISO8859-1,1,skip].str(BEGIN)}=1
The corresponding item gets "FOO BAR BEGIN" and "FOO BAR END" values. |
Comment by Marc [ 2013 Nov 06 ] |
Seems not to happen anymore since updating to 2.0.9. |
Comment by richlv [ 2013 Nov 06 ] |
reopen to clear 'fix version' |
Comment by Marc [ 2013 Nov 08 ] |
I've been too early pleased. |
Comment by Exploit Natixis [ 2013 Dec 18 ] |
I suspect I got the same problem but with a zabbix trapper, items with different values are generally generated every 5mn and trigger changes status properly. One time, the delay was 15s between two value and the trigger didn't change its status. Alain |
Comment by Marc [ 2014 Mar 22 ] |
This issue still exists in release 2.2.2 |
Comment by Marc [ 2014 Jul 07 ] |
Apparently didn't happen anymore since updating from 2.2.2 to 2.2.4. |
Comment by Marc [ 2014 Jul 17 ] |
Appears to be fixed. |
Comment by Oleksii Zagorskyi [ 2014 Jul 17 ] |
There were no fixes in zabbix code, issue reopened and closed as cannot reproduce. |
Comment by Marc [ 2014 Jul 17 ] |
Well, there was one that's for sure. But I can't tell which one It's the first time that I can make seriously use of log monitoring functionality in Zabbix and I'm pretty happy about that |
Comment by Marc [ 2015 Mar 28 ] |
Notice this issue again regularly. Edit: zabbix=# SELECT id, clock, ns, timestamp FROM history_log WHERE itemid = 428480 AND clock = 1427506557 ORDER BY timestamp; id | clock | ns | timestamp -----------+------------+-----------+------------ 930030690 | 1427506557 | 114980302 | 1427506503 930030713 | 1427506557 | 114431409 | 1427506520 (2 rows) zabbix=# Judging id and timestamp item values got obviously created in the right chronological order. The ns value looks suspicious though... |
Comment by Oleksii Zagorskyi [ 2015 Mar 29 ] |
Indeed the ns order looks strange. Just FYI: According to What is agent version and agent's OS? Just in case need to remember about |
Comment by Marc [ 2015 Mar 29 ] |
Just took some random samples (out of 60 from last night): 2.0.10, 2.2.0, 2.2.7, 2.4.2 AIX 6100-09-03-1415, CentOS release 5.11 (Final), CentOS release 6.6 (Final), CentOS Linux release 7.0.1406 (Core) Btw, Zabbix server is 2.2.9 and proxies should largely be 2.2.8. Another suspicious detail might be that the issue apparently happens when the both values (BEGIN/END, resp PROBLEM/OK) occur within 30 seconds. But that needs to be double-checked whether it's indeed the case without any exception. Edit: |
Comment by Glebs Ivanovskis (Inactive) [ 2016 May 24 ] |
okkuv9xh, how actively "problematic" agents communicate with server/proxy? There is a chance that log lines A and B (appearing in log in this order) will be sent by agent not in one buffer sending operation, but in two. There is a chance that TCP connection transmitting B will be accepted and processed by server/proxy earlier than connection transmitting A. This way we will get clock&ns(A)>clock&ns(B) and timestamp(A)<timestamp(B) like you pasted above. Is there a still chance to get hold of respective entries from events table? |
Comment by Glebs Ivanovskis (Inactive) [ 2017 Mar 30 ] |
Or the problem can be in proxy_timediff and slightly fluctuating clocks as in |