[#ZBX-6942] Unreliable trigger based on log items

[ZBX-6942] Unreliable trigger based on log items Created: 2013 Aug 30 Updated: 2019 Dec 10
Status:	Reopened
Project:	ZABBIX BUGS AND ISSUES
Component/s:	Server (S)
Affects Version/s:	2.0.7
Fix Version/s:	None

Type:

Incident report

Priority:

Trivial

Reporter:

Marc

Assignee:

Unassigned

Resolution:

Unresolved

Votes:

Labels:

events, logmonitoring, time, triggers

Remaining Estimate:

Not Specified

Time Spent:

Not Specified

Original Estimate:

Not Specified

Environment:

Linux, CentOS-6.4, PostgreSQL 9.1.9, Apache HTTPD-2.2.15, PHP-5.3.3

Description

It took a long time to bring myself to create this report, because I'm not able to provide a process for reproducing this issue.

I got several triggers based on log items that seem now and then to ignore item values.

The triggers change into PROBLEM state if string "foo: not available" appears in last value.
If in PROBLEM state, the triggers change back to OK state if string "foo: available" appears.

But sometimes they stay in PROBLEM state, even though the item received a value including "foo: available".

Trigger expression example:

{TRIGGER.VALUE}=0 & {Template:logrt["\{$LOG_FILE}",foo :( not| partially)? available,ISO8859-1,21,skip].regexp("foo : not available")}=1 |
{TRIGGER.VALUE}=1 & {Template:logrt["\{$LOG_FILE}",foo :( not| partially)? available,ISO8859-1,21,skip].regexp("foo : available")}#1

Event example (no OK event):

30 Aug 2013 18:10:40    foo is not available. PROBLEM

History example (Timestamp/Local time/Value):

2013.Aug.30 18:10:42    2013.Aug.30 18:10:29    bar: available
2013.Aug.30 18:10:42    2013.Aug.30 18:10:29    bar: available
2013.Aug.30 18:10:42    2013.Aug.30 18:10:29    foo: available
2013.Aug.30 18:10:42    2013.Aug.30 18:10:29    bar: available
2013.Aug.30 18:10:41    2013.Aug.30 18:09:55    bar: not available
2013.Aug.30 18:10:40    2013.Aug.30 18:09:55    bar: not available
2013.Aug.30 18:10:40    2013.Aug.30 18:09:49    foo: not available
2013.Aug.30 18:10:40    2013.Aug.30 18:09:49    bar: partially available
2013.Aug.30 18:09:38    2013.Aug.30 18:09:26    bar: not available
2013.Aug.30 18:09:38    2013.Aug.30 18:09:26    bar: not available
2013.Aug.30 18:08:32    2013.Aug.30 18:08:02    bar: available

The only common thing I noticed (for this particular example) is that it happened if between corresponding "not available" / "available" were 30 secs but values were processed in very quick succession.

Comments

Comment by Marc [ 2013 Aug 30 ]

Replace:
"[...]processed in one history sync."

by:
"[...]processed in very quick succession."

glebs.ivanovskis Done. Also fiddled with formatting if you don't mind...

Comment by Oleksii Zagorskyi [ 2013 Aug 31 ]

What is an update interval for this item ?
Does trigger have Multiple problem generation ?

Show also please created events for that time frame.

Why you don't use brackets to divide the expression to two logical parts ?
As shown here https://www.zabbix.com/documentation/2.2/manual/config/triggers/expression#hysteresis
I'm not sure it has some reason for your case, but it would be better to follow doc recommendations.
Otherwise I'm not sure in which order server should calculate sort of "1&0|0&1" boolean.

Also interesting point for these two lines:

(Timestamp/          Local time/          Value)
2013.Aug.30 18:10:41 2013.Aug.30 18:09:55 bar: not available 
2013.Aug.30 18:10:40 2013.Aug.30 18:09:55 bar: not available

We see the Local time (time from the actual log) the same, so supposedly the lines were grabbed by agent in one go and supposedly sent also in single packet (and two lines below mentioned ones as well).
Why we see different seconds for Timestamp (40,41) ?
As I know zabbix_server's Timestamp assigned to a value when it comes to a history cache (i.e. when value received by a trapper).
It's a bit strange for me.

Note: you could update issue description yourself instead of the comment above.

Comment by Marc [ 2013 Aug 31 ]

Yeah, we briefly discussed the 'Edit' issue before. Do you remember (~~ZBX-6496~~)?
My menue looks like this:
[Comment] [Attach Files] [Attach Screenshot] [More Actions v] [Confirm] [Need info] [Workflow v]

Backt ot the actually issue:

The update interval is set to 60 seconds. Since log entries in question came in a bulk within 8 sec, I didn't thought it could be interval related. Aynhow, I believe I played with interval in the past as well - not sure.

There is only the PROBLEM event for that time frame - that's the problem. There exist events for the past, but they are days older.

The triggers have 'Multiple PROBLEM events generation' not set.

I omitted the brackets because I expected them to be not necessary due to operator precedence.

Local time is actually not the same. Date formatting just doesn't allow a greater detail.
Up to know I thought 'Timestamp' is set individually on every insert. I'll check that.

Some additional information about the example in this issue.
The item collects availability messages for several "foobar's" (>50).
Here "foo" is unique and "bar's" might be different ones - just don't think about the "bar's", except that there are other values beside "foo" and their corersponding timestamps.

Edit:
I started yesterday to inspect the source code, suspecting expression evaluation order.
~~I stopped at zbx_vector_ptr_create(&trigger_order), which definition I didn't found.~~

Edit:
please ignore my previous comment. Sometimes it helps having slept over it

Comment by Marc [ 2013 Aug 31 ]

This expression is another example:

{Template:log["{$LOG_FILE}",FOO BAR,ISO8859-1,1,skip].str(BEGIN)}=1

The corresponding item gets "FOO BAR BEGIN" and "FOO BAR END" values.
This trigger is assigned to over 1k hosts. Once a day the log is written at similar times. Every day the trigger fails for a couple of hosts (~10-30) - mostly different ones

Comment by Marc [ 2013 Nov 06 ]

Seems not to happen anymore since updating to 2.0.9.

Comment by richlv [ 2013 Nov 06 ]

reopen to clear 'fix version'

Comment by Marc [ 2013 Nov 08 ]

I've been too early pleased.
The latter (BEGIN/END thing) just happens again.
The first one just happens as well

Comment by Exploit Natixis [ 2013 Dec 18 ]

I suspect I got the same problem but with a zabbix trapper, items with different values are generally generated every 5mn and trigger changes status properly. One time, the delay was 15s between two value and the trigger didn't change its status.
Env Zabbix 2.0.9

Alain

Comment by Marc [ 2014 Mar 22 ]

This issue still exists in release 2.2.2

Comment by Marc [ 2014 Jul 07 ]

Apparently didn't happen anymore since updating from 2.2.2 to 2.2.4.
However, this time I'll wait a bit longer until considering this issue being fixed

Comment by Marc [ 2014 Jul 17 ]

Appears to be fixed.

Comment by Oleksii Zagorskyi [ 2014 Jul 17 ]

There were no fixes in zabbix code, issue reopened and closed as cannot reproduce.

Comment by Marc [ 2014 Jul 17 ]

Well, there was one that's for sure. But I can't tell which one
Anyway I'm fine with that.

It's the first time that I can make seriously use of log monitoring functionality in Zabbix and I'm pretty happy about that

Comment by Marc [ 2015 Mar 28 ]

Notice this issue again regularly.
Seams like happening when both item values (OK/PROBLEM) occur for the same time (second).

Edit:

zabbix=# SELECT id,
       clock,
       ns,
       timestamp
FROM history_log
WHERE itemid = 428480
  AND clock = 1427506557
ORDER BY timestamp;
    id     |   clock    |    ns     | timestamp
-----------+------------+-----------+------------
 930030690 | 1427506557 | 114980302 | 1427506503
 930030713 | 1427506557 | 114431409 | 1427506520
(2 rows)

zabbix=#

Judging id and timestamp item values got obviously created in the right chronological order. The ns value looks suspicious though...

Comment by Oleksii Zagorskyi [ 2015 Mar 29 ]

Indeed the ns order looks strange.

Just FYI: According to ~~ZBXNEXT-457~~, zabbix agent should send "ns" for every value.
For older agents (lower 2.0) the ns (if missing) will be generated by server.

What is agent version and agent's OS?
What is the item key and update interval?

Just in case need to remember about ~~ZBX-6353~~ implemented recently for 2.4.2

Comment by Marc [ 2015 Mar 29 ]

Just took some random samples (out of 60 from last night):

2.0.10, 2.2.0, 2.2.7, 2.4.2

AIX 6100-09-03-1415, CentOS release 5.11 (Final), CentOS release 6.6 (Final), CentOS Linux release 7.0.1406 (Core)

Btw, Zabbix server is 2.2.9 and proxies should largely be 2.2.8.
As for the item key see a previous comment. The item interval is set to 60 seconds.

Another suspicious detail might be that the issue apparently happens when the both values (BEGIN/END, resp PROBLEM/OK) occur within 30 seconds. But that needs to be double-checked whether it's indeed the case without any exception.

Edit:
So far it seams this issue occurs again since updated to 2.2.8.

Comment by Glebs Ivanovskis (Inactive) [ 2016 May 24 ]

okkuv9xh, how actively "problematic" agents communicate with server/proxy?

There is a chance that log lines A and B (appearing in log in this order) will be sent by agent not in one buffer sending operation, but in two. There is a chance that TCP connection transmitting B will be accepted and processed by server/proxy earlier than connection transmitting A. This way we will get clock&ns(A)>clock&ns(B) and timestamp(A)<timestamp(B) like you pasted above.

Is there a still chance to get hold of respective entries from events table?

Comment by Glebs Ivanovskis (Inactive) [ 2017 Mar 30 ]

Or the problem can be in proxy_timediff and slightly fluctuating clocks as in ~~ZBX-9939~~.

Generated at Sat Aug 02 10:52:04 EEST 2025 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.

[ZBX-6942] Unreliable trigger based on log items Created: 2013 Aug 30 Updated: 2019 Dec 10