[ZBXNEXT-3274] Event correlation on trigger level Created: 2016 May 10  Updated: 2024 Apr 10  Resolved: 2017 Feb 28

Status: Closed
Project: ZABBIX FEATURE REQUESTS
Component/s: API (A), Frontend (F), Server (S)
Affects Version/s: None
Fix Version/s: 3.2.0alpha1

Type: New Feature Request Priority: Major
Reporter: Alexei Vladishev Assignee: Unassigned
Resolution: Fixed Votes: 1
Labels: event, eventcorrelation, trigger
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File long_tags.png     File trigger_evaluation_fix.diff    
Team: Team C
Sprint: Sprint 1, Sprint 2

 Description   

Having event tags implemented (ZBXNEXT-2087) opens up a whole new range of possibilities. One of them could be the ability to close a problem only if there is a matching tag.

It would allow closing problems discovered by log monitoring individually, based on event tag data.



 Comments   
Comment by Marc [ 2016 May 10 ]

The application scenario is not clear to me yet, can you please elaborate on the use case?
Since the spec of ZBXNEXT-2087 isn't published yet, it's also not obvious to me how it is going to be implemented and thus could be further used for

By "[...] close problem [...]", do you mean generating an OK event or is it meant to stop an active escalation?

Comment by Alexei Vladishev [ 2016 May 13 ]

A typical use case might be log monitoring with multiple event generation enabled. It basically gives us a choice of how to handle an OK event: close all problems (as it is now) or try to correlate (i.e. find the corresponding PROBLEM event(s)) and close only those problems that have a matching tag value.

Suppose you have a log file:

Apache stopped <- PROBLEM event "Apache is not available" is generated
Oracle stopped <- PROBLEM event "Oracle is not available" is generated
Apache started <- it will close only "Apache is not available" (provided we have tag Application defined and both OK&PROBLEM events have Application=Apache )
Oracle started <- it will close "Oracle is not available"

I'm not sure if I'm being clear. Please wait until it's implemented and documented.
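
The log example above can be sketched in a few lines of Python. This is only an illustration of the idea, not the server implementation: the `Problem` class, the `close_matching` helper and the dictionary representation of tags are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    name: str
    tags: dict           # tag name -> tag value, e.g. {"Application": "Apache"}
    closed: bool = False


def close_matching(problems, ok_tags, correlation_tag):
    """Close only the open problems whose correlation tag value equals the OK event's value."""
    closed = []
    for p in problems:
        if p.closed:
            continue
        if correlation_tag in p.tags and p.tags[correlation_tag] == ok_tags.get(correlation_tag):
            p.closed = True
            closed.append(p.name)
    return closed


problems = [
    Problem("Apache is not available", {"Application": "Apache"}),
    Problem("Oracle is not available", {"Application": "Oracle"}),
]

# "Apache started" arrives: the OK event carries Application=Apache
closed_now = close_matching(problems, {"Application": "Apache"}, "Application")
still_open = [p.name for p in problems if not p.closed]
```

Without correlation, the same OK event would close both problems; with it, only the Apache problem is closed and the Oracle one stays open.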

Comment by Marc [ 2016 May 13 ]

I believe I have understood what you have described. But isn't that already possible with trigger hysteresis?
E.g.:

{TRIGGER.VALUE} = 0 and {host:key.str(Apache stopped)} = 1 or
{TRIGGER.VALUE} = 1 and {host:key.str(Apache started)} = 0

Edit:
Or is it intended that event tags are dynamically applied from an extracted portion of the item value?
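
For comparison, the hysteresis expression above can be simulated as follows. This is a rough sketch that assumes str() returns 1 when the latest value contains the pattern; the function name `next_trigger_value` is made up for illustration.

```python
def next_trigger_value(current, last_line):
    """Evaluate the hysteresis expression for one new log line.

    current is the trigger value before evaluation (0 = OK, 1 = PROBLEM).
    """
    stopped = "Apache stopped" in last_line
    started = "Apache started" in last_line
    problem = (current == 0 and stopped) or (current == 1 and not started)
    return 1 if problem else 0


lines = ["Apache stopped", "Oracle stopped", "Oracle started", "Apache started"]
value = 0
history = []
for line in lines:
    value = next_trigger_value(value, line)
    history.append(value)
```

In this sketch the application name is hard-coded in the expression, so each application would need its own trigger.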

Comment by Andris Zeila [ 2016 Jun 17 ]

Yes, it's meant to work with tags created by extracting a portion of the item value.

Comment by Andris Zeila [ 2016 Jun 27 ]

(1) Database patch ready for testing in development branch svn://svn.zabbix.com/branches/dev/ZBXNEXT-3274

Note that it includes partial database upgrade from ZBXNEXT-3201 (problem and problem_tag table changes)

glebs.ivanovskis Database patch looks good. CLOSED

Comment by Gunars Pujats (Inactive) [ 2016 Jul 04 ]

(2) [A] New fields for trigger, trigger prototype create and update methods.

gunarspujats Coding style

  • include/classes/api/services/CTriggerGeneral.php:541, 721 - use of deprecated function zbx_empty()
  • include/classes/api/services/CTriggerGeneral.php:528, 715 - indentation doesn't match coding guidelines

oleg.egorov RESOLVED in r60885

gunarspujats

  • It's possible to pass "/" character for correlation_tag.
  • it's possible to update correlation fields for templated trigger
  • new fields are not copied when linking template to template

REOPENED

oleg.egorov RESOLVED in r60926, r60923 and r60919

gunarspujats CLOSED. API tested.

Comment by Gunars Pujats (Inactive) [ 2016 Jul 05 ]

(3) [F] Trigger and trigger prototype forms

gunarspujats Coding style

  • include/views/configuration.triggers.edit.php:479, 492 - indentation doesn't match coding guidelines
  • triggers.php:279 - strict comparison used for integer values, non strict must be used
  • triggers.php:279, 282 - correlation_mode and correlation_tag if statements better to put under if ($db_trigger[templateid] == 0) statement at line 251
  • correlation fields are active for templated trigger prototypes

oleg.egorov RESOLVED in r60928, r60929

gunarspujats CLOSED. Frontend tested.

Comment by Andris Zeila [ 2016 Jul 05 ]

Server side ready for testing.

Comment by Gunars Pujats (Inactive) [ 2016 Jul 06 ]

(4) XML import/export successfully tested.

gunarspujats Updated import converter unit test in r60930

sasha Thanks! CLOSED

Comment by Glebs Ivanovskis (Inactive) [ 2016 Jul 11 ]

(5) [S] Trigger last change should not be updated if no event is generated. Suppose there is a trigger with multiple problem event generation, there are several open problems, trigger evaluates to OK, but there is no PROBLEM event to correlate it with.

wiper RESOLVED in r61146
Note that to update triggers server must save at least one event. So to reproduce this problem server must process several events at the same time.

glebs.ivanovskis Now works according to specification. CLOSED

Comment by Glebs Ivanovskis (Inactive) [ 2016 Jul 14 ]

(6) [S] This is the first time we define one DB field length in terms of another field length:

#define TRIGGER_CORRELATION_TAG_LEN	TRIGGER_TAG_LEN

I believe tags are going to be a universal concept across Zabbix so we should keep only one universal length limitation for all flavours of tags.

wiper RESOLVED in r61133

glebs.ivanovskis Great! CLOSED

Comment by Glebs Ivanovskis (Inactive) [ 2016 Jul 15 ]

(7) [S] Looks like (in-memory) tag value of potential OK event must be truncated to the size of tag value field in the database.

Imagine I want to correlate by tag value with tag value {ITEM.VALUE}. I use an always-true trigger {localhost:trapper.text.strlen()}>0 to generate PROBLEM events and flip it to always-false {localhost:trapper.text.strlen()}<0 when I want to generate OK events. In real world trigger "flipping" may occur if there are other items in expression. I send two identical values with a trigger flip in between and I expect my trigger to fire and then come back to OK. If sent value fits into 255 characters Zabbix behaves as expected. If item value is too long trigger stays in PROBLEM.

wiper RESOLVED in r61133

glebs.ivanovskis Truncation is done after validation. If tag name is {ITEM.VALUE} and {ITEM.VALUE} is "<255 spaces>something valid", tag name will be considered valid, but then truncated and saved as pure whitespace. Not sure how to proceed with that because if we truncate invalid part (containing "/") before validation it won't be good too.
REOPENED

wiper Another question is whether we should trim leading/trailing whitespace of tags/values
RESOLVED in r61328

glebs.ivanovskis Check for whitespace can now be removed from validate_event_tag().

wiper Right, RESOLVED in r61372

glebs.ivanovskis Nice!
CLOSED
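
The two effects discussed in this sub-issue can be illustrated with a small Python sketch. It assumes a 255-character limit counted in characters and a "trim, then truncate" order; the actual server code may differ, and `normalize` is a hypothetical helper.

```python
TAG_VALUE_LEN = 255  # field length mentioned in the discussion (assumed to be in characters)


def normalize(value, limit=TAG_VALUE_LEN):
    """Trim surrounding whitespace, then truncate to the database field length."""
    return value.strip()[:limit]


long_value = "x" * 300

stored_problem_tag = long_value[:TAG_VALUE_LEN]  # what the database keeps for the PROBLEM event
raw_ok_tag = long_value                          # untruncated in-memory value of the OK event

# Comparing the raw in-memory value with the stored one never matches for long values
matches_without_truncation = (raw_ok_tag == stored_problem_tag)
# Normalizing both sides the same way makes identical item values correlate again
matches_with_truncation = (normalize(raw_ok_tag) == normalize(stored_problem_tag))

# Order matters when the value starts with a long run of whitespace
padded = " " * 255 + "something valid"
truncate_then_trim = padded[:TAG_VALUE_LEN].strip()
trim_then_truncate = normalize(padded)
```

With truncation first, the padded value collapses to an empty string; trimming first keeps "something valid".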

Comment by Glebs Ivanovskis (Inactive) [ 2016 Jul 15 ]

(8) [D] Potential caveats for misconfiguration.

In the use case of two applications writing error and recovery messages to one log file user may be tempted to use two tags application with different tag values to apply separate regular expressions to extract application A name and application B name from {ITEM.VALUE} (e.g. message formats may differ). This may not work as planned because of current regexp and tag matching logic. Not-matching regexp will yield an empty tag value and a single empty tag value in PROBLEM and OK events is enough to correlate them. So a recovery message from application A may accidentally close error message from application B.

Invalid regular expressions are silently replaced with *UNKNOWN*. Since actual tags and tag values become visible only when trigger fires, debugging will not be easy. If initial PROBLEM event with *UNKNOWN* tag value was missed, subsequent OK events (with same *UNKNOWN* tag value) may close PROBLEM events which they shouldn't have closed.

A similar situation may arise if the user uses {ITEM.VALUE} without macro functions as the tag value and the 255-character limit kicks in, for example when log messages are long and their first 255 characters are non-specific.
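
The first caveat can be reproduced with a short Python sketch. The message formats, regular expressions and helper names below are invented for illustration; the only behaviour taken from the description is that a non-matching regexp yields an empty tag value and that one equal tag/value pair is enough to correlate.

```python
import re


def extract_tag(pattern, message):
    """Return the first capture group, or an empty string when the regexp does not match."""
    m = re.search(pattern, message)
    return m.group(1) if m else ""


APP_A = r"^appA\[(\w+)\]"   # hypothetical message format of application A
APP_B = r"^appB: (\w+) "    # hypothetical message format of application B


def event_tags(message):
    # Two tags with the same name "application", one per regular expression
    return [("application", extract_tag(APP_A, message)),
            ("application", extract_tag(APP_B, message))]


def correlates(problem_tags, ok_tags):
    """A single equal (name, value) pair is enough to correlate the events."""
    return any(tag in problem_tags for tag in ok_tags)


problem_b = event_tags("appB: billing failed")     # error message from application B
ok_a = event_tags("appA[frontend] recovered")      # recovery message from application A

accidental_match = correlates(problem_b, ok_a)
```

Both events end up carrying an empty "application" value from the regexp that did not match, so the recovery from application A correlates with the problem from application B.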

sandis.neilands Encountered this and other severe usability issues while testing ZBXNEXT-3457. Some feedback or event tracing capability would be nice. See ZBXNEXT-3473.

martins-v RESOLVED in

glebs.ivanovskis At least second and third points are applicable to global correlation as well. And in my opinion such usability flaw in the most critical part of monitoring software deserves a big red banner in documentation.

martins-v RESOLVED

glebs.ivanovskis Thank you for such verbosity!
CLOSED

Comment by Alexander Vladishev [ 2016 Jul 15 ]

Available in pre-3.1.0 (trunk) r61049.

Comment by Glebs Ivanovskis (Inactive) [ 2016 Jul 19 ]

(9) [S] If a new trigger is created with OK event closes: All problems if tag values match and it receives values which are OK then no OK events are generated and trigger last change is not updated. In Monitoring -> Triggers this trigger stays with Last change: Never and Ack: No events until it receives first PROBLEM value.

wiper That's by design. In the described scenario the OK events are generated internally, but dropped (ignored). Trigger lastchange is updated only when an event is really generated (saved into db).
CLOSED

Comment by Andris Zeila [ 2016 Jul 19 ]

(10) [S] Processing of triggers with recovery_mode none is broken - when evaluating to OK an event with value 3 will be written into database.

There is a preliminary fix in ZBXNEXT-3277-4 r61115, not thoroughly tested yet.

wiper proper fix is attached (trigger_evaluation_fix.diff). It also fixes more bugs:

  • when a new trigger with a recovery expression is created, it will generate an OK event if the trigger expression evaluates to false, ignoring the recovery expression
  • when trying to recover multiple events the r_eventid is null condition was applied only to the first event problem lookup

wiper RESOLVED in r61117

wiper Actually, I also created a new development branch svn://svn.zabbix.com/branches/dev/ZBXNEXT-3274-2 to speed up the testing process of this problem, as it breaks multiple event processing.

glebs.ivanovskis Code logic in r61117 looks fine, no more false events generated in provided scenario. Please merge svn://svn.zabbix.com/branches/dev/ZBXNEXT-3274-2
CLOSED

wiper merged in trunk r61156

Comment by Andris Zeila [ 2016 Jul 20 ]

Created another development branch svn://svn.zabbix.com/branches/dev/ZBXNEXT-3274-3 for fixing bugs (except the 10th subissue)

Comment by Andris Zeila [ 2016 Jul 20 ]

(11) [S] Applied suggested code changes to event saving in r61130

glebs.ivanovskis I like it! Please have a look at one more suggestion in r61131. This will not work, reverted in r61134.

CLOSED

Comment by Glebs Ivanovskis (Inactive) [ 2016 Jul 21 ]

(12) [S] In correlate_events_by_trigger_rules() let's use flag variable with a meaningful name instead of

sql_offset_old = sql_offset;
...
if (sql_offset_old != sql_offset)
...

wiper RESOLVED in r61225

glebs.ivanovskis Thank you! CLOSED

Comment by Andris Zeila [ 2016 Jul 22 ]

(13) Tracking the number of open problems in trigger.problem_count is problematic because of housekeeper. Currently housekeeper can delete open problems. So we would have to do one of the following:

  • update triggers from housekeeper
  • remove only closed problems and add some mechanism to remove old events that were closed after housekeeper has processed this time period.

Neither solution is good. It would be better to drop the tracking of open problems in the database/configuration cache and recalculate it every time an event from a trigger with correlation mode enabled is processed.

wiper The problem_count was removed from triggers table and will be queried from problem table when an event is being correlated (either by trigger or global rule).
RESOLVED in r61166

glebs.ivanovskis Looks OK.
CLOSED
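
Counting open problems on demand, instead of keeping a counter in sync, could look roughly like this. The sketch uses an in-memory SQLite database and a heavily simplified problem table; only the r_eventid column name comes from this thread, the rest of the schema and the function name are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the problem table: r_eventid is NULL while the problem is open
conn.execute(
    "CREATE TABLE problem (eventid INTEGER PRIMARY KEY, objectid INTEGER, r_eventid INTEGER)"
)
conn.executemany(
    "INSERT INTO problem VALUES (?, ?, ?)",
    [
        (1, 100, None),  # open problem of trigger 100
        (2, 100, 5),     # problem of trigger 100 closed by event 5
        (3, 100, None),  # open problem of trigger 100
        (4, 200, None),  # open problem of another trigger
    ],
)


def open_problem_count(conn, triggerid):
    """Count open problems of a trigger at the moment an event is correlated."""
    row = conn.execute(
        "SELECT COUNT(*) FROM problem WHERE objectid = ? AND r_eventid IS NULL",
        (triggerid,),
    ).fetchone()
    return row[0]


before = open_problem_count(conn, 100)
# Housekeeper deletes an open problem; no separate counter has to be updated
conn.execute("DELETE FROM problem WHERE eventid = 1")
after = open_problem_count(conn, 100)
```

Because the count is always derived from the table itself, a housekeeper deletion cannot leave a stale counter behind.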

Comment by Andris Zeila [ 2016 Jul 26 ]

(14) [S] Freeing of uninitialized memory when validating triggers during LLD processing.
RESOLVED in r61217

sandis.neilands CLOSED.

Comment by Andris Zeila [ 2016 Jul 26 ]

(15) [S] The zbx_event_recovery_t structure used to pass closed event data to process_actions() has a lot of data used only for event processing and not really needed to expose outside it. It would be better to use simple sorted vector of uint64 pairs (problem eventid, ok eventid).
RESOLVED in r61224

sandis.neilands CLOSED.
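
The idea of a sorted vector of (problem eventid, ok eventid) pairs can be sketched in Python with the bisect module. This only illustrates the data structure; the names `recoveries`, `add_recovery` and `find_ok_event` are hypothetical and do not correspond to server functions.

```python
import bisect

recoveries = []  # kept sorted: list of (problem_eventid, ok_eventid) pairs


def add_recovery(problem_eventid, ok_eventid):
    """Insert a pair while keeping the list sorted by problem event id."""
    bisect.insort(recoveries, (problem_eventid, ok_eventid))


def find_ok_event(problem_eventid):
    """Binary search for the OK event that closed the given problem event."""
    i = bisect.bisect_left(recoveries, (problem_eventid, 0))
    if i < len(recoveries) and recoveries[i][0] == problem_eventid:
        return recoveries[i][1]
    return None


add_recovery(42, 57)
add_recovery(7, 57)   # one OK event may close several problems
add_recovery(19, 60)
```

A consumer such as action processing then only needs the pairs themselves, with lookups by problem event id done by binary search.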

Comment by Andris Mednis [ 2016 Aug 01 ]

(16) [F] When trying to disable any trigger like "Mounted filesystem discovery: Free disk space is less than 20% on volume ...." or "Mounted filesystem discovery: Free inodes is less than 20% on volume .... " an error message is shown in frontend:

Cannot disable trigger. 
Cannot update "correlation_mode" for a discovered trigger "Free disk space is less than 20% on volume ...."

gunarspujats RESOLVED in development branch svn://svn.zabbix.com/branches/dev/ZBXNEXT-3274-4 , r61304.

sasha "Tag for matching" is not visible now for templated or lld-created triggers.

REOPENED

gunarspujats RESOLVED in development branch svn://svn.zabbix.com/branches/dev/ZBXNEXT-3274-4 , r61371.

sasha correlation_mode and correlation_tag should not be visible when recovery_mode is None.

RESOLVED in dev branch svn://svn.zabbix.com/branches/dev/ZBXNEXT-3274-4 r61398.

iivs CLOSED

Comment by Glebs Ivanovskis (Inactive) [ 2016 Aug 01 ]

(17) [S] If for LLD trigger expanded correlation tag is under 255 characters, expanded tag name is under 255 characters but expanded tag value is over 255 characters (Imagine prototype with "{#MACRO}", "{#MACRO}", "{#MACRO} plus few extra characters" respectively in these fields, and {#MACRO} is reasonably long) discovered trigger will have tag for matching but will not have a tag meaning that once becoming PROBLEM it will always stay in PROBLEM. In my opinion it is more logical to use not so strict rules for tag name validation in LLD. Maybe simple truncation will do?

wiper we decided to keep the same lld processing rules as for other objects. If the property fails validation then new objects will not be discovered and existing objects will be partially updated.
CLOSED

Comment by Glebs Ivanovskis (Inactive) [ 2016 Aug 01 ]

(18) [F] Long tags break the table and for some reason tag3 is longer than tag and tag2.

PavelA RESOLVED in dev branch svn://svn.zabbix.com/branches/dev/ZBXNEXT-3277-4, r61367

sasha Thanks! CLOSED

Comment by Glebs Ivanovskis (Inactive) [ 2016 Aug 02 ]

(19) [S] Really minor. event_recovery and event_queue hashsets are never destroyed. It will not lead to memory leaks since their lifetime is as long as Zabbix runtime but is a bit unclean.

wiper It was done on purpose to avoid reallocating those hashsets each time events are generated. Although the possible gains are mostly theoretical, I wanted to keep it similar to events array allocation.

glebs.ivanovskis Don't get me wrong, I like resource re-usage. I didn't like initialize_events() in three different places. RESOLVED in r61324, please have a look.

wiper Oh, now I got it. I thought about global initialization, but didn't like that each process would have to initialize events. It didn't occur to me that it could be done before forking. I added rule cleanup to zbx_dc_correlation_rules_free() function, please review r61325

glebs.ivanovskis Valuable addition, thanks!
CLOSED

Comment by Andris Zeila [ 2016 Aug 02 ]

(20) [S] new triggers should not be discovered if tag validation failed. Also triggers must be checked for duplicated tags.

RESOLVED in r61327

glebs.ivanovskis Looks fine! Minor stylistic fix in r61374.
CLOSED

Comment by Gunars Pujats (Inactive) [ 2016 Aug 03 ]

(21) Translation string changes:

Strings added:

  • All problems
  • All problems if tag values match
  • OK event closes
  • Tag for matching

sasha CLOSED

Comment by Andris Zeila [ 2016 Aug 03 ]

(22) [S] Time-based dependent trigger unlocking is broken. It depends on the order in which the triggers were retrieved, but the vector is now sorted by topology when triggers are processed.

RESOLVED in r61347

glebs.ivanovskis After several iterations (r61362, r61374, r61381, r61382), I guess, we have arrived at the final looks of time-based trigger processing. sandis.neilands, please review changes I made upon your request in r61383. From my side CLOSED

sandis.neilands Thanks! r61383 is CLOSED.

Comment by Glebs Ivanovskis (Inactive) [ 2016 Aug 03 ]

Server side successfully tested.

A note for those brave users who were following trunk very closely (r61049-r61393). After the merge of svn://svn.zabbix.com/branches/dev/ZBXNEXT-3274-3 into trunk (r61393) you will end up with redundant problem_count field in triggers table. It will do no harm, but feel free to remove it manually after server upgrade. Those who will use trunk post-r61393 or upgrade to 3.2 directly from 3.0 and earlier versions should not worry.

Comment by Andris Zeila [ 2016 Aug 04 ]

The ZBXNEXT-3274-3 development branch was merged into trunk r61393

Comment by Alexander Vladishev [ 2016 Aug 04 ]

(23) [A] trigger.create()/update() and triggerprototype.create()/update() silently ignore incompatible combinations of parameters.

For example:

[
    ...,
    'correlation_mode' => ZBX_TRIGGER_CORRELATION_NONE,
    'correlation_tag' => 'tag name'
]

sasha RESOLVED in dev branch svn://svn.zabbix.com/branches/dev/ZBXNEXT-3274-4 r61394.

iivs CLOSED
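
A validation step that rejects such combinations instead of silently ignoring them might look like the following Python sketch. The numeric constant values, the ZBX_TRIGGER_CORRELATION_TAG name, the error messages and the `validate_correlation` function are assumptions made for illustration; the "/" restriction is taken from sub-issue (2).

```python
ZBX_TRIGGER_CORRELATION_NONE = 0  # assumed value
ZBX_TRIGGER_CORRELATION_TAG = 1   # assumed name and value


def validate_correlation(trigger):
    """Raise ValueError for incompatible correlation_mode/correlation_tag combinations."""
    mode = trigger.get("correlation_mode", ZBX_TRIGGER_CORRELATION_NONE)
    tag = trigger.get("correlation_tag", "")

    if mode == ZBX_TRIGGER_CORRELATION_NONE and tag != "":
        raise ValueError('"correlation_tag" must be empty when correlation is disabled')
    if mode == ZBX_TRIGGER_CORRELATION_TAG:
        if tag == "":
            raise ValueError('"correlation_tag" cannot be empty when correlating by tag')
        if "/" in tag:
            raise ValueError('"correlation_tag" cannot contain "/"')
    return True


# The combination from the example above is now rejected explicitly
try:
    validate_correlation({
        "correlation_mode": ZBX_TRIGGER_CORRELATION_NONE,
        "correlation_tag": "tag name",
    })
    rejected = False
except ValueError:
    rejected = True
```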

Comment by Alexander Vladishev [ 2016 Aug 04 ]

The ZBXNEXT-3274-4 development branch was merged into trunk r61430

Comment by Ivo Kurzemnieks [ 2016 Aug 05 ]

(24) [D] API documentation updated:

sandis.neilands REOPENED. Do we support tags in trigger object? trigger.create(). trigger.get(), etc? If yes - then it must be documented. Otherwise - it must be implemented.

iivs Yes, we do support them, but it's irrelevant to this task.
Trigger tags should've been documented here https://support.zabbix.com/browse/ZBXNEXT-2087 (25), but still aren't.

sasha CLOSED

Comment by dimir [ 2017 Feb 09 ]

Sub-issue (8) is still open.

Comment by Andris Zeila [ 2017 Feb 13 ]

Sent trigger event generation documentation draft for review & update.

Generated at Thu Apr 25 00:22:33 EEST 2024 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.