
Type: New Feature Request
Resolution: Unresolved
Priority: Minor
Fix Version/s: None
Affects Version/s: None
Component/s: Frontend (F), Server (S)
Labels:
- eventcorrelation
- triggeractions

Background

Zabbix provides global event correlation rules that allow problem events to be closed based on defined conditions (tags, event source, etc.). While this mechanism helps reduce alert noise, it is limited to closing events and does not fully support root cause analysis in complex infrastructures.

This feature request proposes enhancements to the global event correlation engine to better support cause/symptom relationships, severity handling, and action filtering.
The main use case described here is a multi-site infrastructure; additional applicable scenarios (e.g. application stacks, shared resources, power domains, and clustered environments) will be illustrated and discussed in the comments.

Use case: multi-site connectivity failure

In a typical multi-site environment, each site contains multiple monitored devices (routers, firewalls, switches, servers, access points, etc.).

If the main connectivity device of a site (e.g. a router or firewall) becomes unavailable:

- All other hosts in that site become unreachable
- Zabbix generates multiple problem events (host unreachable, agent unavailable, ICMP loss, etc.)
- Operators must manually recognize that these problems are symptoms of a single root cause

This results in:

- Alert storms
- Reduced visibility of the real issue
- Manual effort to distinguish causes from symptoms

Proposed Tagging Model

Hosts and/or triggers can be consistently tagged, for example:

- SITE:<site_name> (e.g. SITE:Milan)
- ROLE:<device_role> (e.g. ROLE:firewall, ROLE:router, ROLE:switch, ROLE:server)

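This tagging model already fits well with Zabbix best practices: tags defined on templates, hosts, and triggers are inherited by the problem events they generate, and event tags can already be matched in correlation rules and action conditions.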
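A minimal Python sketch of how the SITE/ROLE tags above could be represented and compared. It stores tags as a list of tag/value pairs, roughly the shape in which Zabbix exposes event tags; the helper names (`parse_tag`, `tag_value`, `share_tag`) are hypothetical and used only for illustration.

```python
from typing import Optional

# Tags as a list of {"tag": ..., "value": ...} pairs; a tag name may repeat.
Tags = list[dict[str, str]]


def parse_tag(text: str) -> dict[str, str]:
    """Split 'NAME:value' into a tag/value pair (the value may be empty)."""
    name, _, value = text.partition(":")
    return {"tag": name, "value": value}


def tag_value(tags: Tags, name: str) -> Optional[str]:
    """Return the first value recorded for tag `name`, or None if absent."""
    for t in tags:
        if t["tag"] == name:
            return t["value"]
    return None


def share_tag(a: Tags, b: Tags, name: str) -> bool:
    """True if both tag sets carry tag `name` with the same value."""
    va, vb = tag_value(a, name), tag_value(b, name)
    return va is not None and va == vb


fw = [parse_tag("SITE:Milan"), parse_tag("ROLE:firewall")]
srv = [parse_tag("SITE:Milan"), parse_tag("ROLE:server")]
ap = [parse_tag("SITE:Rome"), parse_tag("ROLE:access_point")]

print(share_tag(fw, srv, "SITE"))  # True
print(share_tag(fw, ap, "SITE"))   # False
```

The "Same SITE tag" condition used later in this request reduces to a check like `share_tag(a, b, "SITE")`.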

Proposed functional enhancements

1. Automatic cause/symptom classification via event correlation

Extend global event correlation rules to automatically classify related problems as:

- Cause (the root problem)
- Symptom (secondary problems)

using logic such as:

- Same SITE tag
- Specific ROLE values (e.g. firewall/router preferred as cause)
- Event timing and dependency

This would leverage and automate the existing cause and symptom concept currently available only through manual intervention in the UI.
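The classification logic could look roughly like the following Python sketch. It assumes a simplified event model (one value per tag name), a hypothetical `CAUSE_ROLE_PRIORITY` list, and a hypothetical 300-second time window; none of these are existing Zabbix settings, and the `cause_eventid` field is an illustrative assumption. Dependencies are not modelled: only the SITE tag, ROLE preference, and timing are used.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical preference order: earlier entries are preferred as the cause.
CAUSE_ROLE_PRIORITY = ["firewall", "router"]


@dataclass
class Event:
    eventid: int
    clock: int                           # Unix time the problem started
    tags: dict[str, str]                 # simplified: one value per tag name
    rank: str = "unclassified"           # "cause", "symptom" or "unclassified"
    cause_eventid: Optional[int] = None  # set on symptoms only


def classify(events: list[Event], window: int = 300) -> None:
    """Mark one cause per SITE and link same-SITE problems to it."""
    by_site: dict[str, list[Event]] = {}
    for ev in events:
        site = ev.tags.get("SITE")
        if site is not None:
            by_site.setdefault(site, []).append(ev)

    for site_events in by_site.values():
        candidates = [e for e in site_events
                      if e.tags.get("ROLE") in CAUSE_ROLE_PRIORITY]
        if not candidates:
            continue
        # Preferred role first; among equal roles, the earliest problem wins.
        cause = min(candidates, key=lambda e: (
            CAUSE_ROLE_PRIORITY.index(e.tags["ROLE"]), e.clock))
        # Symptoms: other problems at the same site starting at or after
        # the cause and no more than `window` seconds later.
        symptoms = [e for e in site_events
                    if e is not cause and 0 <= e.clock - cause.clock <= window]
        if not symptoms:
            continue
        cause.rank = "cause"
        for s in symptoms:
            s.rank = "symptom"
            s.cause_eventid = cause.eventid


evs = [
    Event(1, 1000, {"SITE": "Milan", "ROLE": "firewall"}),
    Event(2, 1030, {"SITE": "Milan", "ROLE": "switch"}),
    Event(3, 1045, {"SITE": "Milan", "ROLE": "server"}),
    Event(4, 1050, {"SITE": "Rome", "ROLE": "server"}),
]
classify(evs)
print([(e.eventid, e.rank, e.cause_eventid) for e in evs])
# [(1, 'cause', None), (2, 'symptom', 1), (3, 'symptom', 1), (4, 'unclassified', None)]
```

The Rome event stays unclassified because its site has no firewall or router problem to act as the cause.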

2. Automatic severity adjustment for cause and symptom problems

Extend global event correlation rules to allow dynamic severity modification for both cause and symptom problems once a correlation relationship is established.

Specifically:

- Increase the severity of the root cause problem (e.g. automatically promote it to Disaster) to clearly highlight the primary issue affecting the infrastructure
- Reduce the severity of all correlated symptom problems (e.g. from High to Warning or Information) to minimize noise while keeping the impacted components visible

Severity changes should be rule-driven and based on correlation conditions such as shared tags (e.g. SITE, ROLE, APP, RESOURCE) and event timing.

This approach would:

- Make the real root cause immediately visible in the Problems view and dashboards
- Prevent alert storms caused by cascading failures
- Preserve contextual information about affected services without over-alerting
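As a sketch of the promote/demote behaviour, the Python function below uses the six Zabbix severity levels (Not classified through Disaster). The target and cap values are hypothetical rule parameters. The function only ever raises a cause and only ever lowers a symptom, so an Information symptom is not bumped up to Warning.

```python
from enum import IntEnum


class Severity(IntEnum):
    """The six Zabbix trigger severity levels."""
    NOT_CLASSIFIED = 0
    INFORMATION = 1
    WARNING = 2
    AVERAGE = 3
    HIGH = 4
    DISASTER = 5


def adjusted_severity(rank: str, current: Severity,
                      cause_target: Severity = Severity.DISASTER,
                      symptom_cap: Severity = Severity.WARNING) -> Severity:
    """Severity a correlation rule would assign to a problem of `rank`.

    Causes are only ever raised and symptoms only ever lowered, so a
    symptom already below the cap keeps its original level.
    """
    if rank == "cause":
        return max(current, cause_target)
    if rank == "symptom":
        return min(current, symptom_cap)
    return current


print(adjusted_severity("cause", Severity.HIGH).name)           # DISASTER
print(adjusted_severity("symptom", Severity.HIGH).name)         # WARNING
print(adjusted_severity("symptom", Severity.INFORMATION).name)  # INFORMATION
```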

3. Action filtering based on cause/symptom role

Extend trigger action conditions to allow filtering based on:

- Event role = Cause
- Event role = Symptom

This would enable advanced notification strategies, for example:

- Notify on-call engineers only for root causes
- Send detailed symptom lists to a service desk or ticketing system
- Avoid duplicate or unnecessary alerts
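A Python sketch of how the proposed "Event role" action condition could route notifications. `Action`, `event_role`, `route`, and the recipient names are hypothetical; an action with no role condition matches every event.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Action:
    name: str
    # Proposed condition: None = any event, otherwise "cause" or "symptom".
    event_role: Optional[str]
    recipient: str


def route(rank: str, actions: list[Action]) -> list[str]:
    """Return the recipients of every action whose role condition matches."""
    return [a.recipient for a in actions
            if a.event_role is None or a.event_role == rank]


ACTIONS = [
    Action("Page on-call", "cause", "oncall"),
    Action("Symptom digest", "symptom", "servicedesk"),
    Action("Audit log", None, "archive"),
]

print(route("cause", ACTIONS))    # ['oncall', 'archive']
print(route("symptom", ACTIONS))  # ['servicedesk', 'archive']
```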

Example correlation logic (conceptual)

If a problem event with ROLE:firewall and SITE:X is active AND multiple other problems with the same SITE:X occur within a configurable time window after it, then:

- Mark the firewall event as Cause
- Mark all related events as Symptoms
- Optionally reduce severity of symptom events
- Optionally increase severity of cause event
- Allow actions to trigger only on the cause event
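Putting the conceptual rule together, the following Python sketch assumes hypothetical parameters: a 120-second window, at least two related problems to count as "multiple", Disaster (5) as the cause severity, and Warning (2) as the symptom cap. Passing `None` for a severity parameter skips that optional step. It returns the events on which actions would fire: causes plus any events the rule left unclassified.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    eventid: int
    clock: int                 # Unix time the problem started
    severity: int              # Zabbix scale: 0 (Not classified) .. 5 (Disaster)
    tags: dict[str, str]       # simplified: one value per tag name
    resolved: bool = False
    rank: str = "unclassified"
    cause_eventid: Optional[int] = None


def apply_rule(events: list[Event], cause_role: str = "firewall",
               window: int = 120, min_symptoms: int = 2,
               cause_severity: Optional[int] = 5,
               symptom_severity: Optional[int] = 2) -> list[Event]:
    """Apply the conceptual rule and return the events that should notify."""
    for cause in sorted(events, key=lambda e: e.clock):
        # Only active, not-yet-classified problems with the cause role qualify.
        if (cause.resolved or cause.rank != "unclassified"
                or cause.tags.get("ROLE") != cause_role):
            continue
        site = cause.tags.get("SITE")
        if site is None:
            continue
        related = [e for e in events
                   if e is not cause and e.rank == "unclassified"
                   and e.tags.get("SITE") == site
                   and 0 <= e.clock - cause.clock <= window]
        if len(related) < min_symptoms:
            continue
        cause.rank = "cause"
        if cause_severity is not None:
            cause.severity = max(cause.severity, cause_severity)
        for e in related:
            e.rank = "symptom"
            e.cause_eventid = cause.eventid
            if symptom_severity is not None:
                e.severity = min(e.severity, symptom_severity)
    # Symptoms are held back; causes and untouched events still notify.
    return [e for e in events if e.rank != "symptom"]


events = [
    Event(10, 1000, 4, {"SITE": "Milan", "ROLE": "firewall"}),
    Event(11, 1020, 4, {"SITE": "Milan", "ROLE": "switch"}),
    Event(12, 1040, 3, {"SITE": "Milan", "ROLE": "server"}),
    Event(13, 1500, 4, {"SITE": "Milan", "ROLE": "server"}),  # outside window
]
notify = apply_rule(events)
print([(e.eventid, e.rank, e.severity) for e in events])
# [(10, 'cause', 5), (11, 'symptom', 2), (12, 'symptom', 2), (13, 'unclassified', 4)]
print([e.eventid for e in notify])  # [10, 13]
```

Event 13 still notifies because it started outside the window, so the rule cannot attribute it to the firewall outage.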

Benefits

- Improved root cause analysis without manual intervention
- Reduced alert noise while preserving visibility
- Better scalability for large and distributed environments
- Strong alignment with existing Zabbix concepts (tags, correlation, cause/symptom)
