Type: New Feature Request
Resolution: Unresolved
Priority: Minor
Affects Version/s: None
Component/s: Frontend (F), Server (S)
Background
Zabbix provides global event correlation rules that allow events to be closed or suppressed based on defined conditions (tags, event source, etc.). While this mechanism is useful to reduce alert noise, it is currently limited to closing events and does not fully support root cause analysis in complex infrastructures.
This feature request proposes enhancements to the global event correlation engine to better support cause/symptom relationships, severity handling, and action filtering.
The main use case in this request is a multi-site infrastructure. Other applicable scenarios (e.g. application stacks, shared resources, power domains, and clustered environments) will be illustrated and discussed in the comments.
Use case: multi-site connectivity failure
In a typical multi-site environment, each site contains multiple monitored devices (routers, firewalls, switches, servers, access points, etc.).
If the main connectivity device of a site (e.g. router or firewall) becomes unavailable:
- All other hosts in that site become unreachable,
- Zabbix generates multiple problem events (host unreachable, agent unavailable, ICMP loss, etc.),
- Operators must manually identify that these problems are symptoms of a single root cause.
This results in:
- Alert storms,
- Reduced visibility of the real issue,
- Manual effort to distinguish cause vs symptoms.
Proposed tagging model
Hosts and/or triggers can be consistently tagged, for example:
- SITE:<site_name> (e.g. SITE:Milan)
- ROLE:<device_role> (e.g. ROLE:firewall, ROLE:router, ROLE:switch, ROLE:server)
This tagging model already fits well with Zabbix best practices and is supported by triggers, events, correlation rules, and actions.
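As a rough illustration of this tagging model, the sketch below represents tags as a list of tag/value pairs (the shape the Zabbix API uses for host tags) and groups hosts by site. The host names and the helpers `tag_value` and `group_by_site` are hypothetical and only serve the example.

```python
from collections import defaultdict

# Sample hosts (hypothetical) tagged with SITE and ROLE.
hosts = [
    {"host": "fw-milan-01", "tags": [{"tag": "SITE", "value": "Milan"},
                                     {"tag": "ROLE", "value": "firewall"}]},
    {"host": "sw-milan-01", "tags": [{"tag": "SITE", "value": "Milan"},
                                     {"tag": "ROLE", "value": "switch"}]},
    {"host": "srv-rome-01", "tags": [{"tag": "SITE", "value": "Rome"},
                                     {"tag": "ROLE", "value": "server"}]},
]


def tag_value(tags, name):
    """Return the value of the first tag called `name`, or None if absent."""
    for t in tags:
        if t["tag"] == name:
            return t["value"]
    return None


def group_by_site(hosts):
    """Map each SITE value to the list of host names carrying that tag."""
    by_site = defaultdict(list)
    for h in hosts:
        by_site[tag_value(h["tags"], "SITE")].append(h["host"])
    return dict(by_site)


sites = group_by_site(hosts)
```

With consistent tags like these, a correlation rule only needs to compare SITE values and inspect ROLE to decide which problems belong together.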
Proposed functional enhancements
1. Automatic cause/symptom classification via event correlation
Extend global event correlation rules to automatically classify related problems as:
- Cause (root problem),
- Symptom (secondary problems),
using logic such as:
- Same SITE tag,
- Specific ROLE values (e.g. firewall/router preferred as cause),
- Event timing and dependency.
This would automate the existing cause and symptom concept, which is currently available only through manual intervention in the UI.
2. Automatic severity adjustment for cause and symptom problems
Extend global event correlation rules to allow dynamic severity modification for both cause and symptom problems once a correlation relationship is established.
Specifically:
- Increase the severity of the root cause problem (e.g. automatically promote it to Disaster) to clearly highlight the primary issue affecting the infrastructure,
- Reduce the severity of all correlated symptom problems (e.g. from High to Warning or Information) to minimize noise while keeping visibility of impacted components.
Severity changes should be rule-driven and based on correlation conditions such as shared tags (e.g. SITE, ROLE, APP, RESOURCE) and event timing.
This approach would:
- Make the real root cause immediately visible in the Problems view and dashboards,
- Prevent alert storms caused by cascading failures,
- Preserve contextual information about affected services without over-alerting.
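The severity adjustment described above could behave roughly as follows. This is only a sketch of the requested behaviour, not existing Zabbix functionality: the function `adjust_severity`, its parameters, and the `role` values are assumptions. The numeric levels follow the standard Zabbix severity scale (0 = Not classified to 5 = Disaster).

```python
# Standard Zabbix trigger severity levels.
SEVERITY = {
    "Not classified": 0,
    "Information": 1,
    "Warning": 2,
    "Average": 3,
    "High": 4,
    "Disaster": 5,
}


def adjust_severity(role, severity,
                    cause_target=SEVERITY["Disaster"],
                    symptom_cap=SEVERITY["Warning"]):
    """Return the severity after correlation (hypothetical rule).

    role: "cause" for a problem that has correlated symptoms,
          "symptom" for a correlated secondary problem,
          None for a problem that was not correlated at all.
    """
    if role == "cause":
        # Promote the root cause, never demote it.
        return max(severity, cause_target)
    if role == "symptom":
        # Cap symptoms, never raise them.
        return min(severity, symptom_cap)
    return severity


promoted = adjust_severity("cause", SEVERITY["High"])
demoted = adjust_severity("symptom", SEVERITY["High"])
```

Using `max` and `min` rather than a fixed assignment keeps the rule from accidentally demoting a cause or raising a symptom that is already below the cap.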
3. Action filtering based on cause/symptom role
Extend trigger action conditions to allow filtering based on:
- Event role = Cause
- Event role = Symptom
This would enable advanced notification strategies, for example:
- Notify on-call engineers only for root causes,
- Send detailed symptom lists to a service desk or ticketing system,
- Avoid duplicate or unnecessary alerts.
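To make the notification strategy concrete, here is a small sketch of how actions might route problems once an "event role" condition exists. The field names (`role`, `cause_eventid`), the target names, and the function `build_notifications` are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def build_notifications(problems):
    """Page on-call once per cause; bundle symptoms into one ticket per cause."""
    symptoms = defaultdict(list)
    for p in problems:
        if p["role"] == "symptom":
            symptoms[p["cause_eventid"]].append(p["name"])

    # One page per root cause.
    pages = [{"target": "on-call", "problem": p["name"]}
             for p in problems if p["role"] == "cause"]
    # One ticket per cause, carrying the full symptom list.
    tickets = [{"target": "service-desk", "cause_eventid": cid, "symptoms": names}
               for cid, names in symptoms.items()]
    return pages, tickets


problems = [
    {"eventid": 1, "name": "Firewall unreachable", "role": "cause", "cause_eventid": None},
    {"eventid": 2, "name": "Switch ICMP loss", "role": "symptom", "cause_eventid": 1},
    {"eventid": 3, "name": "Agent unavailable", "role": "symptom", "cause_eventid": 1},
]
pages, tickets = build_notifications(problems)
```

In this example the on-call engineer receives a single page for the firewall, while the service desk receives one ticket listing both symptoms.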
Example correlation logic (conceptual)
If a problem event with ROLE:firewall and SITE:X is active AND multiple other problems with the same SITE:X occur shortly after, then:
- Mark the firewall event as Cause
- Mark all related events as Symptoms
- Optionally reduce severity of symptom events
- Optionally increase severity of cause event
- Allow actions to trigger only on the cause event
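The conceptual rule above can be sketched in a few lines of Python. This is a simulation of the proposed behaviour under stated assumptions, not Zabbix code: the `Problem` class, the 5-minute window standing in for "shortly after", and the set of cause-preferred roles are all hypothetical. For brevity the sketch does not enforce a minimum number of symptoms.

```python
from dataclasses import dataclass
from typing import Optional

CAUSE_ROLES = {"firewall", "router"}  # assumption: roles preferred as cause
WINDOW_SECONDS = 300                  # assumption: "shortly after" = 5 minutes


@dataclass
class Problem:
    eventid: int
    clock: int                 # event time, Unix seconds
    tags: dict                 # e.g. {"SITE": "Milan", "ROLE": "firewall"}
    rank: str = "cause"        # every problem starts out as its own cause
    cause_eventid: Optional[int] = None


def correlate(problems, window=WINDOW_SECONDS):
    """Mark problems that follow a cause-role problem on the same SITE as symptoms."""
    active_causes = {}  # SITE value -> most recent cause-role Problem
    for p in sorted(problems, key=lambda p: p.clock):
        site = p.tags.get("SITE")
        cause = active_causes.get(site)
        if cause is not None and p.clock - cause.clock <= window:
            p.rank = "symptom"
            p.cause_eventid = cause.eventid
        elif site is not None and p.tags.get("ROLE") in CAUSE_ROLES:
            active_causes[site] = p
    return problems


problems = correlate([
    Problem(1, 1000, {"SITE": "Milan", "ROLE": "firewall"}),
    Problem(2, 1030, {"SITE": "Milan", "ROLE": "switch"}),
    Problem(3, 1045, {"SITE": "Milan", "ROLE": "server"}),
    Problem(4, 1050, {"SITE": "Rome", "ROLE": "server"}),
    Problem(5, 2000, {"SITE": "Milan", "ROLE": "server"}),  # outside the window
])
ranks = [p.rank for p in problems]
```

Here the Milan switch and server problems become symptoms of the firewall event, while the Rome problem and the late Milan problem are left uncorrelated.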
Benefits
- Improved root cause analysis without manual intervention
- Reduced alert noise while preserving visibility
- Better scalability for large and distributed environments
- Strong alignment with existing Zabbix concepts (tags, correlation, cause/symptom)