[ZBXNEXT-7565] Backtrace what caused SLA to drop Created: 2022 Mar 18  Updated: 2022 Dec 29

Status: Open
Project: ZABBIX FEATURE REQUESTS
Component/s: Frontend (F)
Affects Version/s: 6.0.2
Fix Version/s: None

Type: Change Request Priority: Minor
Reporter: Aigars Kadikis Assignee: Unassigned
Resolution: Unresolved Votes: 8
Labels: SLA, sla
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File 01-host-conf.png     PNG File 02-services-conf-problem-tags.png     PNG File 03-services-conf-mapping-sla.png     PNG File 04-farm1-sla-conf.png     PNG File 11-normal-state.png     PNG File 12-one-node-having-problem.png     PNG File 13-root-cause-revealed.png     PNG File 14-back-to-normal.png     PNG File 15-all-we-have.png     PNG File screenshot-2022-01-28_01.PNG    
Issue Links:
Sub-task

 Description   

In a farm we have two or more servers: node1 and node2.
Both servers are tagged farm:1 and run the "ICMP Ping" template:

Services configuration:

SLA conf:

 

Now in a normal situation the picture looks like this:

When one node has a problem, the root cause is highlighted:

By clicking on the link we can see the root cause:

When it goes back to the normal state:

We can browse the SLA report to see SLO and SLI:
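The same SLI/SLO figures shown on that report are also reachable over the API. A minimal sketch, assuming a Zabbix >= 6.0 server; the SLA ID, service IDs, and token below are placeholders, and only the JSON-RPC body for `sla.getsli` is built (no request is sent):

```python
import json

def sli_report_payload(slaid, serviceids, periods, auth_token):
    """Build the JSON-RPC body for sla.getsli (Zabbix >= 6.0),
    which returns the SLI values shown on the SLA report page."""
    return {
        "jsonrpc": "2.0",
        "method": "sla.getsli",
        "params": {
            "slaid": slaid,            # ID of the SLA (placeholder)
            "serviceids": serviceids,  # services mapped to the SLA (placeholder)
            "periods": periods,        # number of most recent periods to return
        },
        "auth": auth_token,            # placeholder API token
        "id": 1,
    }

payload = sli_report_payload("4", ["2"], 7, "TOKEN")
print(json.dumps(payload, indent=2))
```

Posting this body to `api_jsonrpc.php` returns the per-period SLI matrix, but it still carries no root-cause information, which is the gap this request is about.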

But there is no easy way to see what caused the SLA to drop.
(We can go to the Problems page, open the History tab, filter by the "farm1" tag, and browse through the records.)
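The manual workaround above can also be scripted against the API. A hedged sketch, assuming a Zabbix >= 6.0 server; the tag values and token are placeholders, and only the JSON-RPC body for `problem.get` filtered by an event tag is built:

```python
import json

def problems_by_tag_payload(tag, value, auth_token, recent=True):
    """Build the JSON-RPC body for problem.get filtered by an event tag.
    This mirrors the manual workaround: Problems page -> filter by tag."""
    return {
        "jsonrpc": "2.0",
        "method": "problem.get",
        "params": {
            # operator "1" means "equals" in the problem.get tags filter
            "tags": [{"tag": tag, "value": value, "operator": "1"}],
            "recent": recent,          # also include recently resolved problems
            "sortfield": ["eventid"],
            "sortorder": "DESC",
        },
        "auth": auth_token,            # placeholder API token
        "id": 1,
    }

payload = problems_by_tag_payload("farm", "1", "TOKEN")
print(json.dumps(payload, indent=2))
```

This only approximates the ask: it lists problems with the matching tag, but the link back to a specific SLA downtime period still has to be made by eye.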

Please allow the root cause of an SLA drop to be seen directly from a dedicated section.



 Comments   
Comment by Constantin Oshmyan [ 2022 Mar 18 ]

A real use case.
Up to Zabbix version 5.0 (inclusive) we had a "Monitoring" -> "Services" page.
Our top management ("big bosses") could look at this page and see the current state.
Moreover, when a problem had occurred and was then fixed, it was possible to see the decreased SLA percentage and unfold the Services tree to see the root cause of that (already closed) problem. You can see an example in the screenshot:

In version 6.0 this functionality has been lost; it should be restored ASAP (or replaced by some other means).
Our big bosses are dissatisfied. It should be possible to see exactly which problem caused the SLA to drop.

Comment by Tomi Kajander [ 2022 May 17 ]

This functionality should be essential in any SLA reporting tool.

I second this feature request and hope the functionality can be reintroduced, at least in a form similar to what Zabbix 5.0 offered. This was the biggest drawback when we upgraded from 5.0 to 6.0.

Comment by Constantin Oshmyan [ 2022 Oct 17 ]

An additional note here: the "Root cause" column should also include host information (in addition to the problem name).

A real-life case:

  • there was maintenance work (hardware replacement);
  • some triggers fired, so the SLA level dropped below 100%;
  • the old hosts (where the problems fired) were disabled (they were out of service and replaced by new ones);
  • the "Services" -> "Services" screen still displays the degraded SLA level with the old hardware's problems in the "Root cause" column; however, it is not evident which hosts are affected:
    • there is no host information on the "Services" -> "Services" screen;
    • there is no information after following the provided links (since the real hosts were disabled, the "Problems" screen is empty for them);
    • it is not evident which hosts need fixing at the moment;
    • in other words: we see the problem names, but we do not see where these problems are (or how to localize them).
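Until the frontend shows this, the missing host details can be fetched over the API. A hedged sketch, assuming a Zabbix >= 6.0 server; the event IDs and token are placeholders, and only the JSON-RPC body for `event.get` with `selectHosts` is built:

```python
import json

def event_hosts_payload(eventids, auth_token):
    """Build the JSON-RPC body for event.get with selectHosts, so the
    host behind each root-cause event can be shown next to its name."""
    return {
        "jsonrpc": "2.0",
        "method": "event.get",
        "params": {
            "eventids": eventids,  # root-cause event IDs (placeholders)
            # host "status" lets us flag disabled hosts: 0 = monitored,
            # 1 = unmonitored (disabled)
            "selectHosts": ["hostid", "host", "status"],
        },
        "auth": auth_token,        # placeholder API token
        "id": 1,
    }

payload = event_hosts_payload(["12345"], "TOKEN")
print(json.dumps(payload, indent=2))
```

Whether host details survive for events of since-disabled hosts may depend on housekeeping settings, so treat this as a starting point rather than a guaranteed workaround.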
Comment by WytheNet [ 2022 Dec 29 ]

I agree with what Constantin Oshmyan said. It bothers me that a disabled host is still included in the SLA calculation even though it was deactivated as expected. I suspect this is a bug.

Generated at Sat Jun 07 12:29:07 EEST 2025 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.