New Feature Request
Resolution: Unresolved
Major
This ticket describes my wish from the conference talk "Generating Maps and Hosts from Topological Data" (https://www.youtube.com/watch?v=Sv0ZV05N5oI).
=Scenario=
A Zabbix server is connected to a monitored server via a switch:
Z-N-S
Z ... Zabbix server
N ... Networking equipment (switch)
S ... Monitored server
The switch is maintained by a network administrator (NA), and the server is maintained by a server administrator (SA). S hosts a website for a customer.
=A little story=
In the middle of the night, SA receives a message on his phone, claiming S is unreachable. What happened? Various things go through his mind: Did somebody mess up the firewall, the nameserver? Did somebody pull a cable or did the switch break? Did the machine or the network stack crash? Is it actually the Zabbix server that became disconnected from this part of the network?
Two minutes later the customer is on the phone, asking the drowsy SA what's wrong with his website. At this point the SA can't answer the customer's question. He doesn't even know whether he can do anything about it. He only knows the problem is apparently real. If he's lucky, he'll reach an NA who can help him find out.
Meanwhile in the network operations center: NA gets a message on the dashboard, saying N is unreachable. He doesn't know the topology very well and therefore doesn't know off the top of his head which hosts are affected by this outage. He also has no idea what services the machines behind the broken switch run. He has no immediate sense of urgency besides the trigger severity. If he's lucky, somebody's maintaining some additional information, accessible in the enterprise wiki.
This little story assumes that no trigger dependency is used. The trigger dependency mechanism could have been used to not notify the SA at all, because S depends on N. But that wouldn't have helped the SA, because the customer would still have called him up and he'd have been even more clueless.
=What if ...=
What if the SA had gotten a message saying:
"Server S is unreachable. This is due to an outage of N."
Reading it, he would have known he can't do anything about the problem. The website would most likely come back online once N is working again. He could have answered the customer: "We have a network problem, NOC is working on it, have a good night!".
Also, what if the NA had a dynamic map reflecting dependency? Optionally, if he cares, he could also receive messages like the one below:
"Switch N is unreachable. Therefore S is unreachable too."
NA: "Gosh, S is an important server, I better hurry!"
If you're creative with triggers and trigger dependency, you can visually reflect affected hosts on static maps and get this kind of notification. It only works in specific topologies though, and it may break silently when you delete a host. This is not how it should be done.
=What would be necessary?=
I believe all necessary information already exists in the Zabbix database. Zabbix knows about the whole dependency chain. Instead of only hiding subsequent problems, it should also be possible to send notifications for subsequent problems that include the root problem, and to mention the subsequent problems in the root problem's message.
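To make the idea concrete, here is a minimal sketch of how both messages could be derived from a dependency chain. It is not Zabbix code: the `DEPENDS_ON` mapping, the function names, and the message wording are hypothetical and only mirror the Z-N-S scenario above. It assumes the set of currently unreachable hosts is already known.

```python
from typing import Dict, List, Set

# Hypothetical model of the scenario: each host lists the hosts it
# depends on (its upstream neighbours towards the Zabbix server).
DEPENDS_ON: Dict[str, List[str]] = {
    "S": ["N"],  # server S is reached through switch N
    "N": [],     # switch N is attached directly to the Zabbix server
}


def root_causes(host: str, unreachable: Set[str],
                depends_on: Dict[str, List[str]]) -> Set[str]:
    """Follow the dependency chain upwards from `host` and return the
    unreachable hosts that have no unreachable dependency themselves."""
    roots: Set[str] = set()
    seen: Set[str] = set()
    stack = [host]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        down_parents = [p for p in depends_on.get(current, []) if p in unreachable]
        if down_parents:
            stack.extend(down_parents)
        elif current != host:
            roots.add(current)
    return roots


def affected_hosts(root: str, unreachable: Set[str],
                   depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return unreachable hosts that depend, directly or indirectly, on `root`."""
    result: Set[str] = set()
    frontier = {root}
    while frontier:
        frontier = {
            h for h, parents in depends_on.items()
            if h in unreachable and h != root and h not in result
            and any(p in frontier for p in parents)
        }
        result |= frontier
    return result


def message_for_dependent(host: str, unreachable: Set[str],
                          depends_on: Dict[str, List[str]]) -> str:
    """Message for the SA: name the root problem, if there is one."""
    roots = sorted(root_causes(host, unreachable, depends_on))
    if roots:
        return f"{host} is unreachable. This is due to an outage of {', '.join(roots)}."
    return f"{host} is unreachable."


def message_for_root(host: str, unreachable: Set[str],
                     depends_on: Dict[str, List[str]]) -> str:
    """Message for the NA: name the hosts affected by this outage."""
    affected = sorted(affected_hosts(host, unreachable, depends_on))
    if affected:
        verb = "is" if len(affected) == 1 else "are"
        return f"{host} is unreachable. Therefore {', '.join(affected)} {verb} unreachable too."
    return f"{host} is unreachable."


down = {"S", "N"}
print(message_for_dependent("S", down, DEPENDS_ON))
print(message_for_root("N", down, DEPENDS_ON))
```

With both S and N unreachable, the first call produces the message the SA would like to see and the second the one for the NA.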
=Known limitations=
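Trigger dependencies are combined with a logical OR: a trigger is suppressed as soon as any one of the triggers it depends on is in a problem state. This might not be ideal for modelling topology, for example when a host is reachable over redundant paths.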
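The following small sketch illustrates why OR semantics can be a poor fit for topology. It assumes a hypothetical host with two redundant uplinks, N1 and N2, which are not part of the scenario above. The function names are made up, and `explained_if_all_down` is only an imagined alternative, not an existing Zabbix feature.

```python
from typing import Dict


def suppressed_if_any_down(uplink_down: Dict[str, bool]) -> bool:
    """OR semantics: one failed dependency is enough to hide the problem."""
    return any(uplink_down.values())


def explained_if_all_down(uplink_down: Dict[str, bool]) -> bool:
    """Hypothetical AND semantics: the network explains the outage only
    if every path to the host is down."""
    return bool(uplink_down) and all(uplink_down.values())


# Hypothetical redundant topology: N1 has failed, N2 still works.
uplinks = {"N1": True, "N2": False}

print(suppressed_if_any_down(uplinks))  # True
print(explained_if_all_down(uplinks))   # False
```

In this situation a problem on the host cannot be caused by the network alone, because N2 still provides a path. With OR semantics it would nevertheless be hidden behind the N1 outage.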