ZABBIX FEATURE REQUESTS
  ZBXNEXT-1461

Notify dependent triggers with an indication of the root problem plus dynamic dependency maps

    Details

      Description

      This ticket describes my wish from the conference talk "Generating Maps and Hosts from Topological Data" (https://www.youtube.com/watch?v=Sv0ZV05N5oI).

      =Scenario=

      A Zabbix server is connected to a monitored server via a switch

      Z-N-S

      Z ... Zabbix server
      N ... Networking equipment (switch)
      S ... Monitored server

      The switch is maintained by a network administrator (NA), the server is maintained by a server administrator (SA). S hosts a website for a customer.

      =A little story=

      In the middle of the night, SA receives a message on his phone, claiming S is unreachable. What happened? Various things go through his mind: Did somebody mess up the firewall, the nameserver? Did somebody pull a cable or did the switch break? Did the machine or the network stack crash? Is it actually the Zabbix server that became disconnected from this part of the network?

      Two minutes later, the customer is on the phone, asking the drowsy SA what's wrong with his website. At this point, the SA can't answer the customer's question. He doesn't even know whether he can do anything about it. He only knows the problem is apparently real. If he's lucky, he'll reach an NA who can help him find out.

      Meanwhile, in the network operations center: NA gets a message on the dashboard, saying N is unreachable. He doesn't know the topology very well and therefore doesn't know off the top of his head which hosts are affected by this outage. He also has no idea which services run on the machines behind the broken switch. He has no immediate sense of urgency besides the trigger severity. If he's lucky, somebody is maintaining additional information that is accessible in the enterprise wiki.

      This little story assumes that no trigger dependency is used. The trigger dependency mechanism could have been used to not notify the SA at all, because S depends on N. But that wouldn't have helped the SA, because the customer would still have called him and he'd have been even more clueless.

      =What if ...=

      What if the SA had received a message saying:

      "Server S is unreachable. This is due to an outage of N."

      Reading it, he would have known that he can't do anything about the problem. The website would most likely come back online once N is working again. He could have answered the customer: "We have a network problem, NOC is working on it, have a good night!".

      Also, what if the NA had a dynamic map reflecting dependencies? Optionally, he could also receive messages like the one below, if he cares:

      "Switch N is unreachable. Therefore S is unreachable too."

      NA: "Gosh, S is an important server, I better hurry!"

      If you're creative with triggers and trigger dependencies, you can visually reflect affected hosts on static maps and get this kind of notification. It only works in specific topologies, though, and may break silently when you delete a host. This is not how it should be done.

      =What would be necessary?=

      I believe all necessary information already exists in the Zabbix database. Zabbix knows about the whole dependency chain. Instead of only hiding subsequent problems, it should also be possible to send notifications for subsequent problems that include the root problem.
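
      To illustrate the idea, here is a minimal sketch of how both messages from the story could be derived from a dependency chain. It is not Zabbix code: the `DEPENDS_ON` mapping, the trigger names and all function names are hypothetical, and the dependency data is assumed to be available as a plain dictionary.

```python
from collections import defaultdict

# Hypothetical dependency data: trigger -> triggers it depends on.
DEPENDS_ON = {
    "S is unreachable": ["N is unreachable"],
    "N is unreachable": [],
}


def root_problems(trigger, depends_on, in_problem):
    """Walk up the dependency chain and return the top-most failing triggers."""
    roots, seen, stack = set(), {trigger}, [trigger]
    while stack:
        current = stack.pop()
        failing = [p for p in depends_on.get(current, []) if p in in_problem]
        if not failing and current != trigger:
            roots.add(current)  # nothing above it is failing: a root problem
        for parent in failing:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return sorted(roots)


def affected_problems(trigger, depends_on, in_problem):
    """Walk down the chain and return failing triggers that depend on this one."""
    dependents = defaultdict(list)
    for child, parents in depends_on.items():
        for parent in parents:
            dependents[parent].append(child)
    affected, stack = set(), [trigger]
    while stack:
        for child in dependents[stack.pop()]:
            if child in in_problem and child != trigger and child not in affected:
                affected.add(child)
                stack.append(child)
    return sorted(affected)


def notification_text(trigger, depends_on, in_problem):
    """Build a message naming the root problem and the affected triggers."""
    text = trigger + "."
    roots = root_problems(trigger, depends_on, in_problem)
    affected = affected_problems(trigger, depends_on, in_problem)
    if roots:
        text += " This is due to: " + ", ".join(roots) + "."
    if affected:
        text += " Also affected: " + ", ".join(affected) + "."
    return text


if __name__ == "__main__":
    problems = {"S is unreachable", "N is unreachable"}
    for name in sorted(problems):
        print(notification_text(name, DEPENDS_ON, problems))
```

      The same traversal serves both readers: the SA gets the message for S with its root problem, and the NA gets the message for N with everything that depends on it.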

      =Known limitations=

      Trigger dependencies are connected with a logical OR. This might not be ideal for modelling topology.
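
      A small sketch of why OR semantics can mislead for topology, assuming "logical OR" means that a dependent trigger counts as explained as soon as any one of its dependencies is in a problem state. The redundant switches N1 and N2 and both function names are hypothetical and not part of Zabbix.

```python
# Hypothetical topology: S is reachable through either of two switches.
DEPENDS_ON = {"S is unreachable": ["N1 is unreachable", "N2 is unreachable"]}


def explained_by_any(trigger, depends_on, in_problem):
    """OR semantics: one failing dependency is enough to explain the trigger."""
    return any(p in in_problem for p in depends_on.get(trigger, []))


def explained_by_all(trigger, depends_on, in_problem):
    """AND semantics (assumed alternative): every upstream path must have failed."""
    parents = depends_on.get(trigger, [])
    return bool(parents) and all(p in in_problem for p in parents)


if __name__ == "__main__":
    # Only N1 is down, so S should still be reachable through N2.
    problems = {"S is unreachable", "N1 is unreachable"}
    print(explained_by_any("S is unreachable", DEPENDS_ON, problems))
    print(explained_by_all("S is unreachable", DEPENDS_ON, problems))
```

      With only N1 down, the OR variant blames N1 for the outage of S, although S should still be reachable through N2 and probably has a problem of its own. The AND variant would not attribute it to the network in that case.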

        Activity

        Oleksiy Zagorskyi added a comment -

        I'm not sure but ZBX-4744 looks a bit related.

        Raymond Kuiper added a comment -

        I disagree, Volker is suggesting a fundamental change in handling trigger dependencies in actions and maps.

        Volker Fröhlich added a comment -

        Related to ZBXNEXT-1333

        Volker Fröhlich added a comment -

        Loosely connected to ZBXNEXT-1547

        romale added a comment - edited

        >all necessary information already exists in the Zabbix database. Zabbix knows about the whole dependency chain.

        IMHO, in its current state Zabbix does not know about dependencies. First, it would have to build a network tree. I think it should discover which device is a host, which is a switch, a router, etc., and which devices sit on the left and on the right side of each, in order to compile the topology and the relations between devices. It should also know about ports, for root cause discovery and port-to-port mappings. This mechanism requires a model-oriented approach. And since network devices are typically SNMP-based, discovery should rely on SNMP.

        Example.
        Our topology: ZABBIX - switch_port11-SWITCH1-switch_port12 - router_port1-ROUTER1-router_port2 - switch_port21-SWITCH2-switch_port22 - HOST1
        Scenario:
        switch_port21 is disconnected (we receive a LinkUp/Down SNMP trap for router_port2, and we can use this information in correlation).
        Remark: the Zabbix DB may contain the information that router_port2 is connected to switch_port21 (based on MAC address table data gathered during topology discovery), or a higher-level relation: ROUTER1 has SWITCH1 on its left side and SWITCH2 on its right side.
        When HOST1 is unreachable, Zabbix polls SWITCH2 via ICMP, SNMP or TCP: hey, are you OK? No answer. Next step: Zabbix polls ROUTER1. It replies, via ICMP for example, so it is working, and the status of router_port2 is Down.
        So "HOST1 is unreachable" is a secondary issue and should be suppressed (the suppressed message should be emailed to the SA with the root cause "SWITCH2 is down"), and the root cause is "SWITCH2 is down" (this message goes to the NA, together with the affected devices, for example. What if 400 hosts are connected to SWITCH2, a Cisco 6513? The NA should know the scale of the trouble).

        These are my thoughts on network topology discovery, dependencies and some possible implementations.
        IMHO, templates with dependencies alone are not enough today. They invite human error, make maintenance complex and lead to tight monitoring.
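
        The upstream polling described in this comment could be sketched roughly as follows. This is only an illustration under assumptions: the path is taken as already discovered, `is_reachable` is a hypothetical callable standing in for an ICMP, SNMP or TCP check, and none of the names are Zabbix functions.

```python
# Hypothetical discovered path, ordered from the Zabbix server outward.
PATH = ["SWITCH1", "ROUTER1", "SWITCH2", "HOST1"]


def root_cause(path, target, is_reachable):
    """Step from the target toward Zabbix until a device answers.

    The last silent device before the first answering one is reported
    as the root cause. Returns None if the target itself answers.
    """
    root = None
    for device in reversed(path[: path.index(target) + 1]):
        if is_reachable(device):
            break
        root = device
    return root


def secondary_issues(path, root):
    """Devices behind the root cause; their problems are secondary."""
    return path[path.index(root) + 1:]


if __name__ == "__main__":
    silent = {"SWITCH2", "HOST1"}  # stand-in for real poll results
    root = root_cause(PATH, "HOST1", lambda device: device not in silent)
    print("root cause:", root)
    print("affected:", secondary_issues(PATH, root))
```

        The list of devices behind the root cause is what would give the NA a sense of the scale of the outage.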


          People

          • Assignee:
            Alexei Vladishev
            Reporter:
            Volker Fröhlich
          • Votes:
            16
            Watchers:
            10
