ZABBIX BUGS AND ISSUES
ZBX-4774

High CPU load with PostgreSQL monitoring 5000 discovered SNMP items

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 2.0.0rc1
    • Fix Version/s: None
    • Component/s: Server (S)
    • Labels:
    • Environment:

      Description

      Hi, I'm using PostgreSQL and the new 2.0 low-level discovery to monitor some switches.
      Everything is fine with most of them: the items are discovered and monitored without much load on the server. But one is a 9-unit stack with hundreds of ports.
      After discovery I have 5173 items and 2792 triggers. The database starts to eat CPU (more than 80% all the time) and the Zabbix queue starts to fill up, losing events...
      The very strange thing is that this happens even if I keep all the items disabled! Just adding them to the host kills the db.
      This didn't happen with 1.8 monitoring the same items (with a script-generated template).

        Activity

        Oleksiy Zagorskyi added a comment -

        Which update interval is the discovery rule using at the moment?
        How many similar discovery rules (for big switches) do you have?
        I suppose a large update interval, such as one hour or even more, should be used in cases like this.

        Cristian Mammoli added a comment -

        Port {#SNMPVALUE} AdminStatus  IF-MIB-ifAdminStatus.[{#SNMPINDEX}]  1800  30  365  SNMPv2 agent  Enabled  Network
        Port {#SNMPVALUE} Alias  IF-MIB-ifAlias.[{#SNMPINDEX}]  1800  30  SNMPv2 agent  Enabled  Network
        Port {#SNMPVALUE} Collisions  .1.3.6.1.4.1.9.2.2.1.1.25.[{#SNMPINDEX}]  1800  30  365  SNMPv2 agent  Enabled  Network
        Port {#SNMPVALUE} Description  IF-MIB-ifDescr.[{#SNMPINDEX}]  1800  30  SNMPv2 agent  Enabled  Network
        Port {#SNMPVALUE} InCRC  .1.3.6.1.4.1.9.2.2.1.1.12.[{#SNMPINDEX}]  1800  30  365  SNMPv2 agent  Enabled  Network
        Port {#SNMPVALUE} InErrors  IF-MIB-ifInErrors.[{#SNMPINDEX}]  1800  30  365  SNMPv2 agent  Enabled  Network
        Port {#SNMPVALUE} InNUcastPkts  IF-MIB-ifInNUcastPkts.[{#SNMPINDEX}]  180  30  365  SNMPv2 agent  Enabled  Network
        Port {#SNMPVALUE} InOctets  IF-MIB-ifHCInOctets.[{#SNMPINDEX}]  180  30  365  SNMPv2 agent  Enabled  Network
        Port {#SNMPVALUE} InUcastPkts  IF-MIB-ifHCInUcastPkts.[{#SNMPINDEX}]  180  30  365  SNMPv2 agent  Enabled  Network
        Port {#SNMPVALUE} OperStatus  IF-MIB-ifOperStatus.[{#SNMPINDEX}]  1800  30  365  SNMPv2 agent  Enabled  Network
        Port {#SNMPVALUE} OutErrors  IF-MIB-ifOutErrors.[{#SNMPINDEX}]  1800  30  365  SNMPv2 agent  Enabled  Network
        Port {#SNMPVALUE} OutNUcastPkts  IF-MIB-ifOutNUcastPkts.[{#SNMPINDEX}]  180  30  365  SNMPv2 agent  Enabled  Network
        Port {#SNMPVALUE} OutOctets  IF-MIB-ifHCOutOctets.[{#SNMPINDEX}]  180  30  365  SNMPv2 agent  Enabled  Network
        Port {#SNMPVALUE} OutUcastPkts  IF-MIB-ifHCOutUcastPkts.[{#SNMPINDEX}]  180  30  365  SNMPv2 agent  Enabled  Network
        Port {#SNMPVALUE} Speed  IF-MIB-ifSpeed.[{#SNMPINDEX}]  3600  30  365  SNMPv2 agent  Enabled  Network

        Here you are: 180 secs for most checks. I don't think the interval is the issue anyway; as I said, the high load happens even with all items disabled! And what's the point of monitoring port traffic ONCE an hour?
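
To put the item intervals above in perspective, here is a rough sketch of the polling rate they imply. The interval values are taken from the item prototype list in this comment. The port count is a hypothetical placeholder (the ticket gives 5173 items but not the number of ports), and the model simply assumes every prototype is polled for every port at its configured interval.

```python
from collections import Counter

# Update intervals in seconds for the 15 item prototypes listed in the comment.
PROTOTYPE_INTERVALS = {
    "AdminStatus": 1800, "Alias": 1800, "Collisions": 1800,
    "Description": 1800, "InCRC": 1800, "InErrors": 1800,
    "InNUcastPkts": 180, "InOctets": 180, "InUcastPkts": 180,
    "OperStatus": 1800, "OutErrors": 1800, "OutNUcastPkts": 180,
    "OutOctets": 180, "OutUcastPkts": 180, "Speed": 3600,
}


def values_per_second(ports: int) -> float:
    """Expected new values per second if each prototype is polled for each port."""
    per_port = sum(1.0 / interval for interval in PROTOTYPE_INTERVALS.values())
    return ports * per_port


if __name__ == "__main__":
    # How many prototypes use each interval.
    print(Counter(PROTOTYPE_INTERVALS.values()))
    # Hypothetical port count, NOT taken from the ticket.
    hypothetical_ports = 300
    print(f"{values_per_second(hypothetical_ports):.2f} values/s")
```

Under these assumptions a few hundred ports generate on the order of ten values per second, which would not by itself explain a database pinned at 80% CPU.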

        Cristian Mammoli added a comment -

        Sorry, I didn't understand that you meant the discovery rule interval and not the item intervals! It's 3600 secs.

        Oleksiy Zagorskyi added a comment - - edited

        Cristian, I meant the discovery rule, not the item prototypes. They are different things.
        The update interval for discovery rules is VERY important for your big switches.

        See this text:
        The field “Update interval (in sec)” specifies how often Zabbix performs discovery. In the beginning, when you are just setting up file system discovery, you might wish to set it to a small interval, but once you know it works you can set it to 30 minutes or more, because file systems usually do not change very often.
        from:
        http://www.zabbix.com/documentation/2.0/manual/discovery/low_level_discovery

        Cristian Mammoli added a comment -

        I replied right above you, but I don't have "load spikes" every hour: the load is continuous. I can set the discovery interval to something like 86400 and see if things get better.

        Oleksiy Zagorskyi added a comment -

        > but I don't have "load spikes" every hour: the load is continuous. I can set the discovery interval to something like 86400 and see if things get better

        Yes, try it.
        5173 discovered items (I suppose it's the count of interfaces, maybe even filtered) is no joke; it is quite a serious task.

        Cristian Mammoli added a comment -

        zabbix 2.0 cpu load with postgresql

        Cristian Mammoli added a comment -

        As you can see from the attached image, as soon as I added the switch the load skyrocketed and didn't drop for many hours, so I don't think the discovery interval is the issue here.

        As a test I dumped all the configuration and reimported it into MySQL. The load dropped from 1.6 to 0.3. I'll keep testing with MySQL, but I think the problem is linked to pgsql.

        Thanks

        Cristian Mammoli added a comment -

        I was being too optimistic: I still have heavy load with MySQL, but the situation is way better, with an average load around 0.7 as soon as I add the switch to the template. I still have the pgsql db in place, so if you need some data just ask.

        Cristian Mammoli added a comment -

        Well, shame on me: I had a flexible interval of 50 secs on the discovery rule. I removed it and the load is now around 0.2. Thanks, Oleksiy, for your time, and sorry again. You can close.
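
For a sense of scale, the following sketch compares how often discovery runs at the intervals mentioned in this thread (50, 3600 and 86400 secs). The item and trigger counts come from the description. The workload model is only an assumption: it supposes the flexible interval was active around the clock and that each run re-examines every discovered item and trigger once, which this ticket does not confirm.

```python
SECONDS_PER_DAY = 86_400
ITEMS = 5173      # discovered items, from the issue description
TRIGGERS = 2792   # discovered triggers, from the issue description


def discovery_runs_per_day(interval_s: int) -> int:
    """Number of full discovery runs that fit in one day at the given interval."""
    return SECONDS_PER_DAY // interval_s


def entities_rechecked_per_day(interval_s: int) -> int:
    """Items plus triggers revisited per day.

    Assumption: every run re-examines each discovered item and trigger once.
    """
    return discovery_runs_per_day(interval_s) * (ITEMS + TRIGGERS)


if __name__ == "__main__":
    for interval in (50, 3600, 86_400):
        runs = discovery_runs_per_day(interval)
        rechecked = entities_rechecked_per_day(interval)
        print(f"{interval:>6} s: {runs:>5} runs/day, {rechecked:>10} entities/day")
```

Under that model a 50 sec interval means 72 times as many discovery runs as a 3600 sec interval, which is consistent with the load dropping once the flexible interval was removed.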

        Oleksiy Zagorskyi added a comment -

        Issue closed as Cannot reproduce

        Oleksiy Zagorskyi added a comment -

        Cristian, btw, it would be interesting to know how much time the Zabbix server spends discovering and processing a single discovery rule that creates 5173 items and 2792 triggers.
        Could you measure it somehow?

        Cristian Mammoli added a comment -

        Well, I can create a new empty db and import only the discovery template. I'll do some tests this evening and let you know.

        Bye

        Cristian Mammoli added a comment -

        I created an empty db and populated it with the schema and so on, then I stopped zabbix_server and started it with the new db.
        I imported the discovery template, added the switch to it and logged everything with tcpdump. Surprisingly, it took only 2 seconds to snmpwalk the switch and populate the items:

        2012-03-21 20:43:04.946398 IP srvzabbix.xxxxx.xx.52670 > sw3570racka.xxxxx.xx.snmp: GetNextRequest(30) 31.1.1.1.1
        ...
        2012-03-21 20:43:06.160733 IP sw3570racka.xxxxx.xx.snmp > srvzabbix.xxxxx.xx.52670: GetResponse(35) 31.1.1.1.1.14501="Nu0"

        So I don't understand why a discovery every 50 secs was putting so much load on the db.
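
The elapsed time between the two packets shown in the capture can be checked directly from their timestamps. This is just a small sketch using the two timestamps quoted above; it measures only the SNMP exchange on the wire, not any database work that follows it.

```python
from datetime import datetime

# tcpdump timestamp layout as it appears in the capture above.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def elapsed_seconds(first: str, last: str) -> float:
    """Seconds between two tcpdump-style timestamps."""
    start = datetime.strptime(first, TIMESTAMP_FORMAT)
    end = datetime.strptime(last, TIMESTAMP_FORMAT)
    return (end - start).total_seconds()


if __name__ == "__main__":
    walk = elapsed_seconds(
        "2012-03-21 20:43:04.946398",  # first GetNextRequest shown
        "2012-03-21 20:43:06.160733",  # last GetResponse shown
    )
    print(f"SNMP walk took about {walk:.3f} s")
```

The two packets are roughly 1.2 seconds apart, in line with the "only 2 seconds" estimate.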

        Oleksiy Zagorskyi added a comment -

        You observed only the network traffic; it would probably be more accurate to watch CPU utilization after that SNMP walk.
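
One way to follow this suggestion is to record the system load average for a while after discovery is triggered. The sketch below is a generic helper, not part of Zabbix: the function name and sampling parameters are made up for illustration, and it relies on `os.getloadavg()`, which is available on Unix-like systems only.

```python
import os
import time


def sample_load(duration_s: float, step_s: float = 1.0):
    """Collect (unix_time, 1-minute load average) pairs for duration_s seconds.

    Always takes at least one sample, even if duration_s is zero.
    """
    samples = []
    deadline = time.monotonic() + duration_s
    while True:
        samples.append((time.time(), os.getloadavg()[0]))
        if time.monotonic() >= deadline:
            break
        time.sleep(step_s)
    return samples


if __name__ == "__main__":
    # Hypothetical usage: start this right after the discovery rule fires
    # and watch the load for one minute.
    for stamp, load in sample_load(duration_s=60, step_s=5):
        print(f"{stamp:.0f} load1={load:.2f}")
```

Running it on the database host rather than only on the Zabbix server would show whether the load comes from the database, as the description suggests.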


          People

          • Assignee: Unassigned
          • Reporter: Cristian Mammoli
          • Votes: 0
          • Watchers: 2
