ZABBIX BUGS AND ISSUES

High CPU load with PostgreSQL when monitoring 5000 discovered SNMP items

Details

  • Type: Bug
  • Status: Closed
  • Priority: Major
  • Resolution: Cannot Reproduce
  • Affects Version/s: 2.0.0rc1
  • Fix Version/s: None
  • Component/s: Server (S)
  • Labels:
  • Environment:
  • Zabbix ID:
    NA

Description

Hi, I'm using PostgreSQL and the new 2.0 low-level discovery to monitor some switches.
Everything is fine with most of them: the items are discovered and monitored without much load on the server. But one is a 9-unit stack with hundreds of ports.
After discovery I have 5173 items and 2792 triggers. The database starts to eat CPU (more than 80% all the time), and the Zabbix queue starts to fill up, losing events...
The very strange thing is that this happens even if I keep all the items disabled! Just adding them to the host kills the db.
This didn't happen with 1.8 monitoring the same items (with a script-generated template).

Activity

Oleksiy Zagorskyi added a comment -

Which update interval is the discovery rule using at the moment?
How many similar discovery rules (for big switches) do you have?
I suppose a large update interval, such as one hour or even more, should be used for cases like this.

Cristian Mammoli added a comment -

Name                             Key                                       Interval (s)  History (d)  Trends (d)  Type          Status   Applications
Port {#SNMPVALUE} AdminStatus    IF-MIB-ifAdminStatus.[{#SNMPINDEX}]       1800          30           365         SNMPv2 agent  Enabled  Network
Port {#SNMPVALUE} Alias          IF-MIB-ifAlias.[{#SNMPINDEX}]             1800          30           -           SNMPv2 agent  Enabled  Network
Port {#SNMPVALUE} Collisions     .1.3.6.1.4.1.9.2.2.1.1.25.[{#SNMPINDEX}]  1800          30           365         SNMPv2 agent  Enabled  Network
Port {#SNMPVALUE} Description    IF-MIB-ifDescr.[{#SNMPINDEX}]             1800          30           -           SNMPv2 agent  Enabled  Network
Port {#SNMPVALUE} InCRC          .1.3.6.1.4.1.9.2.2.1.1.12.[{#SNMPINDEX}]  1800          30           365         SNMPv2 agent  Enabled  Network
Port {#SNMPVALUE} InErrors       IF-MIB-ifInErrors.[{#SNMPINDEX}]          1800          30           365         SNMPv2 agent  Enabled  Network
Port {#SNMPVALUE} InNUcastPkts   IF-MIB-ifInNUcastPkts.[{#SNMPINDEX}]      180           30           365         SNMPv2 agent  Enabled  Network
Port {#SNMPVALUE} InOctets       IF-MIB-ifHCInOctets.[{#SNMPINDEX}]        180           30           365         SNMPv2 agent  Enabled  Network
Port {#SNMPVALUE} InUcastPkts    IF-MIB-ifHCInUcastPkts.[{#SNMPINDEX}]     180           30           365         SNMPv2 agent  Enabled  Network
Port {#SNMPVALUE} OperStatus     IF-MIB-ifOperStatus.[{#SNMPINDEX}]        1800          30           365         SNMPv2 agent  Enabled  Network
Port {#SNMPVALUE} OutErrors      IF-MIB-ifOutErrors.[{#SNMPINDEX}]         1800          30           365         SNMPv2 agent  Enabled  Network
Port {#SNMPVALUE} OutNUcastPkts  IF-MIB-ifOutNUcastPkts.[{#SNMPINDEX}]     180           30           365         SNMPv2 agent  Enabled  Network
Port {#SNMPVALUE} OutOctets      IF-MIB-ifHCOutOctets.[{#SNMPINDEX}]       180           30           365         SNMPv2 agent  Enabled  Network
Port {#SNMPVALUE} OutUcastPkts   IF-MIB-ifHCOutUcastPkts.[{#SNMPINDEX}]    180           30           365         SNMPv2 agent  Enabled  Network
Port {#SNMPVALUE} Speed          IF-MIB-ifSpeed.[{#SNMPINDEX}]             3600          30           365         SNMPv2 agent  Enabled  Network

Here you are: 180 secs for most checks. I don't think the interval is the issue anyway; as I said, the high load happens even with all items disabled! And what's the point of monitoring port traffic only ONCE an hour?
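
As a rough sanity check on the polling load implied by these item intervals, here is a minimal Python sketch. It assumes that all 5173 discovered items come from the 15 prototypes listed in this comment and that every port gets every prototype; neither is confirmed in the thread, so the port count and the rate are only estimates.

```python
# Poll intervals in seconds, copied from the item prototype list in this comment.
PROTOTYPE_INTERVALS = {
    "AdminStatus": 1800,
    "Alias": 1800,
    "Collisions": 1800,
    "Description": 1800,
    "InCRC": 1800,
    "InErrors": 1800,
    "InNUcastPkts": 180,
    "InOctets": 180,
    "InUcastPkts": 180,
    "OperStatus": 1800,
    "OutErrors": 1800,
    "OutNUcastPkts": 180,
    "OutOctets": 180,
    "OutUcastPkts": 180,
    "Speed": 3600,
}

ITEMS_DISCOVERED = 5173  # figure from the issue description


def values_per_second(intervals, ports):
    """Expected new values per second if every port has every prototype."""
    per_port = sum(1.0 / seconds for seconds in intervals.values())
    return per_port * ports


# Assumption: all discovered items were created from the prototypes above.
approx_ports = ITEMS_DISCOVERED / len(PROTOTYPE_INTERVALS)
rate = values_per_second(PROTOTYPE_INTERVALS, approx_ports)

print(f"about {approx_ports:.0f} ports, about {rate:.1f} new values per second")
```

Under these assumptions the item polling itself works out to only a modest number of values per second, which fits the observation that the load persists even with all items disabled.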

Cristian Mammoli added a comment -

Sorry, I didn't understand you meant the discovery rule interval and not the items! It's every 3600 secs

Oleksiy Zagorskyi added a comment - - edited

Cristian, I meant the discovery rule, not the item prototypes. They are different things.
And the update interval for discovery rules is VERY important for your big switches.

See this passage in the documentation:
The field “Update interval (in sec)” specifies how often Zabbix performs discovery. In the beginning, when you are just setting up file system discovery, you might wish to set it to a small interval, but once you know it works you can set it to 30 minutes or more, because file systems usually do not change very often.
Source:
http://www.zabbix.com/documentation/2.0/manual/discovery/low_level_discovery

Cristian Mammoli added a comment -

I replied right above you, but I don't have "load spikes" every hour: the load is continuous. I can set the discovery interval to something like 86400 and see if things get better.

Oleksiy Zagorskyi added a comment -

> but I don't have "load spikes" every hour: the load is continuous. I can set the discovery interval to something like 86400 and see if things get better

Yes, try it.
5173 discovered items (I suppose that reflects the number of interfaces, maybe even filtered) is no joke; it is a fairly serious task.

Cristian Mammoli added a comment -

[Attached image: zabbix 2.0 cpu load with postgresql]

Cristian Mammoli added a comment -

As you can see from the attached image, as soon as I added the switch the load skyrocketed and didn't drop for many hours, so I don't think the discovery interval is the issue here.

As a test I dumped all the configuration and reimported it into MySQL. The load dropped from 1.6 to 0.3. I'll keep testing with MySQL, but I think the problem is linked to pgsql.

Thanks

Cristian Mammoli added a comment -

I was being too optimistic: I still have heavy load with MySQL, but the situation is way better, with an average load around 0.7 as soon as I add the switch to the template. I still have the pgsql db in place, so if you need some data just ask.

Cristian Mammoli added a comment -

Well, shame on me: I had a flexible interval of 50 secs on the discovery rule. I removed it and the load is now around 0.2. Thanks, Oleksiy, for your time, and again, sorry. You can close.
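
To put the 50-second flexible interval in perspective, here is a small Python sketch of the arithmetic. It assumes the flexible interval was active around the clock and that each discovery run has to re-check every previously discovered item and trigger; both are assumptions for illustration, not facts confirmed in this thread.

```python
SECONDS_PER_HOUR = 3600

ITEMS = 5173      # discovered items, from the issue description
TRIGGERS = 2792   # discovered triggers, from the issue description


def runs_per_hour(interval_seconds):
    """Number of discovery runs per hour for a given interval."""
    return SECONDS_PER_HOUR / interval_seconds


configured_runs = runs_per_hour(3600)  # the regular update interval
flexible_runs = runs_per_hour(50)      # the forgotten flexible interval

# Assumption: every run re-checks all discovered items and triggers.
entities_per_run = ITEMS + TRIGGERS
checks_per_hour = flexible_runs * entities_per_run

print(f"runs per hour: {configured_runs:.0f} configured vs {flexible_runs:.0f} actual")
print(f"item/trigger checks per hour at 50 s: {checks_per_hour:.0f}")
```

Under these assumptions the rule ran 72 times as often as intended, which would explain a continuous load rather than hourly spikes.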

Oleksiy Zagorskyi added a comment -

Issue closed as Cannot reproduce

Oleksiy Zagorskyi added a comment -

Cristian, btw, it would be interesting to know how much time the Zabbix server spends discovering and processing a single discovery rule that creates 5173 items and 2792 triggers.
Could you somehow measure it?

Cristian Mammoli added a comment -

Well I can create a new empty db and only import the discovery template. I'll do some tests this evening and let you know

Bye

Cristian Mammoli added a comment -

I created an empty db and populated it with the schema and so on, then I stopped zabbix_server and started it with the new db.
I imported the discovery template, added the switch to it and logged everything with tcpdump. Surprisingly, it took less than 2 seconds to snmpwalk the switch and populate the items:

2012-03-21 20:43:04.946398 IP srvzabbix.xxxxx.xx.52670 > sw3570racka.xxxxx.xx.snmp: GetNextRequest(30) 31.1.1.1.1
...
2012-03-21 20:43:06.160733 IP sw3570racka.xxxxx.xx.snmp > srvzabbix.xxxxx.xx.52670: GetResponse(35) 31.1.1.1.1.14501="Nu0"

So I don't understand why with a discovery every 50 secs it was putting so much load on the db
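
For reference, the elapsed time between the two tcpdump lines shown can be computed with a short Python sketch. It assumes those two lines are the first request and the last response of the walk; the capture in between is elided in the comment.

```python
from datetime import datetime

# Timestamps copied from the two tcpdump lines quoted in this comment.
FIRST_REQUEST = "2012-03-21 20:43:04.946398"
LAST_RESPONSE = "2012-03-21 20:43:06.160733"

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def elapsed_seconds(start, end):
    """Seconds between two tcpdump-style timestamps."""
    started = datetime.strptime(start, TIMESTAMP_FORMAT)
    ended = datetime.strptime(end, TIMESTAMP_FORMAT)
    return (ended - started).total_seconds()


walk_seconds = elapsed_seconds(FIRST_REQUEST, LAST_RESPONSE)
print(f"SNMP walk took about {walk_seconds:.3f} s")
```

This only measures time on the wire. It says nothing about how long the server and database spend creating or updating the items afterwards.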

Oleksiy Zagorskyi added a comment -

You observed only the network traffic; it would probably be more accurate to watch CPU utilization after that SNMP walk.
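
One possible way to do that, sketched in Python: sample the system load for a while after the discovery run. This uses the 1-minute load average from `os.getloadavg()` (Unix only) as a rough stand-in for CPU utilization, which is a simplification; the function name and the sampling parameters are hypothetical.

```python
import os
import time


def sample_load(samples, pause_seconds):
    """Collect (unix_time, one_minute_load_average) pairs at a fixed pause."""
    readings = []
    for _ in range(samples):
        one_minute_load = os.getloadavg()[0]
        readings.append((time.time(), one_minute_load))
        time.sleep(pause_seconds)
    return readings


if __name__ == "__main__":
    # Short demo run; a real measurement would sample for much longer.
    for timestamp, load in sample_load(samples=3, pause_seconds=1.0):
        print(f"{timestamp:.0f} load={load:.2f}")
```

Starting the sampling just before the discovery run and comparing against an idle baseline would show how long the database stays busy after the walk itself has finished.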
