[ZBX-4774] High cpu load with postgresql monitoring 5000 discovered snmp items Created: 2012 Mar 19  Updated: 2017 May 30  Resolved: 2012 Mar 20

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Server (S)
Affects Version/s: 2.0.0rc1
Fix Version/s: None

Type: Incident report Priority: Major
Reporter: Cristian Mammoli Assignee: Unassigned
Resolution: Cannot Reproduce Votes: 0
Labels: lld
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

OS: CentOS 6.2 x86_64
DB: postgresql-server-8.4.9
Hardware: VMWare virtual machine with 2GB RAM and 2 vCPU

Zabbix server is running                                   Yes    localhost:10051
Number of hosts (monitored/not monitored/templates)        129    47 / 0 / 82
Number of items (monitored/disabled/not supported)         8344   4235 / 3533 / 576
Number of triggers (enabled/disabled)[problem/unknown/ok]  3360   3354 / 6 [39 / 0 / 3315]
Required server performance, new values per second         67.04  -


Attachments: PNG File screen.png    

 Description   

Hi, I'm using PostgreSQL and the new 2.0 low-level discovery to monitor some switches.
Everything is fine with most of them: the items are discovered and monitored without much load on the server. But one is a 9-unit stack with hundreds of ports.
After discovery I have 5173 items and 2792 triggers. The database starts to eat CPU (more than 80% all the time) and the Zabbix queue starts to fill up, losing events...
The very strange thing is that this happens even if I keep all the items disabled! Just adding them to the host kills the db.
This didn't happen with 1.8 monitoring the same items (with a script-generated template).



 Comments   
Comment by Oleksii Zagorskyi [ 2012 Mar 19 ]

Which update interval for the discovery rule is used at the moment?
How many similar discovery rules (for big switches) do you have?
I suppose a large update interval, such as one hour or even more, should be used in cases like this.

Comment by Cristian Mammoli [ 2012 Mar 19 ]

Port {#SNMPVALUE} AdminStatus    IF-MIB-ifAdminStatus.[{#SNMPINDEX}]       1800  30  365  SNMPv2 agent  Enabled  Network
Port {#SNMPVALUE} Alias          IF-MIB-ifAlias.[{#SNMPINDEX}]             1800  30       SNMPv2 agent  Enabled  Network
Port {#SNMPVALUE} Collisions     .1.3.6.1.4.1.9.2.2.1.1.25.[{#SNMPINDEX}]  1800  30  365  SNMPv2 agent  Enabled  Network
Port {#SNMPVALUE} Description    IF-MIB-ifDescr.[{#SNMPINDEX}]             1800  30       SNMPv2 agent  Enabled  Network
Port {#SNMPVALUE} InCRC          .1.3.6.1.4.1.9.2.2.1.1.12.[{#SNMPINDEX}]  1800  30  365  SNMPv2 agent  Enabled  Network
Port {#SNMPVALUE} InErrors       IF-MIB-ifInErrors.[{#SNMPINDEX}]          1800  30  365  SNMPv2 agent  Enabled  Network
Port {#SNMPVALUE} InNUcastPkts   IF-MIB-ifInNUcastPkts.[{#SNMPINDEX}]      180   30  365  SNMPv2 agent  Enabled  Network
Port {#SNMPVALUE} InOctets       IF-MIB-ifHCInOctets.[{#SNMPINDEX}]        180   30  365  SNMPv2 agent  Enabled  Network
Port {#SNMPVALUE} InUcastPkts    IF-MIB-ifHCInUcastPkts.[{#SNMPINDEX}]     180   30  365  SNMPv2 agent  Enabled  Network
Port {#SNMPVALUE} OperStatus     IF-MIB-ifOperStatus.[{#SNMPINDEX}]        1800  30  365  SNMPv2 agent  Enabled  Network
Port {#SNMPVALUE} OutErrors      IF-MIB-ifOutErrors.[{#SNMPINDEX}]         1800  30  365  SNMPv2 agent  Enabled  Network
Port {#SNMPVALUE} OutNUcastPkts  IF-MIB-ifOutNUcastPkts.[{#SNMPINDEX}]     180   30  365  SNMPv2 agent  Enabled  Network
Port {#SNMPVALUE} OutOctets      IF-MIB-ifHCOutOctets.[{#SNMPINDEX}]       180   30  365  SNMPv2 agent  Enabled  Network
Port {#SNMPVALUE} OutUcastPkts   IF-MIB-ifHCOutUcastPkts.[{#SNMPINDEX}]    180   30  365  SNMPv2 agent  Enabled  Network
Port {#SNMPVALUE} Speed          IF-MIB-ifSpeed.[{#SNMPINDEX}]             3600  30  365  SNMPv2 agent  Enabled  Network

Here you are, 180 secs for most checks. I don't think that the interval is the issue anyway; as I said, the high load happens even with all items disabled! And what's the point of monitoring port traffic ONCE an hour?
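
For a rough sense of the polling load these item prototypes would generate once enabled, the per-port rate can be worked out from the intervals listed in this comment. This is only a minimal sketch: the interval counts are read off the list above (8 prototypes at 1800 s, 6 at 180 s, 1 at 3600 s), while the port count of 345 is an assumption (5173 items divided by 15 prototypes, rounded), not a figure stated in the report.

```python
# Update intervals (seconds) of the 15 item prototypes listed above,
# mapped to how many prototypes use each interval.
PROTOTYPE_INTERVALS = {1800: 8, 180: 6, 3600: 1}


def values_per_second_per_port(intervals):
    """Sum of 1/interval over all prototypes: new values per second for one port."""
    return sum(count / interval for interval, count in intervals.items())


def estimated_nvps(ports, intervals=PROTOTYPE_INTERVALS):
    """New values per second for a switch with the given number of ports."""
    return ports * values_per_second_per_port(intervals)


if __name__ == "__main__":
    ports = 345  # assumption: 5173 items / 15 prototypes, rounded
    per_port = values_per_second_per_port(PROTOTYPE_INTERVALS)
    print(f"per port: {per_port:.4f} values/s")
    print(f"{ports} ports: {estimated_nvps(ports):.1f} values/s")
```

Under these assumptions the switch would add on the order of 13 new values per second, which fits the observation that the item intervals alone do not explain the load.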

Comment by Cristian Mammoli [ 2012 Mar 19 ]

Sorry, I didn't understand you meant the discovery rule interval and not the items! It's every 3600 secs

Comment by Oleksii Zagorskyi [ 2012 Mar 19 ]

Cristian, I meant the discovery rule, not the item prototypes. They are different things.
And the update interval for discovery rules is VERY important for your big switches.

See this passage:
The field “Update interval (in sec)” specifies how often Zabbix performs discovery. In the beginning, when you are just setting up file system discovery, you might wish to set it to a small interval, but once you know it works you can set it to 30 minutes or more, because file systems usually do not change very often.
here:
http://www.zabbix.com/documentation/2.0/manual/discovery/low_level_discovery

Comment by Cristian Mammoli [ 2012 Mar 19 ]

I replied right above you, but I don't have "load spikes" every hour: the load is continuous. I can set the discovery interval to something like 86400 and see if things get better.

Comment by Oleksii Zagorskyi [ 2012 Mar 19 ]

> but I don't have "load spikes" very hour: the load is continuous. I can set the discovery interval to something like 86400 and see if things get better

Yes, try it.
5173 discovered items (I suppose that reflects the number of interfaces, maybe even after filtering) is no joke; it is a fairly serious workload.

Comment by Cristian Mammoli [ 2012 Mar 19 ]

zabbix 2.0 cpu load with postgresql

Comment by Cristian Mammoli [ 2012 Mar 19 ]

As you can see from the attached image, as soon as I added the switch the load skyrocketed and didn't drop for many hours, so I don't think the discovery interval is the issue here.

As a test I dumped all the configuration and reimported it into MySQL. The load dropped from 1.6 to 0.3. I'll keep testing with MySQL, but I think the problem is linked to pgsql.

Thanks

Comment by Cristian Mammoli [ 2012 Mar 20 ]

I was being too optimistic. I still have heavy load with MySQL, but the situation is way better: average load around 0.7 as soon as I add the switch to the template. I still have the pgsql db in place, so if you need some data just ask.

Comment by Cristian Mammoli [ 2012 Mar 20 ]

Well, shame on me: I had a flexible interval of 50 secs on the discovery rule. I removed it and the load is now around 0.2. Thanks, Oleksiy, for your time, and sorry again. You can close.
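
To put the 50-second interval in perspective, the number of discovery runs per day can be compared with the intended 3600-second interval. The sketch below is only illustrative and rests on two assumptions that the thread does not confirm: that the flexible interval was in effect around the clock, and that each run has to re-check every discovered item and trigger against the database.

```python
SECONDS_PER_DAY = 86_400


def discovery_runs_per_day(interval_seconds):
    """How many times a discovery rule fires in 24 hours."""
    return SECONDS_PER_DAY // interval_seconds


def entities_processed_per_day(interval_seconds, items, triggers):
    """Items plus triggers re-checked per day, assuming every run touches all of them."""
    return discovery_runs_per_day(interval_seconds) * (items + triggers)


if __name__ == "__main__":
    items, triggers = 5173, 2792  # figures from the issue description
    for interval in (50, 3600):
        runs = discovery_runs_per_day(interval)
        total = entities_processed_per_day(interval, items, triggers)
        print(f"interval {interval:>4}s: {runs:>4} runs/day, {total:,} entities/day")
```

With these numbers the 50-second interval yields 1728 runs per day instead of 24, a factor of 72.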

Comment by Oleksii Zagorskyi [ 2012 Mar 20 ]

Issue closed as Cannot reproduce

Comment by Oleksii Zagorskyi [ 2012 Mar 20 ]

Cristian, by the way, it would be interesting to know how much time the Zabbix server spends discovering and processing a single discovery rule that creates 5173 items and 2792 triggers.
Could you measure that somehow?

Comment by Cristian Mammoli [ 2012 Mar 20 ]

Well I can create a new empty db and only import the discovery template. I'll do some tests this evening and let you know

Bye

Comment by Cristian Mammoli [ 2012 Mar 21 ]

I created an empty db and populated it with the schema and so on, then I stopped zabbix_server and started it with the new db.
I imported the discovery template, added the switch to it and logged everything with tcpdump. Surprisingly, it took only 2 seconds to snmpwalk the switch and populate the items:

2012-03-21 20:43:04.946398 IP srvzabbix.xxxxx.xx.52670 > sw3570racka.xxxxx.xx.snmp: GetNextRequest(30) 31.1.1.1.1
...
2012-03-21 20:43:06.160733 IP sw3570racka.xxxxx.xx.snmp > srvzabbix.xxxxx.xx.52670: GetResponse(35) 31.1.1.1.1.14501="Nu0"

So I don't understand why a discovery every 50 secs was putting so much load on the db.
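
The elapsed time between the first and last captured packets can be checked directly from the tcpdump timestamps. This is a minimal sketch that parses only the two lines quoted above; the lines elided by "..." are not reconstructed, and the helper name is hypothetical.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"

FIRST = ("2012-03-21 20:43:04.946398 IP srvzabbix.xxxxx.xx.52670 > "
         "sw3570racka.xxxxx.xx.snmp: GetNextRequest(30) 31.1.1.1.1")
LAST = ("2012-03-21 20:43:06.160733 IP sw3570racka.xxxxx.xx.snmp > "
        'srvzabbix.xxxxx.xx.52670: GetResponse(35) 31.1.1.1.1.14501="Nu0"')


def elapsed_seconds(first_line, last_line):
    """Seconds between the timestamps that start two tcpdump lines."""
    def parse(line):
        # The timestamp is the first two whitespace-separated fields.
        date_part, clock_part = line.split()[:2]
        return datetime.strptime(f"{date_part} {clock_part}", TS_FORMAT)
    return (parse(last_line) - parse(first_line)).total_seconds()


if __name__ == "__main__":
    print(f"SNMP walk took {elapsed_seconds(FIRST, LAST):.3f} s on the wire")
```

This measures only the time on the network; it says nothing about how long the server and database spend processing the result afterwards.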

Comment by Oleksii Zagorskyi [ 2012 Mar 21 ]

You observed only the network traffic; it would probably be more telling to watch CPU utilization after that SNMP walk.
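
One way to follow this suggestion on a Linux host is to compare two samples of the aggregate cpu line in /proc/stat, taken before and after the SNMP walk. The sketch below is only an illustration: the helper names are hypothetical, and counting iowait as idle time is a choice made here, not something taken from this thread.

```python
def read_cpu_line(path="/proc/stat"):
    """Return the aggregate 'cpu' line from /proc/stat (Linux only)."""
    with open(path) as stat:
        for line in stat:
            if line.startswith("cpu "):
                return line.strip()
    raise RuntimeError("no aggregate cpu line found")


def cpu_busy_fraction(before, after):
    """Fraction of CPU time spent non-idle between two 'cpu' lines."""
    def totals(line):
        # Fields: user nice system idle iowait irq softirq steal ...
        # Only the first eight are summed, since guest time is already in user.
        fields = [int(value) for value in line.split()[1:9]]
        idle = fields[3] + (fields[4] if len(fields) > 4 else 0)
        return sum(fields), idle

    total_before, idle_before = totals(before)
    total_after, idle_after = totals(after)
    total_delta = total_after - total_before
    if total_delta <= 0:
        return 0.0  # no time elapsed between the samples
    return 1.0 - (idle_after - idle_before) / total_delta


# Typical use (not executed here):
#   before = read_cpu_line()
#   ... let the discovery rule run and be processed ...
#   after = read_cpu_line()
#   print(f"busy: {cpu_busy_fraction(before, after):.1%}")
```

This gives a system-wide figure; attributing the load to the database process specifically would need per-process accounting, which is outside this sketch.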
