[ZBX-9532] VMware connection issues may apparently cause busy history syncers Created: 2015 May 02  Updated: 2017 May 30  Resolved: 2015 Nov 23

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Proxy (P), Server (S)
Affects Version/s: 2.2.9
Fix Version/s: None

Type: Incident report Priority: Major
Reporter: Marc Assignee: Unassigned
Resolution: Duplicate Votes: 0
Labels: performance, vmware
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File OkDatabaseTuplesFetched.png     PNG File OkServerBusy.png     PNG File ProblemDatabaseTuplesFetched.png     PNG File ProblemDatabaseTuplesInserts.png     PNG File ProblemProxyCache.png     PNG File ProblemProxyNvps.png     PNG File ProblemServerBusy.png     PNG File ProblemServerCache.png     PNG File ProblemServerNvps.png     PNG File zbx9532-20150601.png    
Issue Links:
Duplicate
duplicates ZBXNEXT-3051 Count of actions has a significant im... Closed

 Description   

Time-based triggers produced false-positive alerts at the same time for a couple of days.
These are apparently caused by busy history syncer processes.

While everything seems to be fine on the Zabbix server side (resources, database, NVPS, ...), there were thousands of suspicious VMware-related log messages that matched exactly in time:

--- SNIP zabbix_server.log ---

   895:20150430:020102.946 item "50196356-038e-d219-ca5b-52460e478ba1:vmware.vm.memory.size.compressed[{$URL},{HOST.HOST}]" became not supported: Couldn't connect to server
   895:20150430:020102.946 item "502d13c2-bffe-b0e7-c4a2-2232b8f5d214:vmware.vm.memory.size.compressed[{$URL},{HOST.HOST}]" became not supported: Couldn't connect to server
   895:20150430:020102.946 item "502df28b-5daf-e7a4-cb42-f874507238f4:vmware.vm.memory.size[{$URL},{HOST.HOST}]" became not supported: Couldn't connect to server
   895:20150430:020102.946 item "5019e168-2dc5-223d-a82c-c857b98698c3:vmware.vm.memory.size.compressed[{$URL},{HOST.HOST}]" became not supported: Couldn't connect to server
   895:20150430:020102.946 item "5019e5ed-810f-dfa8-9577-9127c99439b3:vmware.vm.uptime[{$URL},{HOST.HOST}]" became not supported: Couldn't connect to server
   895:20150430:020102.946 item "502d8262-815b-8ff3-c7c5-7ef0e61bbdb9:vmware.vm.memory.size.compressed[{$URL},{HOST.HOST}]" became not supported: Couldn't connect to server

--- SNAP zabbix_server.log ---
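
For reference (not part of the original report), the burst of messages can be quantified by counting the "became not supported" lines per second. A minimal Python sketch, assuming the default log location and the line format shown in the snippet above:

   # Rough sketch: count VMware "became not supported" messages per second
   # in zabbix_server.log. The log path and line format are assumptions
   # based on the snippet above.
   import re
   from collections import Counter

   PATTERN = re.compile(
       r'^\s*\d+:(\d{8}:\d{6})\.\d+ item ".*vmware\..*" became not supported'
   )

   counts = Counter()
   with open("/var/log/zabbix/zabbix_server.log") as log:
       for line in log:
           match = PATTERN.match(line)
           if match:
               counts[match.group(1)] += 1   # key: YYYYMMDD:HHMMSS

   for timestamp, count in sorted(counts.items()):
       print(timestamp, count)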

After creating a maintenance period with no data collection (1 h at 02:00 and again at 04:00) for all VMware hosts, the Zabbix server no longer suffered from busy history syncer processes.
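
Such a maintenance period can also be created through the Zabbix JSON-RPC API. The following Python sketch is only an illustration - the URL, credentials and host group ID are placeholders, and the parameter names and values should be verified against the API documentation for the Zabbix version in use (for example, the login parameter was "user" in 2.x):

   # Illustrative only: create a daily "no data collection" maintenance
   # window via the Zabbix JSON-RPC API. URL, credentials and the host
   # group ID are placeholders.
   import time
   import requests

   API_URL = "https://zabbix.example.com/api_jsonrpc.php"

   def api_call(method, params, auth=None, req_id=1):
       payload = {"jsonrpc": "2.0", "method": method, "params": params,
                  "auth": auth, "id": req_id}
       response = requests.post(API_URL, json=payload)
       response.raise_for_status()
       return response.json()["result"]

   # Zabbix 2.x expects "user"; newer versions use "username".
   token = api_call("user.login", {"user": "Admin", "password": "zabbix"})

   now = int(time.time())
   api_call("maintenance.create", {
       "name": "VMware hosts - nightly no-data-collection window",
       "active_since": now,
       "active_till": now + 30 * 24 * 3600,
       "groupids": ["42"],            # placeholder: VMware host group ID
       "maintenance_type": 1,         # 1 = maintenance without data collection
       "timeperiods": [
           # timeperiod_type 2 = daily; start_time and period are in seconds
           {"timeperiod_type": 2, "every": 1, "start_time": 2 * 3600, "period": 3600},
           {"timeperiod_type": 2, "every": 1, "start_time": 4 * 3600, "period": 3600},
       ],
   }, auth=token)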

Environment

  • 1 vCenter
  • 20 Hypervisors
  • >350 Virtual Machines
  • VMware monitoring done by Zabbix proxy


 Comments   
Comment by Oleksii Zagorskyi [ 2015 May 03 ]

Marc's original attachments, just positioned for a better view:




Comment by Oleksii Zagorskyi [ 2015 May 03 ]

You attached graphs from the server and a proxy.
We see an NVPS drop of ~200 on the proxy and, accordingly, also on the server.

The suspicious VMware-related log messages are from the server log.
Are they indeed from the server?
Why, then, do we see the NVPS drop on the proxy side too?

We see the cache behavior on the proxy, but it would be good to see it for the server too.

I cannot tell where the original problem is - on the server or the proxy?
Or on both?

Comment by Marc [ 2015 May 03 ]

Yes, the messages are indeed from the server's log.

We see the NVPS drop on the proxy side too, because the VMware hosts are no longer being polled due to the connection issues (between the Zabbix proxy and the vCenter).
Without knowing the actual cause, I assume that the large number of unsupported items is somehow causing a significant load - either directly or indirectly.

I already briefly checked the count of history values grouped by item on the server's database for the time in question - nothing suspicious.
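
(The kind of check mentioned above could look roughly like the following sketch - not the exact query used - assuming a MySQL backend, the default Zabbix schema and, for brevity, only the numeric float history table:)

   # Sketch: count history values per item around the incident.
   # Assumes a MySQL backend and the default Zabbix schema; only the
   # numeric (float) history table is queried here.
   import pymysql

   connection = pymysql.connect(host="localhost", user="zabbix",
                                password="secret", database="zabbix")
   try:
       with connection.cursor() as cursor:
           cursor.execute("""
               SELECT itemid, COUNT(*) AS value_count
               FROM history
               WHERE clock BETWEEN UNIX_TIMESTAMP('2015-04-30 02:00:00')
                               AND UNIX_TIMESTAMP('2015-04-30 03:00:00')
               GROUP BY itemid
               ORDER BY value_count DESC
               LIMIT 20
           """)
           for itemid, value_count in cursor.fetchall():
               print(itemid, value_count)
   finally:
       connection.close()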

The caches on the server side show the expected picture. Since the history syncer processes reach their limit or are blocked, data is not written out of the caches fast enough; thus time-based triggers fire falsely on the longer periods.

Comment by Marc [ 2015 May 03 ]

Not sure whether it's relevant or not, but there were significantly more tuples fetched during the issue.

Without the previously mentioned maintenance period configured:

With the maintenance period configured:

Edit:
One assumption was that a lot of values might be re-sent when the VMware cache is re-filled, that log-based values might be sent again, or similar.
A look at the tuple inserts shows that there are interruptions and peaks, but no more data is inserted in total.

Comment by Marc [ 2015 May 04 ]

This issue might be a duplicate of ZBX-9466.

Comment by Marc [ 2015 Jun 01 ]

Recently we made some optimizations to the database system used for Zabbix, and I wondered whether it might be worth disabling the maintenance period created due to this issue.

Well, there is no need to think about it any further, since the maintenance period ended today (which I had completely forgotten). Apparently I somehow expected to get it fixed sooner:

Edit:
To clarify, this proves that the issue is VMware-related and that a maintenance period without data collection prevents it.

Comment by Oleksii Zagorskyi [ 2015 Jun 26 ]

Let's close the current issue in favor of ZBX-9466, as it was reported earlier.

Comment by richlv [ 2015 Jun 26 ]

Should we close this as a duplicate of ZBX-9466 then?

Comment by Marc [ 2015 Nov 23 ]

Reopened to close as a duplicate of ZBXNEXT-3051.
