[ZBX-26116] Excessive Event Log Data from vmware.eventlog key crashes Zabbix Created: 2025 Feb 28  Updated: 2025 Apr 09

Status: Confirmed
Project: ZABBIX BUGS AND ISSUES
Component/s: Server (S)
Affects Version/s: 7.0.9
Fix Version/s: None

Type: Problem report Priority: Blocker
Reporter: Michal Kudlacz Assignee: Michael Veksler
Resolution: Unresolved Votes: 2
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File image-2025-03-13-11-47-48-720.png     PNG File image-2025-03-18-14-31-53-964.png     PNG File image-2025-03-18-14-32-15-291.png    
Issue Links:
Duplicate
Sprint: Support backlog

 Description   

While analyzing zabbix_server -R diaginfo output, I identified three item IDs dominating the history write cache. All three corresponded to the official vmware.eventlog[url,skip] key, assigned to three different VMware hosts.

  • the collected value counts were 1.2M, 600K, and 400K
  • this occurred within less than an hour of Zabbix server uptime
  • it impacted performance and agent availability (see the sketch below for how the items were identified)
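For reference, the breakdown that exposes such items can be reproduced with the history cache section of the diaginfo runtime command (a sketch only; the exact output differs per installation):

zabbix_server -R diaginfo=historycache

The top values list in the resulting log output should show which item IDs hold the most cached values, which is how the three vmware.eventlog items were identified here.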

Fix:
Disabled the item at the template level, preventing further event log collection.
Within minutes, we observed a significant improvement in history syncer performance.

The issue is that someone who is unaware of this could download the updated template, set up VMware monitoring, and crash the entire infrastructure.



 Comments   
Comment by Marco Hofmann [ 2025 Mar 03 ]

On Friday, we upgraded from 6.4.21 to 7.0.10. Since then, 3 out of 9 vCenters report a timeout on the item 'vmware.eventlog[{$VMWARE.URL},skip]'.

I suspect this might be related, but in our case it doesn't crash; it just never finishes.

Comment by Michael Veksler [ 2025 Mar 03 ]

Hi starko,

Please confirm that you were successfully receiving event logs for all vCenters before upgrading.

I suspect that the 3 vCenters with issues had not sent events for a long time, and the delayed events overloaded the server when they started coming in.

Comment by Marco Hofmann [ 2025 Mar 04 ]

Hi MVekslers! Long time no read.

I'll try to gather as much information as possible.

Zabbix server 7.0.10 was started for the first time at exactly:
2466:20250228:132331.216 Starting Zabbix Server. Zabbix 7.0.10 (revision 237358b56b2).
28.02.2025 13:23:41

The Zabbix proxies were all upgraded to 7.0.10 a few hours later with SaltStack, the last one at about 28.02.2025 17:00:00.

Our main vCenter datacenter, with 33 vSphere Hosts and 787 virtual machines, still reports its event log without issues.

One of our key customers has a vCenter datacenter with 6 vSphere hosts and 108 virtual machines, which also still reports its event log without issues.

The reason I know event log monitoring stopped working is not that I double-checked the latest data, but that I created a nodata trigger for vmware.eventlog years ago:

VMware: Event log monitoring stopped

nodata(/vcenter.customer.de/vmware.eventlog[{$VMWARE.URL},skip],8h)=1

Affected customer 1 stopped reporting events at 2025-03-01 10:28:22 PM

Affected customer 2 stopped reporting events at 2025-03-01 11:06:42 PM

Affected customer 3 stopped reporting events at 2025-03-02 08:02:26 PM

All three stopped reporting event logs several hours after the Zabbix proxy upgrade.

Please tell me what other debug information you need.

EDIT 04.03.2025 11:48:

I have to apologize, the error is gone. 

It just came to my mind that I hadn't restarted all Zabbix proxy VMs after the upgrade from 6.4.21 -> 7.0.10, so I rebooted the three Debian Linux VMs in question.

A few minutes later, the nodata trigger events closed themselves as new event log data was being delivered.

Comment by Michael Veksler [ 2025 Mar 04 ]

Hi starko,

Thanks for the info.

Comment by Michael Veksler [ 2025 Mar 07 ]

Hi starko,
Could you estimate the number of events per day for all vCenters?
I am thinking about introducing some rate limits to protect the history cache.

Comment by Marco Hofmann [ 2025 Mar 13 ]

Sorry for the late update, but I had to be sure. The nodata trigger for the vmware.eventlog item key fires for (nearly) every vCenter every other day, so there is definitely something wrong since switching from 6.4.21 to 7.0.10.

 

How can I provide the data you need to troubleshoot this?

Comment by Michael Veksler [ 2025 Mar 13 ]

To protect the HistoryCache, a "fuse" was introduced that stops event collection when HistoryCache usage is > 80%.
Can you check the HistoryCache usage during the problem and increase the HistoryCacheSize value in case usage is around 80%?
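As a minimal sketch (assuming the stock internal item key and the default config file locations; adjust for your installation), the write cache usage can be tracked with the internal item

zabbix[wcache,history,pfree]

and the cache itself can be enlarged in zabbix_server.conf (or zabbix_proxy.conf for a proxy), for example:

HistoryCacheSize=256M

followed by a restart of the server/proxy for the change to take effect.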

Comment by Marco Hofmann [ 2025 Mar 18 ]

Please tell me if these are the values you need. If not, please tell me where to find them.

The Zabbix proxy is 7.0.10 on Debian 12 and is monitored with version 7.0-1 of the official template: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/app/zabbix_proxy?at=refs%2Fheads%2Frelease%2F7.0

For HistoryCache, I guess you mean the metrics of the Zabbix proxy? If so, this Zabbix proxy is affected (vmware.eventlog nodata) at the moment, but the values do not look relevant to the problem:

Comment by Michael Veksler [ 2025 Mar 19 ]

Hi starko,

Please increase the log level:

zabbix_server -R log_level_increase="vmware collector" 

Send me a log with the problem, please.
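To revert the verbosity afterwards, the matching runtime command should do it:

zabbix_server -R log_level_decrease="vmware collector"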

Btw, can you test a dev build? (I am not sure that the current logs are enough.)

Comment by Marco Hofmann [ 2025 Mar 21 ]

I sent you the VMware debug log via e-mail.

I don't think I can test a dev build; I only have access to this prod environment, I'm sorry.

Comment by Vladislavs Sokurenko [ 2025 Apr 09 ]

Do you have any information on how much memory was used, for example by the preprocessing manager, mkudlacz?

Comment by Vladislavs Sokurenko [ 2025 Apr 09 ]

Shouldn't the template also be updated so it has some kind of filtering for the event log, or be disabled by default?
