[ZBXNEXT-4661] Excessive unexplained value cache hits leads to poor zabbix server performance Created: 2018 Aug 01  Updated: 2018 Aug 17  Resolved: 2018 Aug 17

Status: Closed
Project: ZABBIX FEATURE REQUESTS
Component/s: Server (S)
Affects Version/s: 3.4.11
Fix Version/s: None

Type: Change Request
Priority: Major
Reporter: James Cook
Assignee: Unassigned
Resolution: Won't fix
Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Zabbix 3.4.11 Server, 8 * Zabbix 3.4.11 Proxies
Centos 7
Postgresql 10


Attachments:
JPEG File Zabbix Server - Configuration Cache.JPG
JPEG File Zabbix Server - History Cache.JPG
JPEG File Zabbix Server - History Syncer Process.JPG
JPEG File Zabbix Server - Timer Process.JPG
JPEG File Zabbix Server - Value Cache Hits.JPG
JPEG File Zabbix Server - Value Cache.JPG
JPEG File Zabbix Server - Value Performance.JPG
JPEG File Zabbix Server Monitored Hosts Items Triggers.JPG
XML File Zabbix Template With Yesterdays changes.xml

 Description   

Hi,

We have an issue where Value Cache hits jump sharply for no apparent reason, after which the History cache builds up until we restart the server. This happens regularly, i.e. every week or two.

I have included several graphs covering the last 7 days. You can clearly see that at approx. 2-3pm yesterday there was a massive jump in value cache hits, with no corresponding jump in the monitoring configuration, and the history cache started to rise at exactly the same time.

The attached graphs show value cache hits, process performance, cache capacity, value performance and monitoring configuration.

Regards

James



 Comments   
Comment by James Cook [ 2018 Aug 01 ]

Hi,

I have attached the value cache graph as well, to show that the value cache did not really grow during that period. This may suggest that something is hitting the value cache too frequently rather than returning too much data.

Cheers

James

Comment by James Cook [ 2018 Aug 01 ]

Hi,

I would like to know if there is a way to dump out what is in the value cache and what is in the configuration cache in order to identify potential triggers in memory that could be causing the excessive hits?

Cheers

James

Comment by James Cook [ 2018 Aug 01 ]

Hi, 

I have attached the template that was applied to 750 hosts yesterday at 2-3pm which appears to correlate with the issue.

We have tons of identical triggers that do not cause the same issue, so I am puzzled why this is the case.

I also removed this template that was applied to the hosts yesterday and even after a configuration cache reload the issue still exists.

I am wondering whether the triggers have actually been deleted from the live configuration cache, which is why I would like to dump a list of the triggers in the live configuration cache.

Cheers

James

Comment by Alexey Pustovalov [ 2018 Aug 01 ]

James,

Unfortunately, it is not possible to dump the value cache. I suspect the cause is log items. Do you have log monitoring?

Comment by James Cook [ 2018 Aug 01 ]

Hi Alexey,

We do monitor the syslog (/var/log/messages) on our Linux systems and do some light Windows event log monitoring; however, this has been in place for years.

Cheers

James

Comment by Alexey Pustovalov [ 2018 Aug 01 ]

James,

I did not say there is something new. Maybe you changed a trigger expression, or log monitoring sometimes receives many more records than usual. You can check this in the history_log table.

Comment by James Cook [ 2018 Aug 01 ]

Hi Alexey,

Sorry, my misinterpretation...

I have counted the history_log rows per day as follows:

26/07/2018 - 56790
27/07/2018 - 65303
28/07/2018 - 52207
29/07/2018 - 51363
30/07/2018 - 65147
31/07/2018 - 256293
01/08/2018 - 41361 (to now only 11 hours)
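
To put the jump in these counts into numbers, here is a minimal sketch that compares each full day with the median day. The row counts are the ones quoted above; the function name `find_spikes` and the threshold `SPIKE_FACTOR` are assumptions made for illustration, not Zabbix settings. The partial day (01/08/2018) is left out because it covers only 11 hours.

```python
from statistics import median

# Daily history_log row counts quoted in the comment (full days only).
daily_rows = {
    "26/07/2018": 56790,
    "27/07/2018": 65303,
    "28/07/2018": 52207,
    "29/07/2018": 51363,
    "30/07/2018": 65147,
    "31/07/2018": 256293,
}

# Assumed threshold: a day counts as a spike if it exceeds this multiple
# of the median day. This value is illustrative only.
SPIKE_FACTOR = 2.0


def find_spikes(counts, factor=SPIKE_FACTOR):
    """Return (day, rows, ratio_to_median) for days above factor * median."""
    baseline = median(counts.values())
    return [
        (day, rows, rows / baseline)
        for day, rows in counts.items()
        if rows > factor * baseline
    ]


if __name__ == "__main__":
    for day, rows, ratio in find_spikes(daily_rows):
        print(f"{day}: {rows} rows, {ratio:.1f}x the median day")
```

The median is used as the baseline so that the spike day itself does not inflate the figure it is compared against.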

There does seem to be an increase at some point yesterday, so I will check whether it happened during 2-3pm, where the graph shows the jump.

If so, I will then identify which items have increased and look at disabling them temporarily.

Cheers

James

Comment by Alexey Pustovalov [ 2018 Aug 01 ]

James,

It is better to group by itemid + clock (by hour). That will help you understand which item is responsible.
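
A minimal sketch of that grouping, assuming `clock` holds a Unix timestamp in seconds so that integer division by 3600 buckets rows by hour. Only the `history_log` table and its `itemid` and `clock` columns come from this thread. The in-memory SQLite table, the sample rows and the helper names are hypothetical stand-ins so the example runs without a Zabbix database. Against PostgreSQL the same SELECT should apply, but the placeholder syntax depends on the driver.

```python
import sqlite3

# Count rows per item per hour. (clock / 3600) * 3600 uses integer
# division to truncate each timestamp to the start of its hour.
QUERY = """
SELECT itemid,
       (clock / 3600) * 3600 AS hour_start,
       COUNT(*) AS row_count
FROM history_log
WHERE clock >= ?
GROUP BY itemid, (clock / 3600) * 3600
ORDER BY row_count DESC
"""

# Arbitrary hour-aligned Unix timestamp used for the sample data.
BASE = 3600 * 425834


def rows_per_item_per_hour(conn, since):
    """Return (itemid, hour_start, row_count) tuples, busiest first."""
    return conn.execute(QUERY, (since,)).fetchall()


def build_demo_db():
    """Create a toy history_log table with made-up rows (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE history_log (itemid INTEGER, clock INTEGER)")
    rows = []
    # Item 101: a quiet log item, 3 rows in the first hour, 2 in the second.
    rows += [(101, BASE + 60 * i) for i in range(3)]
    rows += [(101, BASE + 3600 + 60 * i) for i in range(2)]
    # Item 202: a noisy log item, 50 rows within the second hour.
    rows += [(202, BASE + 3600 + i) for i in range(50)]
    conn.executemany("INSERT INTO history_log VALUES (?, ?)", rows)
    return conn


if __name__ == "__main__":
    for itemid, hour_start, row_count in rows_per_item_per_hour(build_demo_db(), 0):
        print(itemid, hour_start, row_count)
```

The item at the top of the result for the hour in which the value cache hits jumped is the first candidate to inspect.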

Comment by James Cook [ 2018 Aug 01 ]

Hi Alexey,

Wow you were spot on...

I found an individual item that had increased its submission rate (using SQL)....

I disabled the trigger and it went straight back to normal.... 

I will keep an eye on it for a couple of hours and then we can close it.

Something for me to look at in the future for similar problems.

Cheers

James

Comment by Alexey Pustovalov [ 2018 Aug 01 ]

Ok. Also, please do not forget that support.zabbix.com is used for bug reports, not for configuration problems, performance problems, etc. Please use the forum for such cases in future.

Comment by James Cook [ 2018 Aug 01 ]

Hi Alexey,

No problems and I will remember ... Thumbs up for being so quick.

Cheers

James
