[ZBX-10059] Zabbix Server Caching Issue of New Unsupported Items causing Queing Created: 2015 Nov 09  Updated: 2024 Jan 30  Resolved: 2024 Jan 30

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Proxy (P), Server (S)
Affects Version/s: 3.0.0alpha3
Fix Version/s: None

Type: Incident report Priority: Trivial
Reporter: Kenneth Palmertree Assignee: Unassigned
Resolution: Fixed Votes: 12
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

pre-zabbix-3.0.0alpha5-56596


Attachments: PNG File screenshot-1.png     File zabbix-patch-3.0.0-57353.diff    
Issue Links:
Duplicate
is duplicated by ZBX-23688 Do not run heavy queries to history t... Closed

 Description   

In a server/proxy setup, adding new templates to a host monitored by a proxy that have a lot of unsupported items causes queuing on the Zabbix server. Zabbix Server is trying to retrieve values for cache for the first time and scans the entire database. After all the unsupported items are processed to retrieve cache, zabbix server starts to process things normal. Having about 30 or 40 unsupported items causing queuing for over 30 minutes.

  2439:20151109:004113.625 slow query: 3.200867 sec, "select clock,ns,value from history_uint where itemid=49920 and clock<=1447043955 and clock>1446957555 order by clock desc limit 9"
  2442:20151109:004113.631 slow query: 3.195073 sec, "select clock,ns,value from history_uint where itemid=50051 and clock<=1447043957 and clock>1446957557 order by clock desc limit 1"
  2442:20151109:004124.706 slow query: 11.074959 sec, "select clock,ns,value from history_uint where itemid=50051 and clock<=1446957557 and clock>1446352757 order by clock desc limit 1"
  2439:20151109:004124.706 slow query: 11.080613 sec, "select clock,ns,value from history_uint where itemid=49920 and clock<=1446957555 and clock>1446352755 order by clock desc limit 9"
  2442:20151109:004134.264 slow query: 9.558351 sec, "select clock,ns,value from history_uint where itemid=50051 and clock<=1446352757 and clock>1443760757 order by clock desc limit 1"
  2439:20151109:004134.264 slow query: 9.558318 sec, "select clock,ns,value from history_uint where itemid=49920 and clock<=1446352755 and clock>1443760755 order by clock desc limit 9"
  2439:20151109:004140.675 slow query: 3.200069 sec, "select clock,ns,value from history_uint where itemid=49977 and clock<=1447043955 and clock>1446957555 order by clock desc limit 1"
  2442:20151109:004140.677 slow query: 3.202882 sec, "select clock,ns,value from history_uint where itemid=50049 and clock<=1447043957 and clock>1446957557 order by clock desc limit 9"
  2442:20151109:004151.735 slow query: 11.057315 sec, "select clock,ns,value from history_uint where itemid=50049 and clock<=1446957557 and clock>1446352757 order by clock desc limit 9"
  2439:20151109:004151.735 slow query: 11.060376 sec, "select clock,ns,value from history_uint where itemid=49977 and clock<=1446957555 and clock>1446352755 order by clock desc limit 1"
  2442:20151109:004201.285 slow query: 9.549839 sec, "select clock,ns,value from history_uint where itemid=50049 and clock<=1446352757 and clock>1443760757 order by clock desc limit 9"
  2439:20151109:004201.285 slow query: 9.549614 sec, "select clock,ns,value from history_uint where itemid=49977 and clock<=1446352755 and clock>1443760755 order by clock desc limit 1"
  2442:20151109:004207.564 slow query: 3.119583 sec, "select clock,ns,value from history where itemid=50058 and clock<=1447043957 and clock>1446957557 order by clock desc limit 9"
  2439:20151109:004207.702 slow query: 3.203791 sec, "select clock,ns,value from history_uint where itemid=49933 and clock<=1447043955 and clock>1446957555 order by clock desc limit 9"
  2442:20151109:004218.657 slow query: 11.092807 sec, "select clock,ns,value from history where itemid=50058 and clock<=1446957557 and clock>1446352757 order by clock desc limit 9"
  2439:20151109:004218.794 slow query: 11.091753 sec, "select clock,ns,value from history_uint where itemid=49933 and clock<=1446957555 and clock>1446352755 order by clock desc limit 9"
  2442:20151109:004228.143 slow query: 9.485291 sec, "select clock,ns,value from history where itemid=50058 and clock<=1446352757 and clock>1443760757 order by clock desc limit 9"
  2439:20151109:004228.364 slow query: 9.569832 sec, "select clock,ns,value from history_uint where itemid=49933 and clock<=1446352755 and clock>1443760755 order by clock desc limit 9"
  2442:20151109:004234.550 slow query: 3.223907 sec, "select clock,ns,value from history_uint where itemid=50052 and clock<=1447043957 and clock>1446957557 order by clock desc limit 9"
  2439:20151109:004234.786 slow query: 3.208124 sec, "select clock,ns,value from history_uint where itemid=49923 and clock<=1447043955 and clock>1446957555 order by clock desc limit 9"
  2442:20151109:004245.629 slow query: 11.078849 sec, "select clock,ns,value from history_uint where itemid=50052 and clock<=1446957557 and clock>1446352757 order by clock desc limit 9"
  2439:20151109:004245.863 slow query: 11.077394 sec, "select clock,ns,value from history_uint where itemid=49923 and clock<=1446957555 and clock>1446352755 order by clock desc limit 9"
  2442:20151109:004255.229 slow query: 9.599728 sec, "select clock,ns,value from history_uint where itemid=50052 and clock<=1446352757 and clock>1443760757 order by clock desc limit 9"
  2439:20151109:004255.428 slow query: 9.564944 sec, "select clock,ns,value from history_uint where itemid=49923 and clock<=1446352755 and clock>1443760755 order by clock desc limit 9"
  2442:20151109:004301.695 slow query: 3.235566 sec, "select clock,ns,value from history_uint where itemid=50057 and clock<=1447043957 and clock>1446957557 order by clock desc limit 9"
  2439:20151109:004301.850 slow query: 3.207687 sec, "select clock,ns,value from history_uint where itemid=49938 and clock<=1447043955 and clock>1446957555 order by clock desc limit 1"
  2442:20151109:004312.836 slow query: 11.141588 sec, "select clock,ns,value from history_uint where itemid=50057 and clock<=1446957557 and clock>1446352757 order by clock desc limit 9"
  2439:20151109:004312.943 slow query: 11.092860 sec, "select clock,ns,value from history_uint where itemid=49938 and clock<=1446957555 and clock>1446352755 order by clock desc limit 1"
  2442:20151109:004322.445 slow query: 9.608494 sec, "select clock,ns,value from history_uint where itemid=50057 and clock<=1446352757 and clock>1443760757 order by clock desc limit 9"
  2439:20151109:004322.494 slow query: 9.550972 sec, "select clock,ns,value from history_uint where itemid=49938 and clock<=1446352755 and clock>1443760755 order by clock desc limit 1"
  2439:20151109:004328.883 slow query: 3.193238 sec, "select clock,ns,value from history_uint where itemid=49942 and clock<=1447043954 and clock>1446957554 order by clock desc limit 9"
  2442:20151109:004328.894 slow query: 3.222520 sec, "select clock,ns,value from history_uint where itemid=50048 and clock<=1447043957 and clock>1446957557 order by clock desc limit 1


 Comments   
Comment by Andris Zeila [ 2015 Nov 09 ]

Could you please answer to few questions:

  1. Do you have actions based on internal notifications to report unsupported items? If so - what is the action subject/message ?
  2. Could you check the Zabbix server Zabbix value cache operating mode item? Zabbix should not read the same history data from database unless value cache is working in low memory mode.
Comment by Kenneth Palmertree [ 2015 Nov 10 ]

No problem Andris. Thanks for the help.

This situation only occurs when you add a new template and if there are a lot (50 or more) of items that are unsupported for the first time. Since it is a brand new item and unsupported, Zabbix scans the entire history table to look for values since it can not find any in the first hour. Scans can take up to 10 secs run per unsupported item with no history data. I did not start seeing the issue until my history table grew in size (about 30+ GB).

1. Do you have actions based on internal notifications to report unsupported items?
If so - what is the action subject/message ?

I just added a new template to a host that has unsupported items and captured the following information:

Not supported: HOST_1:mp CMS Perm Gen used
Host: HOST_1
Item: mp CMS Perm Gen used
Key: jmx["java.lang:type=MemoryPool,name=CMS Perm Gen",Usage.used]
State: Not supported

2. Could you check the Zabbix server Zabbix value cache operating mode item?

It was normal the whole time. Attaching the graph..

Comment by Andris Zeila [ 2015 Nov 10 ]

What do you mean with 'unsupported for the first time' ? Item will change to unsupported state only after a check has been made. So new items should not have unsupported state.

The database selects were made by 2442 and 2439 processes, could you please check what type of processes they were (history syncers, escalators etc) ?

Regarding caching - simply adding new items should not cause caching. Caching will be be done when item value is requested either by trigger functions or notification macros. And trigger functions are not calculated if item changes its state to unsupported. So in theory it leaves only actions based on internal events. However ITEM.VALUE, ITEM.LASTVALUE macros that could cause value cache access are not supported in internal event based notifications.

Judging from the selects Zabbix was trying to get 1 and 9 last values from value cache. While reading 1 last value could be a result of last() function, which is commonly used in trigger expressions, reading last 9 values looks to be something more specific. Are you using any trigger functions based on last 9 item values?

Comment by Kenneth Palmertree [ 2015 Nov 11 ]

Andris, thank you. You gave me an idea and it was the triggers like you said. My triggers were using max(#10) and when I changed them to max(10m) the issue went away mostly but still have a few database scan from the last(0) function. I am assuming Zabbix server is trying to scan the database to look for the last 10 polls and ends up scan it until it reaches the end since there are no values in the database since the item is Not supported. Using 10 minutes it just looks at the last 10 minute time interval and less load on the database.

I am still seeing database scans for Not supported items when I use last(0) function in a trigger and those are the slow queries with limit 1 at the end from doing some comparisons in the database.

Example of trigger before change:
{-TMP.JMX.GENERIC:jmx["java.lang:type=Memory",HeapMemoryUsage.used].min(#10)}>({-TMP.JMX.GENERIC:jmx["java.lang:type=Memory",HeapMemoryUsage.max].last(0)}*0.95)

After trigger change:
{-TMP.JMX.GENERIC:jmx["java.lang:type=Memory",HeapMemoryUsage.used].min(10m)}>({-TMP.JMX.GENERIC:jmx["java.lang:type=Memory",HeapMemoryUsage.max].last(0)}*0.95)

Thanks for your suggestion and explaining how the cache works. I will probably just stay away from using #num in the trigger functions. Not sure what to do about last(0) though.

Comment by Andris Zeila [ 2015 Nov 11 ]

That would explain the database requests. When zabbix needs number of history values it first tries to read the requested values from last hour, then day, then month ... (this was optimized for partitioned tables).

Triggers are recalculated when item used in trigger expression receives a new value. So for a new items it will have one value. To calculate min(#10) zabbix will request 10 last values from value cache. If the item is not yet cached value cache will attempt to read 10 values from last hour. After getting 1 value the cache will try to get the other 9 values from last day, month...

There is nothing much we can do about last(). Caching at least one value for all items would be excessive. Storing total number of history values for new items might help, but there is no easy place/way to do that.

Comment by Kenneth Palmertree [ 2015 Dec 31 ]

Just wanted to share this in case it helps anyone else . I found a fix for this. The problem is to do with MySQL and "late row lookups". Please look at ZBX-10229 for more information about late row lookups. Late row lookups in MySQL (http://explainextended.com/2009/10/23/mysql-order-by-limit-performance-late-row-lookups/) caused the whole history table to be scanned. Attached, zabbix-patch-3.0.0-57353.diff, is my patch that fixed my issue and improved the database performance.

Comment by Filipe Paternot [ 2016 Aug 02 ]

Zabbix 3.0.3

With 30min timeout on php and nginx, i just finished one template update:

  • update template A linked to some 3k hosts
  • add template B to template A
  • update

And...

******************** Script profiler ********************
Total time: 280.076568
Total SQL time: 53.552156
SQL count: 78270 (selects: 39196 | executes: 39074)
Peak memory usage: 2817M
Memory limit: 8G

Running php with xcache, nginx with gzip, and few other optimizations.
Database with 64G ram, 24 cores, mysql 5.6.12, ~250G partitioned history* and trends* tables, zabbix server writting some ~3knvps and nothing else. Frontend was in maintenance to reduce load/concurrency. I was the only one logged in. Database load was ~1 and 0.1 iowait. Pretty much the same as the frontend itself.

Comment by Bruno Galindro da Costa [ 2017 Mar 29 ]

Guys,

Will this patch be applyed to Zabbix 3? I'm having the same problem after upgrading Zabbix Server from 2.4.7 to 3.2.4, but in my case, I'm not using proxy, just a server.

Comment by Fabricio H [ 2017 Mar 29 ]

Hi,
I have this same problem with the version 3.2.4!!!

Comment by Claudio Lima [ 2017 Mar 29 ]

Hello guys
I'm having the same problem with the 3.2.4 version

Comment by Andris Zeila [ 2017 Mar 29 ]

Itemid + clock is not guaranteed to be unique. While in some setups joining on itemid+clock might be fine, this can't be used in generic installations. Also while this patch might improve Mysql performance, there is no telling what would be the effect with other supported databases (postgresql, oracle, db2).

So unfortunatelly this patch can't be pushed into upstream. However it does give ideas for possible optimizations.

Comment by Eduardo Garcia [ 2017 Mar 29 ]

Same issue here. Also with version 3.2.4.

Comment by Andris Zeila [ 2017 Mar 29 ]

Just to clarify, the 'same issue' - is the issue initially discussed here, and not the template update performance problem reported by Filipe ?
Thanks.

Comment by Eduardo Garcia [ 2017 Mar 29 ]

Yes, it's the issue initially discussed here.

Comment by Bruno Galindro da Costa [ 2017 Mar 29 ]

Yes @Andris

Comment by Rodrigo Gonçalves [ 2017 Mar 29 ]

I'm having the same issue with Zabbix 3.2.4.

Comment by richlv [ 2017 Mar 29 ]

please see http://zabbix.org/wiki/Docs/bug_reporting_guidelines , particularly the part that starts with 'Don't add "me too"'
remember, if zabbix developers have to read through many "me too" comments, they don't have the time to fix problems.

Comment by Vladislavs Sokurenko [ 2024 Jan 30 ]

This should be fixed by ZBX-23688

Generated at Tue Apr 29 08:18:04 EEST 2025 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.