[ZBX-9201] History syncers can not process a lot of values for a single item fast enough Created: 2015 Jan 09 Updated: 2017 May 30 Resolved: 2016 Jan 18 |
|
Status: | Closed |
Project: | ZABBIX BUGS AND ISSUES |
Component/s: | Server (S) |
Affects Version/s: | 2.2.8 |
Fix Version/s: | None |
Type: | Incident report | Priority: | Critical |
Reporter: | Alexey Pustovalov | Assignee: | Unassigned |
Resolution: | Duplicate | Votes: | 13 |
Labels: | dbsyncer, history, log, performance | ||
Remaining Estimate: | Not Specified | ||
Time Spent: | Not Specified | ||
Original Estimate: | Not Specified |
Attachments: |
Issue Links: |
|
Description |
If an item receives a lot of values in a short time period, the Zabbix history syncers cannot process the data in time. Eventually the history cache fills up, and Zabbix stops receiving new values and stops processing other items. |
Comments |
Comment by Aleksandrs Saveljevs [ 2015 Jan 09 ] |
This issue seems to be the same as |
Comment by Christian Reizlein [ 2015 Mar 30 ] |
I have experienced this issue several times, especially when an item of the LOG monitor type starts sending a lot of values, i.e. more than 5000. Zabbix gets stuck trying to save those values to the database and does not store the other items' values from the cache until it finishes, creating and triggering lots of "no data received for 5 minutes" triggers. |
Comment by Filipe Paternot [ 2015 Apr 15 ] |
Yes, about the same problem here, though a little bit different. What we've managed to diagnose:
The solution we managed to come up with was to split the hot data from the cold one. We created a new pair of mysql servers that keep the trends and history longer. The env before the issue (all data: 7 days history, 15 months of cold data/trends):
The CLONED env (fresh history/trends tables):
The env AFTER the issue (all data: 7 days history, 1 month of cold data/trends):
I've attached before.log and cloned.log; they both show strace, ltrace, lsof output and a few slices of zabbix_server.log. |
Comment by Oleksii Zagorskyi [ 2015 Apr 16 ] |
Filipe, your detailed stats are impressive, thanks! |
Comment by Filipe Paternot [ 2015 Apr 16 ] |
Hello Oleksiy, The initial problem was that I had a huge queue on the Zabbix server (the proxies were delivering to the server fast enough, as usual), and we also had thousands of nodata() triggers firing. This was where we started our diagnosis. Our first step was to look at the DB. It was just fine: we had no queries running for more than 30 seconds, and the longest ones came from the frontend (tr_status.php). The DB server has 24 processors, and load was just the usual, about 3.0. One funny thing we noticed afterwards was that the History Cache had started to fill gradually about 36 hours before the issue (I have now added two screenshots of this). We found that most queries zabbix_server ran at that time came from this block of code (dbcache.c):

```c
if (0 != ids_num)
{
    sql_offset = 0;
    zbx_snprintf_alloc(&sql, &sql_alloc, &sql_offset,
            "select distinct itemid"
            " from %s"
            " where clock>=%d and",
            table_name, clock);

    DBadd_condition_alloc(&sql, &sql_alloc, &sql_offset, "itemid", ids, ids_num);

    result = DBselect("%s", sql);
```

Now, back to your question:
This (including the query it runs on trends and trends_uint) led us to conclude that for some reason the db syncer sometimes needs info from trends. As we drastically reduced its size, the queries started to run fast enough again. |
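For illustration, the statement built by that block ends up looking roughly like the following (a sketch only: the table name, clock value and itemids are made up, and the real statement carries one entry per item being flushed, which is what made it so expensive against a 15-month trends table):

```sql
select distinct itemid
  from trends
 where clock>=1428969600
   and itemid in (23296,23297,23298);  -- in practice the list can hold thousands of itemids
```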
Comment by Oleksii Zagorskyi [ 2015 Apr 16 ] |
Just in case - those two screenshots are identical, probably by mistake. On the screenshot, looking at the history cache "saw" pattern, it looks like strong 5-minute interval peaks. It could be, for example, some log* item(s) with a 5-minute update interval. What is "DataSenderFrequency" for the proxies? Of course the historical data size (history first of all) may have an impact on the db syncers, as they work with history to calculate triggers and may ask the database for history. Note that there are improvements regarding ValueCache. The code block you mentioned is from the "DCflush_trends" function (purpose: flush trends to the database). |
Comment by Filipe Paternot [ 2015 Apr 16 ] |
My bad, I've just uploaded the correct #2, a day after the cache started that behaviour. Since we didn't collect screenshots of nvps and internal process busyness at the time, we can only get them from trends now; I attached them here as well. Note that some data was lost during restarts and other attempts to restore performance. Our proxies work in passive mode, not active, so we have ProxyDataFrequency=1 (about the same behaviour, only limited by StartProxyPollers=3). In this case we observed that the history* tables did not have that much impact on the db syncer, as we have as much data now as we had then and the problem does not occur (it actually seems a lot faster). In practical terms, we only reduced the size of trends (and partitioned history* by day). Regarding the above: while the history syncer processes were 100% busy, we mostly saw that query from "DCflush_trends" (select distinct itemid from trends/trends_uint where...) hitting the database, and no inserts on history nor trends. So you are probably right about it being heavier when zabbix_server is (re)started. I just still wonder why we need to perform selects on trends in order to insert history, and why it locks the dbsyncer (or does it lock the valuecache?). |
Comment by Oleksii Zagorskyi [ 2015 May 12 ] |
Current server behavior can cause events creation like discussed in ZBX-9432. |
Comment by Christian Reizlein [ 2015 May 12 ] |
That is exactly the problem we have, and sometimes it goes beyond 30 minutes, depending on the size of the flood. Since we use proxies to get all our data, all the data gets stuck lagging on the proxy and all items have missing data for 30 minutes or more, causing big havoc with all the triggers. As a remediation, or sort of prevention until this is truly fixed (and data is sent/processed in parallel), I had to write an external script that checks the pending records on the proxy and, if they start piling up, cleans the offending log entries and sends an alert. |
Comment by Filipe Paternot [ 2015 May 12 ] |
Christian, can you elaborate on that, or share the script? Oleksiy, I'm not sure what you mean by "3 log* items". Three x log x items? In our current env we have a pretty small (negligible) number of log items and a bunch of unsigned ones. Of all the triggers we saw fire, every one had the nodata() function in its expression, so we'd guess it has something to do with the history syncer not keeping up at the necessary speed. Also, the events were fired just fine, in line with this behaviour. |
Comment by Christian Reizlein [ 2015 May 12 ] |
Filipe, I made a simple PHP script that just sends me an email, but you can go from there. The important part is the query on the proxy database and, as I use sqlite on those, below is the query I run:

```sql
SELECT count(*) AS Registros, itemid
  FROM proxy_history
 WHERE proxy_history.id > (SELECT nextid FROM ids WHERE field_name='history_lastid')
 GROUP BY itemid
HAVING Registros > 500
 ORDER BY 1 DESC
 LIMIT 50;
```

Note the "> 500": that will list items with more than 500 records left behind in the local database, so you can act from there. Of course you can change that 500 to anything you wish. Hope that helps; I can attach the whole script if you prefer. |
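In case it helps others, a minimal sketch of such a checker could look like the code below. This is not the script mentioned above; the database path, the 500-record threshold and the alert address are assumptions to adapt to your own proxy:

```php
<?php
// Backlog checker sketch for an SQLite-based Zabbix proxy.
// Assumption: the proxy DB lives at /var/lib/zabbix/zabbix_proxy.db and PHP's SQLite3 extension is available.
$db = new SQLite3('/var/lib/zabbix/zabbix_proxy.db', SQLITE3_OPEN_READONLY);

$result = $db->query(
    "SELECT count(*) AS Registros, itemid
       FROM proxy_history
      WHERE proxy_history.id > (SELECT nextid FROM ids WHERE field_name='history_lastid')
      GROUP BY itemid
     HAVING Registros > 500
      ORDER BY 1 DESC
      LIMIT 50"
);

$lines = array();
while ($row = $result->fetchArray(SQLITE3_ASSOC)) {
    // Each row is an item with more than 500 values still waiting to be sent to the server.
    $lines[] = sprintf('itemid %d: %d pending values', $row['itemid'], $row['Registros']);
}

if (0 < count($lines)) {
    // Alert only; cleaning the offending proxy_history rows would be a separate, careful step.
    mail('admin@example.com', 'Zabbix proxy backlog warning', implode("\n", $lines));
}
$db->close();
```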
Comment by Oleksii Zagorskyi [ 2015 May 12 ] |
Filipe, yes, I meant three items for log monitoring (log, logrt or eventlog - doesn't matter). Here is an example SQL query to check how many values the zabbix server got for an item in a given period, applicable to the zabbix-server DB only:

```sql
select count(*), itemid
  from history_log
 WHERE clock between unix_timestamp('2015-05-09 06:00') and unix_timestamp('2015-05-09 07:00')
 GROUP BY itemid
 ORDER BY count(*) DESC
 LIMIT 20;
```

This way you can check and make sure you indeed had a single (or several) log item flooding for a short period of time. |
Comment by Oleksii Zagorskyi [ 2015 May 21 ] |
Current issue was mentioned in ZBXNEXT-2823 - troubles with bulk data sending by zabbix_sender. |
Comment by Oleksii Zagorskyi [ 2015 May 29 ] |
I'm not very sure it would help much, but still... I think it would be a good approach for all Zabbix instances (which may be affected by the current issue) to limit active agents so that they do not send many values for single items too fast. Not many users understand what the item update interval means, and many set it to 60-300 seconds, which of course is not very good. The relevant agent config parameters are: BufferSend, BufferSize, MaxLinesPerSecond (may be overridden in the agent item key parameter). I suggest setting BufferSend=1, BufferSize=2 (1-5), MaxLinesPerSecond=3 (1-5). Bad logs (which may produce 10-100 lines every second for a long time) are unpredictable, and usually it is not worth delivering them to the zabbix server very fast at all. Please correct me if I'm wrong. |
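For reference, a zabbix_agentd.conf excerpt with those values might look like this (a sketch only; the exact numbers should be tuned per environment, and the maxlines parameter of a log[] item key can still override MaxLinesPerSecond for that item):

```
# Do not keep collected values in the buffer longer than 1 second.
BufferSend=1
# Keep the in-memory buffer tiny so a flooding item cannot queue up much data (2 is the minimum).
BufferSize=2
# Maximum number of new log lines the agent will send per second for log/eventlog active checks.
MaxLinesPerSecond=3
```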
Comment by Marc [ 2015 May 29 ] |
ZBXNEXT-1843 and |
Comment by Alexander Vladishev [ 2015 Sep 16 ] |
Related issue: |
Comment by Oleksii Zagorskyi [ 2015 Sep 16 ] |
Not related, but when working on this issue take a look to ZBXNEXT-2617 please. |
Comment by Marc [ 2015 Nov 26 ] |
Can it be excluded that this is related to |
Comment by Andris Zeila [ 2015 Dec 14 ] |
This issue should be solved by the history cache redesign in |
Comment by Oleksii Zagorskyi [ 2016 Jan 18 ] |
Andris, can we close this one as duplicate? |
Comment by Andris Zeila [ 2016 Jan 18 ] |
Looks like |
Comment by Oleksii Zagorskyi [ 2016 Jan 18 ] |
Closed as duplicate of |