[ZBX-9201] History syncers can not process a lot of values for a single item fast enough Created: 2015 Jan 09  Updated: 2017 May 30  Resolved: 2016 Jan 18

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Server (S)
Affects Version/s: 2.2.8
Fix Version/s: None

Type: Incident report Priority: Critical
Reporter: Alexey Pustovalov Assignee: Unassigned
Resolution: Duplicate Votes: 13
Labels: dbsyncer, history, log, performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File #1.png     PNG File #2.png     PNG File #3.png     PNG File #4.png     PNG File #5.png     Text File before.log     Text File cloned.log    
Issue Links:
Duplicate
duplicates ZBXNEXT-3071 History cache optimization Closed
is duplicated by ZBX-7725 Single history syncer process stuck i... Closed

 Description   

If an item receives a lot of values within a short time period, the Zabbix history syncers cannot process the data in time. Eventually the history cache fills up, and Zabbix stops accepting new values and stops processing new items.



 Comments   
Comment by Aleksandrs Saveljevs [ 2015 Jan 09 ]

This issue seems to be the same as ZBX-6353. Apparently, the implemented solution is not efficient enough.

Comment by Christian Reizlein [ 2015 Mar 30 ]

I have experienced this issue several times, especially when an item of the LOG monitoring type starts sending a lot of values, i.e. more than 5000. Zabbix gets stuck trying to save those values to the database and does not flush the rest of the item value cache until it finishes, which creates and fires lots of "no data received for 5 minutes" triggers.
IMHO there should be real parallelism when storing those values, so that one item's backlog does not affect the others.

Comment by Filipe Paternot [ 2015 Apr 15 ]

Yes, roughly the same problem here, although a little different.
Currently there are not many items with such a low check interval (the lowest is 60 seconds), but I do have quite a few items and proxies.

What we've managed to diagnose:

  • Even with 16 db syncers, only one seems to work at times:
    zabbix   13403 13107  2 14:25 ?        00:00:55 /opt/projetos/zabbix/server/sbin/zabbix_server: history syncer #1 [synced 0 items in 0.000133 sec, idle 5 sec]
    zabbix   13404 13107  2 14:25 ?        00:00:49 /opt/projetos/zabbix/server/sbin/zabbix_server: history syncer #2 [synced 0 items in 0.000099 sec, idle 5 sec]
    zabbix   13405 13107  8 14:25 ?        00:02:50 /opt/projetos/zabbix/server/sbin/zabbix_server: history syncer #3 [synced 0 items in 0.000076 sec, idle 5 sec]
    zabbix   13406 13107  7 14:25 ?        00:02:38 /opt/projetos/zabbix/server/sbin/zabbix_server: history syncer #4 [synced 0 items in 0.000106 sec, idle 5 sec]
    zabbix   13407 13107 12 14:25 ?        00:04:26 /opt/projetos/zabbix/server/sbin/zabbix_server: history syncer #5 [synced 0 items in 0.000208 sec, idle 5 sec]
    zabbix   13408 13107 21 14:25 ?        00:07:27 /opt/projetos/zabbix/server/sbin/zabbix_server: history syncer #6 [synced 0 items in 0.000130 sec, idle 5 sec]
    zabbix   13409 13107 13 14:25 ?        00:04:46 /opt/projetos/zabbix/server/sbin/zabbix_server: history syncer #7 [synced 19040 items in 57.746061 sec, syncing history]
    zabbix   13410 13107  2 14:25 ?        00:00:53 /opt/projetos/zabbix/server/sbin/zabbix_server: history syncer #8 [synced 0 items in 0.000104 sec, idle 5 sec]
    zabbix   13411 13107  2 14:25 ?        00:00:57 /opt/projetos/zabbix/server/sbin/zabbix_server: history syncer #9 [synced 0 items in 0.000113 sec, idle 5 sec]
    zabbix   13412 13107  2 14:25 ?        00:00:52 /opt/projetos/zabbix/server/sbin/zabbix_server: history syncer #10 [synced 0 items in 0.000086 sec, idle 5 sec]
    zabbix   13413 13107  2 14:25 ?        00:00:55 /opt/projetos/zabbix/server/sbin/zabbix_server: history syncer #11 [synced 0 items in 0.000094 sec, idle 5 sec]
    zabbix   13414 13107  4 14:25 ?        00:01:23 /opt/projetos/zabbix/server/sbin/zabbix_server: history syncer #12 [synced 0 items in 0.000094 sec, idle 5 sec]
    zabbix   13416 13107  2 14:25 ?        00:00:46 /opt/projetos/zabbix/server/sbin/zabbix_server: history syncer #13 [synced 0 items in 0.000090 sec, idle 5 sec]
    zabbix   13417 13107  2 14:25 ?        00:00:49 /opt/projetos/zabbix/server/sbin/zabbix_server: history syncer #14 [synced 0 items in 0.000081 sec, idle 5 sec]
    zabbix   13418 13107 18 14:25 ?        00:06:27 /opt/projetos/zabbix/server/sbin/zabbix_server: history syncer #15 [synced 0 items in 0.000100 sec, idle 5 sec]
    zabbix   13419 13107  4 14:25 ?        00:01:24 /opt/projetos/zabbix/server/sbin/zabbix_server: history syncer #16 [synced 0 items in 0.000223 sec, idle 5 sec]
    
  • ... and yet sometimes all/many db syncers work at the same time.
    zabbix   13400 13107  0 14:25 ?        00:00:00 /opt/projetos/zabbix/server/sbin/zabbix_server: housekeeper [removing old history and trends]
    zabbix   13403 13107  2 14:25 ?        00:01:18 /opt/projetos/zabbix/server/sbin/zabbix_server: history syncer #1 [synced 3483 items in 13.558991 sec, idle 5 sec]
    zabbix   13404 13107  2 14:25 ?        00:01:18 /opt/projetos/zabbix/server/sbin/zabbix_server: history syncer #2 [synced 0 items in 0.000243 sec, syncing history]
    zabbix   13405 13107  7 14:25 ?        00:03:21 /opt/projetos/zabbix/server/sbin/zabbix_server: history syncer #3 [synced 0 items in 0.000365 sec, syncing history]
    zabbix   13406 13107  6 14:25 ?        00:03:01 /opt/projetos/zabbix/server/sbin/zabbix_server: history syncer #4 [synced 1024 items in 2.256199 sec, syncing histor ]
    zabbix   13407 13107 10 14:25 ?        00:04:49 /opt/projetos/zabbix/server/sbin/zabbix_server: history syncer #5 [synced 4333 items in 13.343527 sec, idle 5 sec]
    zabbix   13408 13107 17 14:25 ?        00:07:58 /opt/projetos/zabbix/server/sbin/zabbix_server: history syncer #6 [synced 373 items in 1.321960 sec, syncing history]
    zabbix   13409 13107 14 14:25 ?        00:06:33 /opt/projetos/zabbix/server/sbin/zabbix_server: history syncer #7 [synced 0 items in 0.000649 sec, syncing history]
    zabbix   13410 13107  2 14:25 ?        00:01:22 /opt/projetos/zabbix/server/sbin/zabbix_server: history syncer #8 [synced 362 items in 1.318714 sec, syncing history]
    zabbix   13411 13107  2 14:25 ?        00:01:17 /opt/projetos/zabbix/server/sbin/zabbix_server: history syncer #9 [synced 4982 items in 13.407591 sec, idle 5 sec]
    zabbix   13412 13107  2 14:25 ?        00:01:16 /opt/projetos/zabbix/server/sbin/zabbix_server: history syncer #10 [synced 8348 items in 22.643176 sec, syncing history]
    zabbix   13413 13107  5 14:25 ?        00:02:45 /opt/projetos/zabbix/server/sbin/zabbix_server: history syncer #11 [synced 840 items in 2.295558 sec, syncing histor ]
    zabbix   13414 13107  8 14:25 ?        00:04:03 /opt/projetos/zabbix/server/sbin/zabbix_server: history syncer #12 [synced 0 items in 0.000173 sec, syncing history]
    zabbix   13416 13107  2 14:25 ?        00:01:12 /opt/projetos/zabbix/server/sbin/zabbix_server: history syncer #13 [synced 0 items in 0.000133 sec, syncing history]
    zabbix   13417 13107  2 14:25 ?        00:01:17 /opt/projetos/zabbix/server/sbin/zabbix_server: history syncer #14 [synced 4812 items in 13.554075 sec, idle 5 sec]
    zabbix   13418 13107 14 14:25 ?        00:06:53 /opt/projetos/zabbix/server/sbin/zabbix_server: history syncer #15 [synced 8629 items in 22.630109 sec, syncing history]
    zabbix   13419 13107 12 14:25 ?        00:05:35 /opt/projetos/zabbix/server/sbin/zabbix_server: history syncer #16 [synced 32488 items in 261.816248 sec, syncing history]
    

The solution we came up with was to split the hot data from the cold data: we created a new pair of MySQL servers that keep the trends and history for the longer retention period.

The env before the issue (all data: 7 days history, 15 months of cold data/trends):

  • usual nvps: 1.5k
  • max nvps (during recovery of db/zabbix server): 30k
  • db: 8 SAS 10k disks (raid 10)
  • db size: ~400G
  • db ram: 64G
  • history: 7 days, trends 15 months (partitions each month)
  • zabbix history cache: 2G
  • db syncers: 16
  • proxy pollers: 5
  • proxies: 18
  • items: 480k
  • triggers: 150k
  • initially Zabbix 2.4.3; upgraded to 2.4.4 during the problem

The CLONED env (fresh history/trends tables):

  • usual nvps: 1.5k
  • max nvps (during recovery of db/zabbix server): 7k
  • db: 6 SAS 10k disks (raid 10)
  • db size: ~20G
  • db ram: 32G
  • zabbix history cache: 2G
  • db syncers: 2
  • proxy pollers: 3
  • proxies: 18
  • items: 480k
  • triggers: 150k
  • zabbix 2.4.4

The env AFTER the issue (all data: 7 days history, 1 month of cold data/trends):

  • usual nvps: 1.5k
  • max nvps (during recovery of db/zabbix server): 30k
  • db: 6 SAS 15k disks (raid 10)
  • db size: ~80G
  • db ram: 64G
  • history: 7+1 days (partitions for each day), trends 1+1 months (partitions each month)
  • zabbix history cache: 2G
  • db syncers: 6
  • proxy pollers: 3
  • proxies: 18
  • items: 480k
  • triggers: 150k
  • zabbix 2.4.4

I've attached before.log and cloned.log; they both show strace, ltrace and lsof output plus a few slices of zabbix_server.log.
before.log shows the environment with the ~400G MySQL database, and cloned.log shows the smaller machine.

Comment by Oleksii Zagorskyi [ 2015 Apr 16 ]

Filipe, your detailed stats are impressive, thanks!
But after reading them I did not really get what the key change was (regarding db syncer behavior) and what the key improvement after the change was.

Comment by Filipe Paternot [ 2015 Apr 16 ]

Hello Oleksii,
sorry I wasn't as clear as I wished.

The initial problem was that I had a huge queue on the Zabbix server (the proxies were delivering data to the server fast enough, as usual), and we also had thousands of nodata() triggers firing. This is where we started our diagnosis.

Our first step was to look at the DB. It was just fine: no queries ran for more than 30 seconds, and the longest ones came from the frontend (tr_status.php). The machine has 24 processors and the load was the usual, about 3.0. One curious thing we noticed afterwards was that the history cache had started to fill gradually about 36 hours before the issue (I have now added two screenshots of this).

We found that most queries that zabbix_server ran at that time were from this block of code (dbcache.c):

	if (0 != ids_num)
	{
		sql_offset = 0;
		zbx_snprintf_alloc(&sql, &sql_alloc, &sql_offset,
				"select distinct itemid"
				" from %s"
				" where clock>=%d and",
				table_name, clock);

		DBadd_condition_alloc(&sql, &sql_alloc, &sql_offset, "itemid", ids, ids_num);

		result = DBselect("%s", sql);
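
For reference, the SQL that this block builds looks roughly like the following (table name, clock value and itemids are illustrative, not taken from our logs):

	select distinct itemid
	from trends_uint
	where clock>=1429185600
	and itemid in (23296,23297,23298);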

Now, back to your question:

  • The db syncers started to lag; we are not sure exactly why, but apparently it is related to the history/trends size.
  • After the change (reducing the trends retention and, as a consequence, the db size), the db syncers started to work normally again.
  • The key change (both for the rebuilt environment and for the cloned database) was that the db size (dominated by the history/trends tables) was reduced, and the db syncers worked again.

This (including the query it runs on trends and trends_uint) led us to conclude that, for some reason, the db syncer sometimes needs information from trends. As we drastically reduced its size, the queries started to run fast enough again.

Comment by Oleksii Zagorskyi [ 2015 Apr 16 ]

Just in case: those two screenshots are identical, probably by mistake.
To analyze such cases one also needs to see the NVPS and the db syncer load for the same period of time.
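
For reference, those numbers can be charted with the server's standard internal items; the relevant keys are, roughly:

	zabbix[wcache,values]                      - values processed by the server (NVPS when stored as delta per second)
	zabbix[process,history syncer,avg,busy]    - average busyness of the history syncer processes, %
	zabbix[wcache,history,pfree]               - free space in the history write cache, %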

On the screenshot, looking at the history cache "saw", it shows strong peaks at 5-minute intervals.
So it does not look like it filled, quote, "gradually".
It drops to ~90% free momentarily but then recovers to 100% free very quickly, within 1-2 minutes (the 2G history cache mentioned earlier).

It could be, for example, some log* item(s) with a 5-minute update interval.
But then a similar "saw" would be expected for the text cache.
So not everything fits the picture expected for log* monitoring.

What is "DataSenderFrequency" for proxies ?
Default 1 second ?

Of course the historical data size (history first of all) may have an impact on the db syncers, as they work with history to calculate triggers and may query the database for history (through the value cache, though).
As for trends size, I don't think it can have any impact on the syncers at all.

Note that there are value cache improvements (ZBX-9470, ZBXNEXT-2462) coming in Zabbix 3.0.
Not sure how those could be related.

The code block you mentioned is from the "DCflush_trends" function (purpose: flush trends to the database).
Just in case: as far as I know, the very first trends calculation (after server start) is a harder task than all the following ones (mentioned in ZBX-8557).
I don't see how it can be related to the db syncer behavior (the current issue) at all.

Comment by Filipe Paternot [ 2015 Apr 16 ]

My bad, I have just uploaded the correct #2, taken a day after the cache started that behaviour.

Since we didn't save screenshots of NVPS and internal process busyness at the time, we can only get them from trends. I attached them here as well. Note that some data was lost during restarts and other attempts to restore performance.

Our proxies work in passive mode, not active, so we have ProxyDataFrequency=1 (about the same behavior, only limited by StartProxyPollers=3).

In this case, we observed that the history* tables did not have that much impact on the db syncers, since we have as much data now as we had then and the problem does not occur (it actually seems a lot faster). In practical terms, we only reduced the size of trends (and partitioned the history* tables by day).
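
For reference, daily partitioning of a history table on a MySQL backend is usually done with RANGE partitions on the clock column; a minimal sketch (table choice, partition names and dates are illustrative only):

	ALTER TABLE history_uint PARTITION BY RANGE (clock) (
	    PARTITION p2015_04_15 VALUES LESS THAN (UNIX_TIMESTAMP('2015-04-16 00:00:00')),
	    PARTITION p2015_04_16 VALUES LESS THAN (UNIX_TIMESTAMP('2015-04-17 00:00:00'))
	);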

Regarding ZBX-9470, I guess I'd need to test that and check how it could help, but I'm not sure by how much. Screenshot #5 shows that we have way more hits than misses on the value cache.

While the history syncer processes were 100% busy, we mostly saw that query from "DCflush_trends" (select distinct itemid from trends/trends_uint where ...) hitting the database, and no inserts into history or trends. So you are probably right about it being heavier when zabbix_server is (re)started. I still wonder why we need to perform selects on trends in order to insert history, and why it blocks the db syncer (or does it block the value cache?).

Comment by Oleksii Zagorskyi [ 2015 May 12 ]

The current server behavior can cause event creation like that discussed in ZBX-9432.
In short: a flood of values for 3 log* items blocked the history write cache for 30 minutes and caused a lot of unrelated "XX is unreachable for 10 minutes" triggers (agent.ping items) to fire.
Those "XX is unreachable for 10 minutes" triggers created 2-3 consecutive OK/PROBLEM events, even with only a 1- or 2-minute interval between events of the same type.

Comment by Christian Reizlein [ 2015 May 12 ]

That is exactly the problem we have, and sometimes it goes beyond 30 minutes, depending on the size of the flood. Since we use proxies to collect all our data, the data gets stuck lagging on the proxy, and all items are missing data for 30 minutes or more, which causes havoc with all the triggers.

As a remediation, or sort of prevention until this is truly fixed (and data is sent/processed in parallel), I had to write an external script that checks the pending records on the proxy and, if they start piling up, cleans the offending log entries and sends an alert.

Comment by Filipe Paternot [ 2015 May 12 ]

Christian, can you elaborate on that, or share the script?

Oleksii, I'm not sure what you mean by "3 log* items". Three items of the log type?

In our current env, we have a pretty small (negligible) number of log items and a bunch of unsigned ones. All of the triggers that we saw fire had the nodata() function in their expressions, so we'd guess it has something to do with the history syncer not keeping up at the necessary speed. Also, the events were fired just fine, consistent with this behavior.

Comment by Christian Reizlein [ 2015 May 12 ]

Filipe, I made a simple PHP script that just sends me an email, but you can go from there. The important part is the query on the proxy database and, as I use SQLite there, below is the query I run:

SQL Query
SELECT count(*) AS Registros, itemid FROM proxy_history WHERE proxy_history.id>(SELECT nextid FROM ids WHERE field_name='history_lastid') GROUP BY itemid HAVING Registros > 500 ORDER BY 1 DESC LIMIT 50;
-- Note the "> 500": it lists items with more than 500 records left behind in the local database, so you can act from there.
-- Of course you can change that 500 to anything you wish.

Hope that helps; I can attach the whole script if you prefer.

Comment by Oleksii Zagorskyi [ 2015 May 12 ]

Filipe, yes, I meant three items for log monitoring (log, logrt or eventlog - doesn't matter).

Here is an example SQL query to check how many values the Zabbix server received per item in a given period; it is applicable to the Zabbix server DB only:

select count(*), itemid from history_log WHERE clock between unix_timestamp('2015-05-09 06:00') and unix_timestamp('2015-05-09 07:00') GROUP BY itemid ORDER BY count(*) DESC LIMIT 20;

This way you can check whether you indeed had a flood from a single (or several) log items within a short period of time.

Comment by Oleksii Zagorskyi [ 2015 May 21 ]

The current issue was also mentioned in ZBXNEXT-2823 - trouble with bulk data sending by zabbix_sender.

Comment by Oleksii Zagorskyi [ 2015 May 29 ]

I'm not entirely sure this will help much, but still ...

I think it would be a good approach, for all Zabbix instances that may be affected by the current issue, to limit active agents so that they do not send many values for a single item too quickly.
Of course, it would not help for "zabbix trapper" items which receive textual values.

Not many users understand what the item update interval means, and many set it to 60-300 seconds, which of course is not very good.
Yes, a higher update interval may save some traffic, but that is a very tiny benefit; I'd ignore it entirely.
So for the update interval of log* items I'd recommend 2-5 seconds.
I'm not sure about the increased CPU usage on the agent host, but I don't think it is an issue on modern hardware; just keep it in mind.
But with a smaller update interval, log* records are delivered to the server faster - and that is very important for monitoring! (IMO)

Next are the agent configuration parameters: BufferSend, BufferSize and MaxLinesPerSecond (the latter may be overridden in the item key parameter).
https://www.zabbix.com/documentation/2.4/manual/appendix/config/zabbix_agentd
As avoiding the current issue is the main goal, I ignore all the possible minor drawbacks (mainly slightly increased traffic) of decreasing these 3 parameters.

I suggest setting BufferSend=1, BufferSize=2 (1-5) and MaxLinesPerSecond=3 (1-5); see the illustrative config fragment below.
These should not affect normal log monitoring at all - I mean logs which only occasionally write a couple of lines.
But I suppose they should noticeably prevent the current issue from appearing; it has already hit many Zabbix users as a disaster, with the Zabbix server not functioning at all for many hours.
I suppose that frequent, small data portions (with only a few values per item) will let the history syncers process the data flow much better. The health of the Zabbix server can only benefit.
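
For illustration, the corresponding zabbix_agentd.conf fragment would look like this (the values follow the suggestion above; adjust them to your environment):

	BufferSend=1          # keep collected values in the buffer for at most 1 second before sending
	BufferSize=2          # send as soon as 2 values have been collected (maximum buffer size)
	MaxLinesPerSecond=3   # send at most 3 new log lines per second for log/logrt/eventlog active checks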

Badly behaved logs (which may write 10-100 lines every second for a long time) are unpredictable, and it is usually not worth delivering them to the Zabbix server very quickly at all.

Correct me please if I'm wrong.

Comment by Marc [ 2015 May 29 ]

ZBXNEXT-1843 and ZBXNEXT-2430 might be distantly related.

Comment by Alexander Vladishev [ 2015 Sep 16 ]

Related issue: ZBXNEXT-2960

Comment by Oleksii Zagorskyi [ 2015 Sep 16 ]

Not related, but when working on this issue take a look to ZBXNEXT-2617 please.
Maybe it could be taken into account.

Comment by Marc [ 2015 Nov 26 ]

Can a relation to ZBXNEXT-3051 be ruled out?

Comment by Andris Zeila [ 2015 Dec 14 ]

This issue should be solved by the history cache redesign in ZBXNEXT-3071.

Comment by Oleksii Zagorskyi [ 2016 Jan 18 ]

Andris, can we close this one as a duplicate?

Comment by Andris Zeila [ 2016 Jan 18 ]

Looks like it.

Comment by Oleksii Zagorskyi [ 2016 Jan 18 ]

Closed as a duplicate of ZBXNEXT-3071 (i.e. fixed in version 3.0).
