[ZBX-19204] Large trend cache breaks history sync Created: 2021 Apr 02  Updated: 2025 Mar 04  Resolved: 2024 Feb 01

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Problem report Priority: Major
Reporter: Aaron Whiteman Assignee: Vladislavs Sokurenko
Resolution: Duplicate Votes: 20
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Causes
Duplicate
duplicates ZBX-23064 Problem with restart server - syncing... Closed
Sub-task
part of ZBX-25651 Add an opportunity to disable trend s... Closed

 Description   

Environment:

Zabbix Server 5.0.10, CentOS Linux, 2 cores, 16 GB RAM.

Dedicated DB: PostgreSQL 12, TimescaleDB 2, 1-month shards as per the Zabbix recommendation (see ZBX-16347), 8 cores, 125 GB RAM. The database is approximately 1.7 TB on disk.

Dedicated Front End server, Zabbix 5.0.10, nginx.

Server monitors minimal items (primarily itself and the database), all primary monitoring performed by a set of 9 proxies.

Steps to reproduce:

  1. Deploy Zabbix to a relatively large environment. Our Zabbix implementation has 1126208 items, with an NVPS of just over 2600. We see approximately 23142 "trends" rows per hour and approximately 750000 "trends_uint" rows per hour (with an overall average of approximately 45 history data points per item).
  2. Wait for trends to be flushed (half-past the hour is a good time)
  3. Shut down Zabbix. This will be slow because it will write out the current-hour trends.
  4. Verify that you have trend data associated with the partial hour:

     select 'trends', count(itemid), sum(num), TO_TIMESTAMP(clock), clock
       from trends
      where clock >= extract(epoch from now() - INTERVAL '4 HOUR')::INTEGER
      group by clock
     UNION
     select 'trends_uint', count(itemid), sum(num), TO_TIMESTAMP(clock), clock
       from trends_uint
      where clock >= extract(epoch from now() - INTERVAL '4 HOUR')::INTEGER
      group by clock
      order by 1, 4 desc;
  5. Start Zabbix.
  6. Wait for the top of the next hour
  7. Watch your Zabbix implementation stop inserting history data.
  8. Re-run the query above and note how the sum value slowly increases as trends are updated.

Result:

Zabbix will use all history syncers to store trend data until the trend cache is flushed.

For particularly large databases using the recommended TimescaleDB setting of one shard per month (see ZBX-16347), the select() query that determines whether an item is already in the database takes 6 seconds to return, for each call. The individual update queries can also be significant (exceeding 60 seconds). The end result is that the sync can take many minutes: last night, I gave up at 35 minutes and forcibly killed the Zabbix server, judging it better to lose trends than more history. The 6 PM trend write had not yet completed, and my estimate was that it was only one-third complete.
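To make the bottleneck concrete, each hourly flush follows roughly the select-then-update pattern below (a simplified sketch of the behaviour described above, not the server's actual SQL; the itemids, counts, and values are made up):

-- Sketch of the per-hour trend flush pattern (illustrative only).
-- 1. Check which cached items already have a trend row for this hour;
--    on a large partitioned table this is the select() that takes ~6 s per call.
select itemid, num, value_min, value_avg, value_max
  from trends
 where clock = 1617372000                    -- hypothetical hour boundary
   and itemid in (10001, 10002 /* , ...thousands more */);

-- 2. Merge the cached aggregates into each existing row; here a cached batch
--    of 60 values with average 0.50 (made-up numbers) is folded in.
--    Every right-hand side sees the pre-update values, so the order is safe.
update trends
   set value_min = least(value_min, 0.01),
       value_avg = (value_avg * num + 0.50 * 60) / (num + 60),
       value_max = greatest(value_max, 0.99),
       num = num + 60
 where itemid = 10001 and clock = 1617372000;

-- 3. Insert rows for items that had no trend row for this hour yet.
insert into trends (itemid, clock, num, value_min, value_avg, value_max)
values (10002, 1617372000, 60, 0.02, 0.50, 0.98);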

Expected:
Trend writing does not impact history writing

 

 

Possible solutions (there are likely others, but these are possibilities):

  • Use a dedicated trend-writing daemon rather than the history syncer, or limit trend writes to a subset of the available syncer processes.
  • Use prepare/exec semantics so the SQL server only needs to prepare each query once (note: this would require Zabbix to stop sending multiple statements in a single query), and use UPSERT semantics to eliminate the select() query that checks for existing trend data (see the sketch after this list).
  • Make trend inserts smaller and use a queue model so that history data is not blocked by large selects and inserts during the trend transaction.
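For the second bullet, PostgreSQL supports both server-side prepared statements and UPSERT. A minimal sketch of what that could look like (the statement name, parameter list, and values are hypothetical, not actual Zabbix code):

-- Prepared once per session; each flush then only sends parameters.
PREPARE trend_upsert (bigint, integer, integer, double precision,
                      double precision, double precision) AS
INSERT INTO trends (itemid, clock, num, value_min, value_avg, value_max)
VALUES ($1, $2, $3, $4, $5, $6)
ON CONFLICT (itemid, clock) DO UPDATE
   SET value_min = LEAST(trends.value_min, EXCLUDED.value_min),
       value_avg = (trends.value_avg * trends.num
                    + EXCLUDED.value_avg * EXCLUDED.num)
                   / (trends.num + EXCLUDED.num),
       value_max = GREATEST(trends.value_max, EXCLUDED.value_max),
       num = trends.num + EXCLUDED.num;

-- Example execution with made-up values:
EXECUTE trend_upsert (10001, 1617372000, 60, 0.01, 0.50, 0.99);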


 Comments   
Comment by Tomáš Heřmánek [ 2021 May 18 ]

Hi, we have the same issue with Zabbix 5.2. Is this improved in Zabbix 5.4?

https://support.zabbix.com/browse/ZBXNEXT-782

https://support.zabbix.com/browse/ZBXNEXT-6280

https://support.zabbix.com/browse/ZBXNEXT-6503

Thank you,

Tom

Comment by Aaron Whiteman [ 2021 Aug 06 ]

Since I updated the OS on my zabbix-server last night, I got to experience this joy again.

 

Currently, the only thing I can do is wait for Zabbix to go catatonic, then kill -9 the service to discard the trend cache, and start again. This is certainly not an expected feature of a monitoring tool that claims to be enterprise class.

 

For reference, I have 6 DB syncers (much testing shows this is the optimal count "most of the time") to support 10 proxies, monitoring 7308 enabled hosts (113546 enabled items), with a required performance of 2589.51 NVPS. Each of those syncers can push 70000 items in about 10 seconds (roughly 7000 values per second) when the server comes online, as long as there are no trends to sync.

 

But if I have to sync existing trend data, performance falls to about 700 synced items per 70 seconds (roughly 10 values per second). That is so slow that the syncers have no possible way to catch up.

 

Unfortunately, the three issues resolved in 5.4 above don't really help this case, because the database continues to have excellent performance for other queries; the issue is the bottleneck in how the history syncers specifically handle trend writes.

Another option may be to stop using the trend cache at all. The following is an example of how I populated trends for a one-hour period after I had to restart the database (PostgreSQL 13, with TimescaleDB). I use the ON CONFLICT clause here because Zabbix may have partially written data that I need to replace.

insert into trends (itemid, clock, num, value_min, value_avg, value_max)
select
    itemid
    ,clock - (clock % 3600) as clock    -- truncate to the start of the hour
    ,count(*) as num
    ,min(value) as value_min
    ,avg(value) as value_avg
    ,max(value) as value_max
from history
where
    history.clock >= (extract(epoch from '2021-08-06 06:00:00-07'::timestamptz))::bigint
    and history.clock < (extract(epoch from '2021-08-06 07:00:00-07'::timestamptz))::bigint
group by itemid, 2
ON CONFLICT (itemid, clock) DO UPDATE
    SET num = EXCLUDED.num,
        value_min = EXCLUDED.value_min,
        value_avg = EXCLUDED.value_avg,
        value_max = EXCLUDED.value_max;

insert into trends_uint (itemid, clock, num, value_min, value_avg, value_max)
select
    itemid
    ,clock - (clock % 3600) as clock    -- truncate to the start of the hour
    ,count(*) as num
    ,min(value) as value_min
    ,avg(value) as value_avg
    ,max(value) as value_max
from history_uint
where
    history_uint.clock >= (extract(epoch from '2021-08-06 06:00:00-07'::timestamptz))::bigint
    and history_uint.clock < (extract(epoch from '2021-08-06 07:00:00-07'::timestamptz))::bigint
group by itemid, 2
ON CONFLICT (itemid, clock) DO UPDATE
    SET num = EXCLUDED.num,
        value_min = EXCLUDED.value_min,
        value_avg = EXCLUDED.value_avg,
        value_max = EXCLUDED.value_max;
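A note on this workaround: as far as I know, Zabbix keeps trends only for numeric item types, so the two statements above (history into trends, history_uint into trends_uint) cover all trend data; string, log, and text history have no trend tables. It is also safest to run them while the server is stopped, so the history syncers do not write the same hour concurrently.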

Comment by bunkzilla [ 2021 Dec 10 ]

Having a similar issue in 5.4. It brought ingestion to a crawl and set off a trigger storm due to triggers set up with nodata(). I heard there is perhaps some magic about when one can stop Zabbix and then start it, but ideally I'd like to be able to restart Zabbix at any time and not have the top of the hour cripple everything and cause false triggers.

Comment by GOID [ 2023 Apr 11 ]

Same behavior on Zabbix 6 (PostgreSQL + TimescaleDB), and on the older Zabbix 5 too.

Number of hosts (enabled/disabled):                  1847 (1724 / 123)
Number of templates:                                 323
Number of items (enabled/disabled/not supported):    332317 (303556 / 25964 / 2797)
Number of triggers (enabled/disabled [problem/ok]):  297151 (255789 / 41362 [990 / 254799])
Number of users (online):                            155 (20)
Required server performance, new values per second:  3177.43

For now the only solution is to let all the queries finish. Until then, monitoring is overloaded by triggers for unreachable hosts.

Comment by Sergey [ 2023 May 25 ]

6.2.3 too

Comment by Tomáš Heřmánek [ 2023 Jun 27 ]

Maybe vso can give us some help or a clue. I think Zabbix first needs a parallel-writing history syncer. Definitely not an easy task to solve.

Comment by GOID [ 2023 Jul 26 ]

Nothing changed in 6.0.19: after a restart, the "updates" at the beginning of the next hour take 20 minutes.

Comment by Vladislavs Sokurenko [ 2024 Jan 17 ]

Could you please check if ZBX-23064 or ZBX-22126 help?

Comment by Vladislavs Sokurenko [ 2024 Feb 01 ]

The issue was fixed under ZBX-23064; if the issue still occurs, please feel free to reopen the ticket.

Comment by Vladislavs Sokurenko [ 2025 Mar 03 ]

Please check if ZBX-25651 helps.
