ZBX-19204

Large trend cache breaks history sync



    • Type: Problem report
    • Resolution: Duplicate
    • Priority: Major



      Zabbix Server 5.0.10, CentOS Linux, 2 cores, 16 GB RAM

      Dedicated DB: PostgreSQL 12, TimescaleDB 2, one-month shards as per the Zabbix recommendation (see ZBX-16347), 8 cores, 125 GB RAM. The database is approximately 1.7 TB on disk.

      Dedicated Front End server, Zabbix 5.0.10, nginx.

      The server itself monitors very few items (primarily itself and the database); all primary monitoring is performed by a set of 9 proxies.

      Steps to reproduce:

      1. Deploy Zabbix to a relatively large environment. Our Zabbix implementation has 1126208 items, with an NVPS (new values per second) of just over 2600. We see approximately 23142 "trends" rows per hour and approximately 750000 "trends_uint" rows per hour (with an overall average of approximately 45 history data points per item).
      2. Wait for trends to be flushed (half past the hour is a good time).
      3. Shut down Zabbix. This will be slow because it will write out the current-hour trends.
      4. Verify that you have trend data associated with the partial hour:
        1. select 'trends',count(itemid),sum(num),TO_TIMESTAMP(clock),clock
          from trends
          where clock >= extract(epoch from now() - INTERVAL '4 HOUR')::INTEGER
          group by clock
          union all
          select 'trends_uint',count(itemid),sum(num),TO_TIMESTAMP(clock),clock
          from trends_uint
          where clock >= extract(epoch from now() - INTERVAL '4 HOUR')::INTEGER
          group by clock
          order by 1,4 desc;
      5. Start Zabbix.
      6. Wait for the top of the next hour.
      7. Watch your Zabbix implementation stop inserting history data.
      8. Re-run the query above and note how the sum(num) value slowly increases as trends are updated.


      Result: Zabbix will use all history syncers to store trend data until the trend data is flushed.

      For particularly large databases using the recommended TimescaleDB setting of one shard per month (see ZBX-16347), the select() query that determines whether an item already has a trend row in the database takes 6 seconds to return, for each call. The individual update queries can also be slow (exceeding 60 seconds). The end result is that the sync can take tens of minutes: last night, I gave up at 35 minutes and forcibly killed the Zabbix server, judging that it is better to lose trends than more history. At that point the 6 PM trend write had not yet completed, and my estimate was that it was only one-third complete.

      Expected: Trend writing does not impact history writing.



      Possible solutions (there are likely others, but these are possibilities):

      • Use a dedicated trend writing daemon, rather than the history syncer, or limit trend writes to a subset of the syncer processes available.
      • Use prepare/execute semantics so the SQL server only needs to prepare the query once (note: this would require you to stop sending multiple statements in a single query). Use UPSERT syntax to eliminate the select() query that checks for existing trend data.
      • Make trend inserts smaller and use a queue model so that history data is not blocked by large select/insert operations during the trend transaction.
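
      The UPSERT suggestion above can be sketched as follows. This is only an illustration, not Zabbix's implementation. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, a simplified trends_uint table assumed to have the columns (itemid, clock, num, value_min, value_avg, value_max) with a primary key on (itemid, clock), and a hypothetical helper named flush_trends. The merge rule (add the counts, weight the averages, keep the smallest minimum and the largest maximum) is likewise an assumption about how a partial-hour row should be combined with new data.

```python
import sqlite3

# Simplified, assumed stand-in for the trends_uint table (not the real Zabbix DDL).
SCHEMA = """
CREATE TABLE trends_uint (
    itemid    INTEGER NOT NULL,
    clock     INTEGER NOT NULL,
    num       INTEGER NOT NULL,
    value_min INTEGER NOT NULL,
    value_avg INTEGER NOT NULL,
    value_max INTEGER NOT NULL,
    PRIMARY KEY (itemid, clock)
)
"""

# One parameterized statement: insert a new row, or merge into the existing
# row for the same (itemid, clock). No separate select() is needed.
# SQLite's two-argument MIN/MAX are scalar functions; the right-hand sides
# all see the row as it was before the update.
UPSERT = """
INSERT INTO trends_uint (itemid, clock, num, value_min, value_avg, value_max)
VALUES (?, ?, ?, ?, ?, ?)
ON CONFLICT (itemid, clock) DO UPDATE SET
    value_min = MIN(trends_uint.value_min, excluded.value_min),
    value_max = MAX(trends_uint.value_max, excluded.value_max),
    value_avg = (trends_uint.value_avg * trends_uint.num
                 + excluded.value_avg * excluded.num)
                / (trends_uint.num + excluded.num),
    num = trends_uint.num + excluded.num
"""


def flush_trends(conn, rows):
    """Hypothetical helper: write one batch of trend rows in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(UPSERT, rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    # Partial-hour row written at shutdown.
    flush_trends(conn, [(1, 3600, 2, 5, 10, 15)])
    # Same item and hour flushed again after restart, plus a brand-new item.
    flush_trends(conn, [(1, 3600, 2, 1, 20, 30), (2, 3600, 1, 7, 7, 7)])
    for row in conn.execute("SELECT * FROM trends_uint ORDER BY itemid"):
        print(row)
```

      On PostgreSQL the equivalent statement would use LEAST/GREATEST in place of SQLite's two-argument MIN/MAX. The design point is that each trend row costs one statement that can be prepared once, instead of a select() followed by an insert or an update.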





              vso Vladislavs Sokurenko
              wsuzabbixapw Aaron Whiteman