[ZBXNEXT-3089] history* primary keys Created: 2016 Jan 08  Updated: 2024 Apr 10  Resolved: 2021 Nov 16

Status: Closed
Project: ZABBIX FEATURE REQUESTS
Component/s: Server (S)
Affects Version/s: None
Fix Version/s: None

Type: Change Request Priority: Critical
Reporter: Mathew Assignee: Unassigned
Resolution: Fixed Votes: 23
Labels: db, index
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Causes
causes ZBXNEXT-6921 Use primary keys for historical table... Closed
Duplicate
is duplicated by ZBXNEXT-4212 Change zabbix history add DUPLICATE K... Open
Sub-task
depends on ZBXNEXT-2363 DB Schema for MariaDB with TokuDB Open
Team: Team A
Sprint: Sprint 26

 Description   

A primary key in the database can be shown to have significant performance benefits, particularly in a database engine such as InnoDB or TokuDB, which clusters rows based on this index.

However, it is not currently possible to use a primary key on the history* tables, because the server occasionally attempts to insert the same row twice.

A simple protection against this is to add the following in zbx_db_insert_execute:

#ifdef HAVE_MYSQL
	zbx_strcpy_alloc(&sql_command, &sql_command_alloc, &sql_command_offset, " ON DUPLICATE KEY UPDATE value=VALUES(value)");
#endif

This should have no negative effect on the current schema, while allowing experts to modify their schema, possibly adding a primary key.
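
For illustration, with that clause appended a server-generated bulk insert would look roughly as follows (values are made up). On the stock schema, which has no unique index on these columns, the suffix is a no-op; it only changes behaviour once a primary key is added:

-- Hypothetical example of the generated statement; itemid/clock/ns/value are illustrative.
INSERT INTO history (itemid,clock,ns,value)
VALUES (42120,1452240997,0,1.2345),
       (42121,1452240997,0,7.5000)
ON DUPLICATE KEY UPDATE value=VALUES(value);
-- Without a PRIMARY KEY or UNIQUE index the suffix has no effect; with
-- PRIMARY KEY (itemid,clock) a conflicting row updates value instead of
-- aborting the whole bulk insert.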



 Comments   
Comment by Mathew [ 2016 Jan 08 ]

Note, additional improvements could be made by:

  • Making this a parameter to the function for further control
  • Testing and support for other DB Engines
Comment by Aleksandrs Saveljevs [ 2016 Jan 08 ]

Could you please describe what primary key you would like to add, what benefit it is expected to provide, and when the server occasionally attempts to insert the same row twice?

Comment by Mathew [ 2016 Jan 08 ]

We have modified our history* tables to have the following schema

CREATE TABLE `history` (
  `itemid` bigint(20) UNSIGNED NOT NULL,
  `clock` int(11) NOT NULL DEFAULT '0',
  `value` double(16,4) NOT NULL DEFAULT '0.0000',
  `ns` int(11) NOT NULL DEFAULT '0'
) ENGINE=TokuDB DEFAULT CHARSET=latin1
PARTITION BY RANGE (clock)
(
...
PARTITION p20160111 VALUES LESS THAN (1452470400) ENGINE=TokuDB
);
ALTER TABLE `history`
  ADD PRIMARY KEY (`itemid`,`clock`);

With this schema we are able to get roughly three times the performance we were getting. Disk read IOPS are significantly reduced, which is extremely important on the hardware we are using (a networked SAN optimized for write-once, read-rarely workloads, with an SSD cache). In particular, it no longer takes ~2 minutes to render a complex graph over 12/24 hours for an item updating every 15 seconds. The queries to render this graph now complete in 1-6 seconds (depending on SSD cache / age of data), or about 0.5-1 s uncached per item in the graph over a large span.

I am unsure when it happens, however I have observed (at a rate of ~10/day on our current test server) errors like "Duplicate entry '42120-1452240997' for key 'PRIMARY' [insert into history_uint ..."

Given that the most frequent item is one per second, this is obviously nonsense, but regardless, the correct behaviour would be to ignore the duplicate entry and insert the rest. The current behaviour, however, is to abandon all values in the bulk insert because of the single (or few) duplicate values. The fix is either "INSERT IGNORE" or, more correctly (since it updates the value with the later version), "INSERT ... ON DUPLICATE KEY UPDATE".
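
To make the difference concrete, a rough sketch (made-up values, assuming PRIMARY KEY (itemid,clock)):

-- INSERT IGNORE keeps the first value written for a given (itemid,clock);
-- the colliding row is silently skipped.
INSERT IGNORE INTO history_uint (itemid,clock,ns,value)
VALUES (42120,1452240997,0,10),(42120,1452240997,500,11);

-- INSERT ... ON DUPLICATE KEY UPDATE keeps the last value written;
-- the colliding row overwrites the earlier one.
INSERT INTO history_uint (itemid,clock,ns,value)
VALUES (42120,1452240997,0,10),(42120,1452240997,500,11)
ON DUPLICATE KEY UPDATE value=VALUES(value);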

Now this setup is not likely to be used by everyone; it's our attempt to scale Zabbix up to ~1-2 TB / 60 days of historical data (size including trends for 2 years) while keeping the cost reasonable.

Comment by Aleksandrs Saveljevs [ 2016 Jan 08 ]

The "history" tables have "ns" field, which stands for "nanoseconds". So if two "clock"s are identical, the "ns" field is used to distinguish the timestamps. This is especially useful for log files. So ignoring all values with the same "clock" except one is incorrect.

In any case, partitioning is not officially supported by Zabbix, so you might wish to refer to https://www.zabbix.org/wiki/Getting_help for community help on that. It is proposed to close this request as "Won't fix".

Comment by Aleksandrs Saveljevs [ 2016 Jan 08 ]

Meanwhile, you might wish to vote on ZBXNEXT-806 or ZBXNEXT-714.

Comment by Mathew [ 2016 Jan 08 ]

a) Partitioning is irrelevant to the feature suggestion; the ticket is about PRIMARY KEY support. But FYI, despite the lack of official support, I don't know anyone running Zabbix without partitioning.
b) Log files are not responsible for the duplicate primary key messages; we do not need or use that functionality (honestly, there are better solutions out there for that).
c) ns is largely irrelevant to monitors, as the lowest update interval is 1/s; regardless, this patch would still be recommended if ns were included in the PK (it's not in our schema, as it's not needed for our use case).

This patch ensures graceful handling of duplicate key errors. Granted, these will not happen with the current schema, but such modifications are within the conceivable realm of what those needing to scale will perform, and there is little to no cost to ensuring graceful handling.

This is not a request to change the current schema; anyone running Zabbix at scale will have their own customizations. Regardless of whether you provide a TokuDB (or any other high-scale database) schema or a partitioning schema, those with the need will either develop it in house or hire someone to do it for them.

Off-topic: we regularly collaborate with the two other companies that have large Zabbix installations; both run partitioning, one runs TokuDB and the other has made a very significant hardware investment. I have also helped a few individuals along the way; normally the first thing done is to disable the housekeeper and enable partitioning.

Comment by Aleksandrs Saveljevs [ 2016 Jan 08 ]

The current schema already has an index for "itemid, clock" in all history tables. Could you please describe the benefits of turning that (officially, together with "ns") into a primary key?

Comment by Mathew [ 2016 Jan 08 ]

Significantly reduced read IOPS on certain database engines, leading to better read performance. Write performance is unknown (and not something we care about in our case), but probably improved as well, depending on the overhead of mid-tree inserts into the clustered index.

Our benchmarks are with TokuDB, however I would expect the same (if not greater) improvements with InnoDB, which also clusters on its primary key. I am not aware of any improvements that would result on MyISAM or other DB platforms.

Clustered indexes store the row together with the index data, saving a disk seek. Like a tree:

[col1{m nodes}]->[col2{m nodes}]->[leaf node: row]

instead of

[col1{m nodes}]->[col2{m nodes}]->[leaf node: row* pointer]

Extremely simplified example (for the common query of the form itemid=? AND clock < ? AND clock > ?):

No PK: Lookup itemid: ? -> perform range query for clock between ? and ? -> foreach index result (pointer to row) -> seek to pointer to row (return)
With PK: Lookup itemid: ? -> perform range query for clock between ? and ? -> foreach index result, return

I hope that makes some sense; it's easier to draw. Here's a Stack Overflow answer that's mostly correct: http://stackoverflow.com/questions/1251636/what-do-clustered-and-non-clustered-index-actually-mean

On a rotational disk (particularly a slow one), significantly more reads and seeks are performed without a PK. This is accurate for MySQL; I see no reason it wouldn't be for others.

Note: TokuDB specifically does support additional non-unique clustered indexes; we did not, however, see much improvement from using this feature (only a marginal ~5-10%). That is engine-specific, though.
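
For anyone wanting to check this against their own data, the access path can be compared with EXPLAIN on the typical graph query (itemid and time range are placeholders):

-- Typical frontend query when rendering a graph; substitute a real itemid and range.
EXPLAIN
SELECT itemid, clock, ns, value
FROM history_uint
WHERE itemid = 42120
  AND clock > 1452240000 AND clock < 1452326400;
-- With the stock schema this is served by the non-unique (itemid,clock) index
-- plus a row lookup per match; with PRIMARY KEY (itemid,clock) on a clustering
-- engine (InnoDB/TokuDB) the range scan reads the rows straight from the leaf
-- pages, so far fewer pages and seeks are needed for the same result.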

Comment by Stefan Priebe [ 2016 Oct 29 ]

Dear Mathew,

where did you add this:
#ifdef HAVE_MYSQL
zbx_strcpy_alloc(&sql_command, &sql_command_alloc, &sql_command_offset, " ON DUPLICATE KEY UPDATE value=VALUES(value)");
#endif

In file src/libs/zbxdbhigh/db.c, but at which line in the function?

Comment by Stefan Priebe [ 2016 Oct 29 ]

I'm adapting this change, but is this really everything you changed? What about the DB overflow function? How did you ensure that the field value exists? Why did you update only value and not all fields?

Comment by Andrey Denisov [ 2017 Apr 02 ]

First of all, Mathew, thank you very much! Your post is very useful!

Before switching to a PK (itemid, clock) on the history and history_uint tables, we were waiting about 12 seconds to draw graphs from history for 10 items over 1-3 days. Now it's about 0.5 sec for the same 10 items and range.
Another good side effect: we no longer collect several values for the same itemid and clock (second). Getting those into the history and history_uint tables is really a mistake; it was the reason our DB had grown to twice its current size.

Patching the C code is necessary for such a PK! Otherwise you will get stalled triggers for random items, which is very annoying and discouraging. The reason is desynchronization between the Zabbix server cache and the DB data, caused by PK-violation exceptions discarding large portions of values for random items and triggers from being inserted/updated in the DB.

My first version of patch is here. It's rather simple and could be improved I believe, but it works for me:
https://github.com/vagabondan/zabbix-patches/tree/master/zabbix-3.2/ZBXNEXT-3089

Reference:
We have a Zabbix v3.2.4 instance with a performance of 10k nvps; the DB size (after implementing the changes) is 800 GB for 7 days of raw history and about 1 year of trends. DB type: Percona MySQL v5.7.17, engine InnoDB (= XtraDB in Percona's case).
Web: nginx+apache+php-fpm.

Comment by Mathew [ 2017 Apr 02 ]

Sorry for not getting back to you, Stefan Priebe. The patch by Andrey adds it in the correct location.

We did not have any issues with the overflow function; it could just be that our use case did not pass through those paths - or perhaps it is because we are using an older version. I don't see any problems with Andrey's patch myself - when we next upgrade I'll give it a shot (it's clear he put some effort into it, +1).

We have had the odd stalled trigger (~3) over the years, but never in any serious quantity, and always fixed with a restart. I didn't even think to associate it with this patch! It's very possible that it's the cause.

Other than that, we have been running this in production happily. Without this patch we would most likely have needed to upgrade our servers again this year; now it's still running at <10% CPU, the DB size is almost 500 GB at 7k nvps, and graphs are near instant.

Comment by Stefan Priebe [ 2017 Apr 18 ]

I applied the patch from @Andrey, but I'm getting stalled triggers every few days. Any idea why this happens?

Comment by Mathew [ 2017 Apr 18 ]

Is there some pattern behind the triggers that stall?

Comment by Stefan Priebe [ 2017 Apr 18 ]

Haven't found one. To me it seems to be a difference between the Zabbix cache and the DB, which gets solved by a restart as the cache is then rebuilt from the DB?

Comment by Mathew [ 2017 Apr 18 ]

I'm very curious why I haven't hit it. I wonder if something in the newer Zabbix versions makes it worse.

Are you doing 1-2 s polling or something else that creates a lot of conflicts?

Comment by Stefan Priebe [ 2017 Apr 18 ]

No, the smallest check interval is 60 s, and the items where I got it were at 120 s.

Comment by Stefan Priebe [ 2017 Apr 18 ]

I'm happy to start debugging - just not sure where to start.

Comment by Filipe Paternot [ 2017 Apr 18 ]

Stefan, please look at ZBX-11768 and ZBX-11454. Might be related to your issue.

Comment by Stefan Priebe [ 2017 Apr 18 ]

Even though I'm seeing the same message as described in https://support.zabbix.com/browse/ZBX-11768, I never had that problem before this commit. But maybe I just missed it by accident. At least I'm seeing exactly these entries.

Comment by Andrey Denisov [ 2017 Apr 18 ]

It is very strange, @Stefan. Please check once again that you've applied all changes to the C code and recompiled the binaries from the modified sources.
By the way, did you implement primary keys only for the history and history_uint tables, or for all history tables?

Comment by Stefan Priebe [ 2017 Apr 18 ]

The code is applied to the local git repo. The source is also definitely rebuilt. I only changed the history and history_uint tables.

Comment by Andrey Denisov [ 2017 Apr 18 ]

Do you observe new debug messages in the log file from the newly implemented code at debuglevel=5?
If you could send me the modified files, I would check the difference against my version.

Comment by Stefan Priebe [ 2017 Apr 18 ]

With debuglevel 5 the log is spammed too much. I'll do the following:
1.) wait until it happens again
2.) if it happens on a regular basis, I'll modify the code so that it logs even with debuglevel=0

Comment by Stefan Priebe [ 2017 Apr 20 ]

Hmm, today I again had something like this:
52298:20170420:110900.599 cannot find open problem events for triggerid:302027, lastchange:1492679340
52298:20170420:110902.337 cannot find open problem events for triggerid:108505, lastchange:1492679342
52298:20170420:110908.353 cannot find open problem events for triggerid:230724, lastchange:1492679348
52298:20170420:110909.351 cannot find open problem events for triggerid:275841, lastchange:1492679349
52298:20170420:110911.452 cannot find open problem events for triggerid:112434, lastchange:1492679351

Any idea how to test / debug this?

Comment by Mathew [ 2017 Apr 20 ]

I'm currently testing Andrey's patch on an up to date version of Zabbix.

No TokuDB on this server, just InnoDB. Performance is a little worse because of this (was <10%, now 20-40% under similar nvps, out of 200%). Still a big improvement compared to no PK (it was at 200% and failing).

2 high-spec Xeon cores, 4 GB RAM, all-SSD storage.

No stalls or issues detected at this stage.

Comment by Andrey Denisov [ 2017 Apr 20 ]

Stefan, I believe, we are talking about different problems.

In my case, stalled triggers were triggers stuck in problem status while the item values no longer satisfied the trigger condition, i.e. triggers did not come back to OK status although item values had actually returned below the trigger threshold.

In your case, the messages state that events are missing for some triggers. It could be a housekeeper issue and resembles this situation:
https://support.zabbix.com/browse/ZBX-11426

Comment by Stefan Priebe [ 2017 Apr 20 ]

No, that should not be the case. The mentioned triggers and problems were just a few minutes old, so they could not have been removed by the housekeeper.

Comment by Andrey Melnikov [ 2017 Apr 23 ]

Zabbix produces duplicate entries in the database when exiting.

syncer 2:

  1289:20170422:222720.615 __zbx_zbx_setproctitle() title:'history syncer #2 [synced 0 items in 0.000042 sec, syncing history]'
  1289:20170422:222720.615 In DCsync_history() history_num:83
  1289:20170422:222720.616 query [txnlev:1] [begin;]
  1289:20170422:222720.616 In DCmass_update_items()
....
  1289:20170422:222721.515 In DCmass_add_history()
  1289:20170422:222721.515 query [txnlev:1] [insert into history (itemid,clock,ns,value) values (134733,1492889240,1,0.000000),(134741,1492889240,2,0.000000),(134734,1492889240,4,0.000000),(134742,1492889240,5,0.000000),(134735,1492889240,7,36.000000),(134743,1492889240,8,0.020000),(134736,1492889240,10,67.500000),(134744,1492889240,11,0.050000),(140410,1492889240,13,0.940000),(140411,1492889240,14,0.080000),(134739,1492889240,16,0.000000),(134746,1492889240,17,0.000000),(134845,1492889240,19,1.310000),(134847,1492889240,20,0.000000),(134846,1492889240,22,1.270000),(134848,1492889240,23,0.010000),(136076,1492889240,25,0.000000),136077,1492889240,26,0.000000),(141318,1492889240,28,0.960000),(141319,1492889240,29,0.130000),(136389,1492889240,31,0.940000),(136390,1492889240,32,0.080000);
] 
  1289:20170422:222721.516 query [txnlev:1] [insert into history_uint (itemid,clock,ns,value) values (134728,1492889240,0,35),(134729,1492889240,3,35),(134730,1492889240,6,0),(134731,1492889240,9,0),(140409,1492889240,12,0),(134738,1492889240,15,35),(134843,1492889240,18,0),(134844,1492889240,21,0),(136075,1492889240,24,36),(141317,1492889240,27,0),(136388,1492889240,30,0),(124978,1492889240,276077174,2),(124981,1492889240,276077174,48),(124969,1492889240,277202600,1),(137361,1492889241,305425779,1),(98001,1492889241,305585308,2),(97941,1492889241,305683354,2),(98121,1492889241,305705949,0),(98601,1492889241,305804258,2),(98421,1492889241,305841939,4),(98541,1492889241,305988850,0),(98661,1492889241,306035939,2),(136317,1492889241,307932899,408),(136205,1492889241,307932899,0),(133776,1492889241,307932899,11179074),(136293,1492889241,307932899,24),(136289,1492889241,307932899,0),(136181,1492889241,307932899,0),(136265,1492889241,307932899,0),(127071,1492889241,340680532,1),(127004,1492889241,340680532,1),(127014,1492889241,340680532,528),(127092,1492889241,340680532,824),(126998,1492889241,340680532,360),(127000,1492889241,340680532,216),(126999,1492889241,340680532,504),(127011,1492889241,340680532,224),(127006,1492889241,340680532,1),(127010,1492889241,340680532,608),(127007,1492889241,340680532,1),(127001,1492889241,340680532,976),(127002,1492889241,340680532,192),(127005,1492889241,340680532,1),(127008,1492889241,340680532,1),(127012,1492889241,340680532,472),(127201,1492889241,340680532,0),(127013,1492889241,340680532,1232),(126949,1492889241,340680532,27592328);
] 
  1289:20170422:222721.859 query [txnlev:1] [insert into history_text (itemid,clock,ns,value) values (133771,1492889241,307932899,'1.30.B022'),(133770,1492889241,307932899,'A1');
] 
...
  1289:20170422:222721.915 End of zbx_process_triggers()
  1289:20170422:222721.915 End of DCmass_update_triggers()
  1289:20170422:222721.915 In DCmass_update_trends()
  1289:20170422:222721.915 End of DCmass_update_trends()
  1289:20170422:222721.915 In process_trigger_events() events_num:0
  1289:20170422:222721.915 End of process_trigger_events() processed:0
  1289:20170422:222721.915 query [txnlev:1] [commit;]
  1289:20170422:222722.499 Got signal [signal:15(SIGTERM),sender_pid:1193,sender_uid:106,reason:0]. Exiting ...

Note itemid 133770, for example.

parent process:

  1193:20170422:222722.483 Got signal [signal:15(SIGTERM),sender_pid:25686,sender_uid:0,reason:0]. Exiting ...
  1193:20170422:222722.498 zbx_on_exit() called
  1193:20170422:222724.498 In DBconnect() flag:1
  1193:20170422:222724.499 End of DBconnect():0
  1193:20170422:222724.499 In free_database_cache()
  1193:20170422:222724.499 In DCsync_all()
  1193:20170422:222724.499 In DCsync_history() history_num:89
  1193:20170422:222724.499 syncing history data...
  1193:20170422:222724.499 query [txnlev:1] [begin;]
  1193:20170422:222724.499 In DCmass_update_items()
.....
  1193:20170422:222724.500 In DCmass_add_history()
  1193:20170422:222724.500 query [txnlev:1] [insert into history (itemid,clock,ns,value) values (117802,1492889237,437534335,100.000000),(117804,1492889237,437534335,0.000000),(131650,1492889237,437534335,0.000518),(131648,1492889237,437534335,0.000000),(134733,1492889240,1,0.000000),(134741,1492889240,2,0.000000),(134734,1492889240,4,0.000000),(134742,1492889240,5,0.000000),(134735,1492889240,7,36.000000),(134743,1492889240,8,0.020000),(134736,1492889240,10,67.500000),(134744,1492889240,11,0.050000),(140410,1492889240,13,0.940000),(140411,1492889240,14,0.080000),(134739,1492889240,16,0.000000),(134746,1492889240,17,0.000000),(134845,1492889240,19,1.310000),(134847,1492889240,20,0.000000),(134846,1492889240,22,1.270000),(134848,1492889240,23,0.010000),(136076,1492889240,25,0.000000),(136077,1492889240,26,0.000000),(141318,1492889240,28,0.960000),(141319,1492889240,29,0.130000),(136389,1492889240,31,0.940000),(136390,1492889240,32,0.080000),(23302,1492889242,488341822,0.000000);
]
  1193:20170422:222724.500 query [txnlev:1] [insert into history_uint (itemid,clock,ns,value) values (131647,1492889237,437534335,1),(117801,1492889237,437534335,0),(134728,1492889240,0,35),(134729,1492889240,3,35),(134730,1492889240,6,0),(134731,1492889240,9,0),(140409,1492889240,12,0),(134738,1492889240,15,35),(134843,1492889240,18,0),(134844,1492889240,21,0),(136075,1492889240,24,36),(141317,1492889240,27,0),(136388,1492889240,30,0),(124978,1492889240,276077174,2),(124969,1492889240,277202600,1),(137361,1492889241,305425779,1),(98001,1492889241,305585308,2),(97941,1492889241,305683354,2),(98121,1492889241,305705949,0),(98601,1492889241,305804258,2),(98421,1492889241,305841939,4),(98541,1492889241,305988850,0),(98661,1492889241,306035939,2),(136233,1492889241,307932899,816),(133776,1492889241,307932899,11179074),(127005,1492889241,340680532,1),(127093,1492889241,340680532,1384),(127007,1492889241,340680532,1),(127008,1492889241,340680532,1),(127006,1492889241,340680532,1),(127004,1492889241,340680532,1),(126949,1492889241,340680532,27592328),(127201,1492889241,340680532,0),(140122,1492889242,488500358,2864),(142282,1492889242,488545788,944);
]
  1193:20170422:222724.501 query [txnlev:1] [insert into history_text (itemid,clock,ns,value) values (133770,1492889241,307932899,'A1'),(133771,1492889241,307932899,'1.30.B022');
]
....
  1193:20170422:222733.456 End of DCflush_trends()
  1193:20170422:222733.456 query [txnlev:1] [commit;]
  1193:20170422:222733.555 syncing trend data done
  1193:20170422:222733.555 End of DCsync_trends()
  1193:20170422:222733.555 End of DCsync_all()
  1193:20170422:222733.555 In zbx_mem_destroy() descr:'history cache'
  1193:20170422:222733.555 End of zbx_mem_destroy()
  1193:20170422:222733.555 In zbx_mem_destroy() descr:'history index cache'
  1193:20170422:222733.555 End of zbx_mem_destroy()
  1193:20170422:222733.555 In zbx_mem_destroy() descr:'trend cache'
  1193:20170422:222733.555 End of zbx_mem_destroy()
  1193:20170422:222733.555 End of free_database_cache()
  1193:20170422:222733.555 In free_configuration_cache()
  1193:20170422:222733.555 In zbx_mem_destroy() descr:'configuration cache'
  1193:20170422:222733.555 End of zbx_mem_destroy()
  1193:20170422:222733.555 In zbx_strpool_destroy()
  1193:20170422:222733.555 In zbx_mem_destroy() descr:'string pool'
  1193:20170422:222733.555 End of zbx_mem_destroy()
  1193:20170422:222733.555 End of zbx_strpool_destroy()
  1193:20170422:222733.555 End of free_configuration_cache()
  1193:20170422:222733.555 In zbx_vc_destroy()
  1193:20170422:222733.555 In zbx_mem_destroy() descr:'value cache size'
  1193:20170422:222733.555 End of zbx_mem_destroy()
  1193:20170422:222733.555 End of zbx_vc_destroy()
  1193:20170422:222733.555 In zbx_free_ipmi_handler()
  1193:20170422:222733.555 End of zbx_free_ipmi_handler()
  1193:20170422:222733.555 In free_selfmon_collector() collector:0x7f471099a000
  1193:20170422:222733.555 End of free_selfmon_collector()
  1193:20170422:222733.555 In zbx_unload_modules()
  1193:20170422:222733.555 End of zbx_unload_modules()
  1193:20170422:222733.555 Zabbix Server stopped. Zabbix 3.3.0 (revision 65339).

It pushes the same items to the database again.

Comment by Rolf Fokkens [ 2017 Jun 28 ]

After fighting very slow graphs, the above solution works excellently! Some details.

I started by applying Andrey's patch to Zabbix 3.0.7 (which we're currently using).

Next I renamed the existing tables to old tables, and created new tables:

MariaDB [zabbix]> rename table history to history_old;
MariaDB [zabbix]> CREATE TABLE `history` (
      >  `itemid` bigint(20) unsigned NOT NULL,
      >  `clock` int(11) NOT NULL DEFAULT '0',
      >  `value` double(16,4) NOT NULL DEFAULT '0.0000',
      >  `ns` int(11) NOT NULL DEFAULT '0',
      >  PRIMARY KEY `history_1` (`itemid`,`clock`, `ns`)
      > ) ENGINE=InnoDB DEFAULT CHARSET=latin1;
MariaDB [zabbix]> rename table history_uint to history_uint_old;
MariaDB [zabbix]> CREATE TABLE `history_uint` (
      >   `itemid` bigint(20) unsigned NOT NULL,
      >   `clock` int(11) NOT NULL DEFAULT '0',
      >   `value` bigint(20) unsigned NOT NULL DEFAULT '0',
      >   `ns` int(11) NOT NULL DEFAULT '0',
      >   PRIMARY KEY `history_1` (`itemid`,`clock`, `ns`)
      > ) ENGINE=InnoDB DEFAULT CHARSET=latin1;
MariaDB [zabbix]>

Note I included ns in the PRIMARY KEY to reduce the duplicate count.
I restarted the zabbix-server to make sure it proceeded to store data in the new tables. In the graphs, history (not the trends) was gone, but "new" history was being added in the present.
Next I exported the old tables and imported them into the new tables:

[rolf.fokkens@zabbix-test ~]$ mysqldump --no-create-info -u zabbix -p zabbix history_old | xz -T 3 > history.sql.xz
Enter password: 
[rolf.fokkens@zabbix-test ~]$ xzcat -T 3 history.sql.xz | grep INSERT | sed 's/history_old/history/' | mysql -u zabbix -p zabbix
Enter password: 
[rolf.fokkens@zabbix-test ~]$ mysqldump --no-create-info -u zabbix -p zabbix history_uint_old | xz -T 3 > history_uint.sql.xz
Enter password: 
[rolf.fokkens@zabbix-test ~]$ xzcat -T 3 history_uint.sql.xz | grep INSERT | sed 's/history_uint_old/history_uint/' | mysql -u zabbix -p zabbix
Enter password: 

Finally I dropped the old tables:

MariaDB [zabbix]> drop table history_old;
MariaDB [zabbix]> drop table history_uint_old;

Now the performance of graphs in the UI is excellent!
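
For reference, the same copy can also be done entirely inside the database (a sketch only, not what I ran here); INSERT IGNORE drops rows that would collide on the new primary key:

-- Copy the old rows into the new tables; rows with a duplicate (itemid,clock,ns)
-- are silently skipped. For very large tables consider chunking by clock range.
INSERT IGNORE INTO history (itemid, clock, value, ns)
SELECT itemid, clock, value, ns FROM history_old;

INSERT IGNORE INTO history_uint (itemid, clock, value, ns)
SELECT itemid, clock, value, ns FROM history_uint_old;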

Comment by Dmitry Verkhoturov [ 2017 Aug 24 ]

The patch was updated for the 3.4 release version and is still available at https://github.com/zabbix/zabbix-patches.

Comment by Sergei Turchanov [ 2017 Dec 26 ]

We also obtained a significant speed-up with PRIMARY KEY (itemid, clock, ns) on history and history_uint. Before the change, a 3-day graph of a single item took ~10 seconds to draw (8640 raw data points). Now, with the primary key, it takes no more than a second.

As already noted, a clustered index puts values together on the same page, so the values for a specific itemid are located chronologically on the same InnoDB (or whatever) page. As shown in our case, processing a 3-day graph requires reading far more pages with the current schema than with a primary key. I would even speculate that the current schema requires reading 8640 individual pages vs. about 22 pages if a primary key were used: 8640 / (16384 (page size) * 0.8 (reserved for record growth) / 32 (record size)).

Plus (MySQL-specific), there is an extra indirection with InnoDB secondary indexes, since they refer to the primary key/rowid value (not to a physical offset). So there are twice as many reads as with just a primary key.

So the question is, why is this change not in the mainline?
Dear developers, do you care to answer?

Comment by Justin Gerry [ 2017 Dec 31 ]

Any hesitations about getting this patch into mainline? We are running a Percona XtraDB Cluster in production and seeing errors related to the lack of a primary key. It seems quite stable and sensible to add this change to increase performance.

Comment by Justin Gerry [ 2018 Jan 12 ]

We may also want to consider extending this idea to the history_text table. I could make the change on my installation, but I am unsure of the implications of just changing the DB structure without changing the code.

The addition of a primary key seems to help achieve better performance for medium and large setups. I am having good luck with history and history_uint so far, though it should be noted, for anyone finding this thread and needing some clarity, that the patch needs to be implemented in combination with Rolf's steps above.

Comment by David Ko [ 2018 Jan 18 ]

In the Zabbix 3.4 MySQL schema, I've identified the following tables as the ones without primary keys:

| zabbix             | dbversion                                          | InnoDB             |
| zabbix             | history                                            | InnoDB             |
| zabbix             | history_str                                        | InnoDB             |
| zabbix             | history_uint                                       | InnoDB             |

This is my currently proposed workaround:

ALTER TABLE history_uint add pk_column INT AUTO_INCREMENT PRIMARY KEY;
ALTER TABLE history_str add pk_column INT AUTO_INCREMENT PRIMARY KEY;
ALTER TABLE history add pk_column INT AUTO_INCREMENT PRIMARY KEY;
ALTER TABLE dbversion add pk_column INT AUTO_INCREMENT PRIMARY KEY;

This technically also protects us when Zabbix tries to make duplicate entries. It doesn't know about these "pk_column" columns, so it shouldn't try to include them in its queries... right?
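
For reference, a query along these lines (assuming information_schema is available) lists the tables that still lack a primary key:

-- List base tables in the zabbix schema that have no PRIMARY KEY constraint.
SELECT t.table_name
FROM information_schema.tables t
LEFT JOIN information_schema.table_constraints c
       ON c.table_schema = t.table_schema
      AND c.table_name   = t.table_name
      AND c.constraint_type = 'PRIMARY KEY'
WHERE t.table_schema = 'zabbix'
  AND t.table_type   = 'BASE TABLE'
  AND c.constraint_name IS NULL;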

Comment by Timofey [ 2018 Jan 18 ]

david.ko, IMHO this looks strange to me: primary keys show great benefits when the index is the primary key. So in the best case, Zabbix would use itemid and clock as the primary key. Just adding a new auto-incremented key is useless.

Comment by Justin Gerry [ 2018 Jan 23 ]

Background:
I have been trying to use a Percona XtraDB Cluster with my Zabbix setup, and this patch got me most of the way to quieting the logs of this error:
[Warning] WSREP: Percona-XtraDB-Cluster doesn't recommend use of DML command on a table <db.tablename>

In order for Percona XtraDB Cluster to work properly, without being isolated to a single write master and/or potentially missing updates, all tables should have primary keys.

To add to what David posted, here is the list of tables that appear to be most active and lack a primary key:
history
history_uint
history_str
history_text

I extended the patch to include these (removed duplicate parts... see my github for full patch):
int zbx_db_insert_execute(zbx_db_insert_t *self)
{
int ret = FAIL, i, j;
+ int isPK = 0;
+ char* history_tab="history";
+ char* history_uint_tab="history_uint";
+ char* history_text_tab="history_text";
+ char* history_str_tab="history_str";

<removed the unimportant text>
and then:

+#ifdef HAVE_MYSQL
+ if(strcmp(self->table->table,history_tab) == 0 || strcmp(self->table->table,history_uint_tab) == 0 || strcmp(self->table->table,history_text_tab) == 0 || strcmp(self->table->table,history_str_tab) == 0 )
+ isPK = 1;
+ zabbix_log(LOG_LEVEL_DEBUG, "zbx_db_insert_execute: in HAVE_MYSQL: history_tab=[%s], history_uint_tab=[%s], history_text_tab=[%s], history_str_tab=[%s], isPK=[%d]", history_tab,history_uint_tab,history_text_tab,history_str_tab,isPK);
+#endif

I also made the table changes to add PK support:

rename table history_text to history_text_old;
CREATE TABLE `history_text` (`itemid` bigint(20) unsigned NOT NULL,`clock` int(11) NOT NULL DEFAULT '0',`value` text NOT NULL,`ns` int(11) NOT NULL DEFAULT '0',PRIMARY KEY `history_1` (`itemid`,`clock`, `ns`)) ENGINE=InnoDB DEFAULT CHARSET=latin1;

rename table history_str to history_str_old;
CREATE TABLE `history_str` (`itemid` bigint(20) unsigned NOT NULL,`clock` int(11) NOT NULL DEFAULT '0',`value` varchar(255) NOT NULL DEFAULT '',`ns` int(11) NOT NULL DEFAULT '0',PRIMARY KEY `history_str_1` (`itemid`,`clock`, `ns`)) ENGINE=InnoDB DEFAULT CHARSET=latin1;

I've posted a clone of the zabbix patches with my edits here 3089 and zabbix 3.4 only: https://github.com/jgerry2002/zabbix-patches.git

If anyone has time or the ability to test this out and verify that my changes are even remotely sane, it would be helpful. I may have missed something in this process, as it was mostly a guess as to how all this works. So far things appear to be running in my own version of Zabbix (3.4.6) without any issues, but my setup is fairly new and does not contain much history data yet.

Comment by David Ko [ 2018 Jan 23 ]

If anyone has time or the ability to test this out and verify that my changes are even remotely sane it would be helpful.

I did exactly this in our code, except also for history_log table. I had planned to test this out today. I'll post with updates later today.

FYI, we're using the official Docker image supplied by Zabbix: https://hub.docker.com/r/zabbix/zabbix-server-mysql/

I got their Dockerfile and made the following changes to make a new "patched" image:

107,115c107
<     make -j"$(nproc)" -s dbschema 1>/dev/null
<
< # Patch ZBXNEXT-3089
< # https://github.com/zabbix/zabbix-patches/blob/master/zabbix-3.4/ZBXNEXT-3089/ZBXNEXT-3089.patch
< ADD patches/ZBXNEXT-3089/include/db.h /tmp/zabbix-${ZBX_VERSION}/include/
< ADD patches/ZBXNEXT-3089/src/libs/zbxdbhigh/db.c /tmp/zabbix-${ZBX_VERSION}/src/libs/zbxdbhigh/
< ADD patches/ZBXNEXT-3089/database/mysql/schema.sql /tmp/zabbix-${ZBX_VERSION}/database/mysql/schema.sql
<
< RUN cd /tmp/zabbix-${ZBX_VERSION} && \
---
>     make -j"$(nproc)" -s dbschema 1>/dev/null && \

Comment by Justin Gerry [ 2018 Jan 23 ]

Keep in mind that the released patch does not have the logging check for all the history table types, just history and history_uint. I've added the other types in a similar way.

Funny you should mention history_log. I just observed history_log warnings last night, so I will once again extend the patch to cover that as well. I'll post something when I update my version of the patch again tomorrow.

If possible, you should attempt to insert some history data to check with. I have a semi-live site with about 600 devices in it, but it's OK if I have to make some changes, at least for the short term.

Comment by Justin Gerry [ 2018 Jan 25 ]

Added a modification for history_log in the patch. It appears to compile OK.

Recreated the history_log table:

rename table history_log to history_log_old;
CREATE TABLE `history_log` (`itemid` bigint(20) unsigned NOT NULL,`clock` int(11) NOT NULL DEFAULT '0',`timestamp` int(11) NOT NULL DEFAULT '0',`source` varchar(64) NOT NULL DEFAULT '',`severity` int(11) NOT NULL DEFAULT '0',`value` text NOT NULL,`logeventid` int(11) NOT NULL DEFAULT '0',`ns` int(11) NOT NULL DEFAULT '0',PRIMARY KEY `history_log_1` (`itemid`,`clock`, `ns`)) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Posted an update to my github copy of all these patches.

Comment by David Ko [ 2018 Jan 26 ]

Added a modification for history_log in the patch. It appears to compile OK.

Same here. I did the patch for all of the history* tables. Compilation showed no issues. I dumped roughly 10 GB of data into each table and spun up Zabbix 3.4.6. Everything seems normal, and it is collecting data just fine.

Comment by Stefan Priebe [ 2018 Jan 27 ]

Isn't the Elasticsearch support in 3.4 far better than this? (https://www.zabbix.com/documentation/3.4/manual/appendix/install/elastic_search_setup)

Comment by David Ko [ 2018 Jan 27 ]

Sure, but this is what it says in the doc you linked:

Elasticsearch support is experimental!

Comment by Justin Gerry [ 2018 Jan 29 ]

David, do your modifications differ from mine at all?

I am running the modifications in semi-production without any issues. I have very low load on my XtraDB cluster as well. I'm on marginal/older hardware, so every bit of extra performance I can obtain is worth my time.

Stefan, I had not noticed the Elasticsearch support, but yes, Elasticsearch would be a much stronger and better-performing solution going forward, as it is fairly straightforward to create Elasticsearch machines/clusters/containers.

Comment by David Ko [ 2018 Jan 29 ]

Justin, I'm glad someone else is actively looking at this besides me.

We are doing a complete migration to "real" production tomorrow. So far, in test, everything seems okay, but I have not been able to obtain any performance data. Once I have something, I'll post an update.

Comment by Andrey Melnikov [ 2018 Jan 29 ]

david.ko
Also measure storage size - how much the index data eats.

Comment by David Ko [ 2018 Jan 29 ]

Also measure storage size - how much the index data eats.

I would have wanted to know about this as well, but I don't think it is very trivial. As the issue Reporter said, Zabbix has a tendency of "occasionally attempting to insert the same row twice." I actually forgot about this when first attempting the migration, and then realized there were a bunch of duplicate rows in all the history* tables, which caused the load to fail due to the new primary keys. I had to replace all the "INSERT INTO history*" with "INSERT IGNORE INTO history". This means that I'll be throwing away all the duplicate rows (well, rows that now have the same primary keys that we defined above - itemid, clock, ns), meaning less data. There definitely is a way to measure exactly how much I'm throwing away, but I don't think our team has much time to spend on this.
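
(For what it's worth, the amount being thrown away could be estimated up front with something along these lines, run against the pre-migration table; "history_old" is just a placeholder name, and it is a full scan, so it is not cheap on large tables.)

-- Rows that share an (itemid,clock,ns) with an earlier row, i.e. what
-- INSERT IGNORE would drop under the new primary key.
SELECT COUNT(*) - COUNT(DISTINCT itemid, clock, ns) AS duplicate_rows
FROM history_old;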

Comment by David Ko [ 2018 Feb 01 ]

Justin, I realized I never answered this question:

David, do your modifications differ from mine at all?


Nope - I did exactly as you said.

So far so good in production. I applied this patch to Zabbix server version 3.4.5. I don't see anything in the logs reporting warnings or errors that could be related to this. I'll keep watching this thread to answer any questions others may have.

Comment by Daniel Pogač [ 2018 Sep 10 ]

Hello all.
I must add my experience. The only tables without primary keys are the history tables. We have a MySQL DB in a Galera cluster setup, and our biggest history table holds a great many rows (I can add details; I don't remember them right now). The size of our Zabbix DB is near 200 GB. We encounter a DB synchronization problem when we delete some host, which causes the history tables to be cleared. Galera synchronization of tables without a primary key currently works with a full table scan on the synchronizing host. If I remove a host, it causes the deletion of millions of history rows, and this is a serious performance problem on the host which needs to synchronize. We need to remove monitored hosts carefully because of this issue. So I need to vote for adding primary keys to these tables. I don't care about the disk space these keys consume, because they add speed and follow best practice for using a Galera cluster.

Comment by Michael Spike [ 2019 Oct 03 ]

Hello guys,

does anybody know if this patch also works for the Zabbix 4.0 LTS version?

I would need to replicate the MySQL 5.7 Zabbix DB. Is it possible to revert the primary key patch after the replication, for better patch management?

Thanks.

Best regards

Mike

Comment by Rainer Stumbaum [ 2020 Mar 29 ]

With MySQL Group Replication it is a requirement to have Primary Keys on all tables.

https://dev.mysql.com/doc/refman/5.7/en/group-replication-requirements.html

Is Zabbix going to stay compatible with MySQL?

Comment by David Ko [ 2020 Mar 29 ]

Probably not; I gave up. We'll be deprecating our Zabbix soon, and I'll stop watching this thread.

Comment by Edgar Akhmetshin [ 2021 Aug 27 ]

Please consider this as a critical/blocker.

MariaDB 10.4 with ~3 TB of storage space used, LTS 5.0:

+----------------------------+--------+---------+------------+------------+----------------+--------------+-----------------+--------------+--------------+----------------+---------------------+---------------------+------------+-----------+----------+----------------+---------+------------------+-----------+
| Name                       | Engine | Version | Row_format | Rows       | Avg_row_length | Data_length  | Max_data_length | Index_length | Data_free    | Auto_increment | Create_time         | Update_time         | Check_time | Collation | Checksum | Create_options | Comment | Max_index_length | Temporary |
+----------------------------+--------+---------+------------+------------+----------------+--------------+-----------------+--------------+--------------+----------------+---------------------+---------------------+------------+-----------+----------+----------------+---------+------------------+-----------+
| history                    | InnoDB |      10 | Dynamic    | 8324945384 |             59 | 496652140544 |               0 | 386167226368 | 412009627648 |           NULL | 2021-08-25 15:36:36 | 2021-08-25 12:53:30 | NULL       | utf8_bin  |     NULL | partitioned    |         |                0 | N         |
| history_log                | InnoDB |      10 | Dynamic    |         61 |          16384 |       999424 |               0 |       999424 |            0 |           NULL | 2021-08-25 02:00:02 | NULL                | NULL       | utf8_bin  |     NULL | partitioned    |         |                0 | N         |
| history_str                | InnoDB |      10 | Dynamic    | 9283747054 |             67 | 628587003904 |               0 | 425439019008 |    244318208 |           NULL | 2021-08-25 02:00:02 | 2021-08-25 11:43:38 | NULL       | utf8_bin  |     NULL | partitioned    |         |                0 | N         |
| history_text               | InnoDB |      10 | Dynamic    |         61 |          16384 |       999424 |               0 |       999424 |            0 |           NULL | 2021-08-25 02:00:02 | NULL                | NULL       | utf8_bin  |     NULL | partitioned    |         |                0 | N         |
| history_uint               | InnoDB |      10 | Dynamic    | 8156686972 |             59 | 485237309440 |               0 | 368067182592 |    273678336 |           NULL | 2021-08-25 02:00:02 | 2021-08-25 11:43:38 | NULL       | utf8_bin  |     NULL | partitioned    |         |                0 | N         |

 
Two problems: the size of the index, and also the slowdown of external data exporting. Changing the index to a primary key reduces the time to export history data tenfold (from 10 minutes to 1 minute) for the same table, the same data and the same hardware spec.

Using an index is not an efficient approach in terms of space usage and DB server performance, especially with no support for ClickHouse/Cassandra and other scalable solutions.
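
(Per-table data vs. index size can be pulled with a query along these lines; the schema name is an assumption, adjust as needed.)

-- Data vs. index size of the history tables, in GiB.
SELECT table_name,
       ROUND(data_length  / POW(1024,3), 1) AS data_gib,
       ROUND(index_length / POW(1024,3), 1) AS index_gib
FROM information_schema.tables
WHERE table_schema = 'zabbix'
  AND table_name LIKE 'history%';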

Comment by Mathew [ 2021 Aug 27 ]

Edgar perhaps you can encourage this by submitting your LTS 5.0 version of the patch?

 

I'm still running this myself however on an outdated Zabbix version.

Comment by Alexei Vladishev [ 2021 Oct 12 ]

We are currently investigating the possibility of introducing primary keys for history tables in Zabbix 6.0. So far our tests demonstrate relatively minor (5-10% max) performance improvements likely caused by significant savings in disk space (around 30%). The tests were performed with MySQL and PostgreSQL on a large data set.

Comment by Mathew [ 2021 Oct 12 ]

@alexei Does this include the performance of running the housekeeper when cleaning up millions of data points?

Comment by Alexei Vladishev [ 2021 Oct 12 ]

splitice, performance of housekeeper is not a priority. The tests are mainly focused on the performance of insert and select operations.

Comment by Mathew [ 2021 Oct 12 ]

@alexei, system performance is the sum of all its parts. The housekeeper, which runs regularly, will stress the database server quite significantly. You could view this as lowering the performance of inserts for its entire operation.

 

I'd like to stress that in our testing (originally in 2016) the performance gain was considerable. It took the server from not working at all (overloaded) to working snappily. I haven't tested without the history PK patch since. Currently running it on 5.4.3 with 4k nvps (occasionally peaking above 10k due to trappers). CPU usage is 70-80% of one core.

Comment by Alexei Vladishev [ 2021 Oct 12 ]

splitice, I am with you. What I'd like to say is that housekeeper performance (it will certainly be better) won't affect our decision, so why test it.

Comment by Rainer Stumbaum [ 2021 Oct 12 ]

When you say a large data set... I believe we are all engineers here and can handle numbers in terabytes or millions of rows.

Please state real world numbers.

Comment by Alexei Vladishev [ 2021 Oct 13 ]

With a moderate data set of around 260M records in history (23.5 GB data+index, MySQL) we see around 40% disk space savings when using primary keys. PostgreSQL uses around 31 GB of disk space with the original schema and 23 GB with primary keys for the same data set, so about 27% less disk space.

We definitely want to introduce primary keys for history tables in 6.0. We are now double-checking that it won't introduce any issues with partitioning and TimescaleDB; this will also require certain changes on the Zabbix server side to make sure that we handle duplicate values nicely.

Comment by Brian van Baekel [ 2021 Oct 13 ]

Those are some interesting numbers, @alexei. Thanks for the insight - this seems like a massive improvement in regards to disk space, especially in the bigger environments. In those bigger environments a 5-10% observed improvement in performance is significant, imo.

Comment by Alexei Vladishev [ 2021 Oct 13 ]

brian.baekel, true. That's why it is so tempting to have it implemented. Today we are finishing our final tests with Timescale, if everything goes well then it goes straight to dev team for inclusion into 6.0. Note that the database schema will not be adjusted during upgrade from earlier versions of Zabbix for obvious reasons.

Comment by Filipe Paternot [ 2021 Oct 13 ]

Hello alexei, how does this feature interact with the scalable storage in ZBXNEXT-714? Both are on the roadmap for 6.0.

Shouldn't it be implemented there, when saving new data?

Comment by Alexei Vladishev [ 2021 Oct 14 ]

fpaternot, ZBXNEXT-714 has nothing to do with this ticket; it is about having a well-defined history API to enable storage of time series data in any external engine using a pluggable architecture. I have to say that ZBXNEXT-714 will likely be postponed to 6.2.

Comment by Alexei Vladishev [ 2021 Oct 22 ]

All preparation work and tests are finished. We are introducing primary keys for all tables in Zabbix 6.0.

Comment by Mathew [ 2021 Oct 26 ]

Related

 

I've been looking into the possibility of using the MariaDB CONNECT engine's partitioning capability instead of InnoDB's recently, with the potential goal of either:

a) Spreading load and storage requirements out to multiple mysql servers (active + archival); or

b) Introducing tables shared to S3 (MariaDB S3 engine)

 

Still early days in evaluation (Currently waiting on a larger dataset of approx 40GB to import into the new schema).

 

As part of this work I've identified that it might be worth exploring the pros and cons of "INSERT ... ON DUPLICATE KEY UPDATE" and "INSERT IGNORE", as well as the potential of inserting directly into partitions of a CONNECT-based table. I'll be experimenting with these once a larger dataset is imported.

Comment by Alexei Vladishev [ 2021 Oct 26 ]

splitice, data storage must not require update operations if possible; that is why we will not use "INSERT ... ON DUPLICATE KEY UPDATE" for sure. It makes the history API much simpler and lowers the bar for various storage engines.

Comment by Mathew [ 2021 Oct 26 ]

alexei, will your PRIMARY KEY changes for 6.0 include nanoseconds in the primary key (itemid,clock,ns)?

 

I must admit all our testing in recent years has been with a primary key on "itemid,clock" (with the patch in this issue). That might account for some of the increased performance people report here, on top of your testing. I'd argue many people don't really care about nanoseconds and hence are happy to be restricted to one-value-per-second accuracy.

 

I'm thinking our testing of CONNECT will involve patching the INSERT query builder to insert into "history_p$format", as I think that will be required to get the INSERT (zabbix-server) side of CONNECT functioning.

 

My initial findings with CONNECT have been:

  1. CONNECT uses table-level read/write locking (maybe only after partition selection, to be investigated). INSERT'ing into a CONNECT table may have insufficient performance (possibly avoided by INSERT'ing into the underlying table)
  2. CONNECT is not compatible with the query cache (a non-issue, it's not a performance requirement for Zabbix server)
  3. CONNECT does not support INSERT IGNORE or INSERT ... ON DUPLICATE KEY UPDATE
  4. SELECT performance over CONNECT is in the same order of magnitude as an InnoDB partitioned table (small-scale test; large-scale test data is loading currently)

I think it has real potential when paired with MariaDB's S3 engine (there are also other mediums exposed either via CONNECT or directly as engines) for archive data, and it may present an alternative to you supporting other storage mediums directly, as the data is exposed as a read-only MariaDB/MySQL table with most of the same capabilities.

Comment by Alexei Vladishev [ 2021 Oct 26 ]

splitice, the primary key will be based on (itemid,clock,ns). I do not quite understand the idea with CONNECT (IMHO it will kill performance); it sounds off-topic anyway.

Comment by Mathew [ 2021 Oct 27 ]

alexei,

 

I think the root issue here is the quest for high performance and scalability for larger users. Our extension of this to enable archival storage (high-speed NVMe storage is very expensive; we can afford either high-speed storage or high capacity) is a bit off-topic but related at the design level.

 

The users participating in this issue are using primary keys and partitioning to achieve these goals currently. I think that same user group would be interested in my findings (which, so far, are not killing performance at all), but hey, if you don't feel this is a suitable place to share that, it's fine, I won't. CONNECT's performance is not particularly bad (at least if used primarily for frontend queries, with backend INSERTs being routed in the application layer).

 

Is this something Zabbix would ever adopt as a de facto schema? Probably not.

Are there valid findings (perhaps on the benefits of application-layer partitioning) that could be a positive influence on later design? Of that I have no doubt.

 

 

Data to back up my "not particularly bad" statement:

 

 MariaDB [zabbix-server]> SELECT * FROM `history_uint` WHERE itemid=23693 AND clock BETWEEN 1635209900 AND 1635210000 LIMIT 30;
+--------+------------+-------+-----------+
| itemid | clock | value | ns |
+--------+------------+-------+-----------+
| 23693 | 1635209901 | 1 | 521480589 |
| 23693 | 1635209911 | 1 | 171632831 |
| 23693 | 1635209921 | 1 | 740553104 |
| 23693 | 1635209931 | 1 | 442566066 |
| 23693 | 1635209941 | 1 | 678118281 |
| 23693 | 1635209951 | 1 | 904607350 |
| 23693 | 1635209961 | 1 | 849167510 |
| 23693 | 1635209971 | 1 | 116414855 |
| 23693 | 1635209981 | 1 | 589172639 |
| 23693 | 1635209991 | 1 | 713434572 |
+--------+------------+-------+-----------+
10 rows in set (0.000 sec)

MariaDB [zabbix-server]> SELECT * FROM `history_uint_connect` WHERE itemid=23693 AND clock BETWEEN 1635209900 AND 1635210000 LIMIT 30;
+--------+------------+-------+-----------+
| itemid | clock | value | ns |
+--------+------------+-------+-----------+
| 23693 | 1635209901 | 1 | 521480589 |
| 23693 | 1635209911 | 1 | 171632831 |
| 23693 | 1635209921 | 1 | 740553104 |
| 23693 | 1635209931 | 1 | 442566066 |
| 23693 | 1635209941 | 1 | 678118281 |
| 23693 | 1635209951 | 1 | 904607350 |
| 23693 | 1635209961 | 1 | 849167510 |
| 23693 | 1635209971 | 1 | 116414855 |
| 23693 | 1635209981 | 1 | 589172639 |
| 23693 | 1635209991 | 1 | 713434572 |
+--------+------------+-------+-----------+
10 rows in set (0.003 sec)

 

 

To replicate:

Disk and buffer pool pre-warmed. No query cache. The underlying SELECT routed to the backend CONNECT table completed in just above 0.001 s, so 0.002 s is the overhead. Some overhead is likely due to the increased number of partitions in history_uint_connect.

 

history_uint and history_uint_connect hold the same scale of data (33 GB for history_uint, 30 GB for history_uint_connect). history_uint uses 24 h per partition and a PK over "itemid,clock"; history_uint_connect uses a partition every 1 h.

 

INSERT performance is more seriously affected; INSERT'ing from zabbix-server through CONNECT does not appear viable at scale (it appears roughly 2-3x slower and takes a table-level write lock). An application-layer (zabbix-server) patch is next on my testing list, to introduce application-layer partitioning support and bypass CONNECT.

Comment by Alexei Vladishev [ 2021 Oct 27 ]

I am looking forward to implementing two things in Zabbix. One is the history API, which will allow connecting any storage engine; it is already on the roadmap. The other is splitting operational data (i.e. data used for trigger processing) from the longer-term history used for visualization and reporting. Also, I do not think that sharding logic should be part of the Zabbix software; it must be implemented on the storage side.

Comment by Alexei Vladishev [ 2021 Nov 16 ]

Primary keys for historical tables have been implemented under ZBXNEXT-6921. I am closing this ticket.
