[ZBX-23690] Utilization of history syncer processes 100% when creating and updating many maintenances Created: 2023 Nov 10  Updated: 2024 Oct 04  Resolved: 2024 Jun 06

Status: Closed
Project: ZABBIX BUGS AND ISSUES
Component/s: Server (S)
Affects Version/s: 6.4.7
Fix Version/s: 6.4.16rc1, 7.0.1rc1, 7.2.0alpha1

Type: Problem report Priority: Critical
Reporter: Yurii Polenok Assignee: Andris Zeila
Resolution: Fixed Votes: 2
Labels: maintenance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

PostgreSQL 14.5
TimescaleDB 2.7.2


Attachments: File ZBX-23690-6.4-1 1.diff     PNG File image-2023-11-10-00-35-15-020.png     PNG File image-2023-11-10-00-35-33-568.png     PNG File image-2023-11-10-00-36-02-102.png     File optimize_host_maintenance_index.diff     File optimize_host_maintenance_index_prof-2.diff     File replace_shared_table_lock_with_row_lock_for_update.diff     Text File zabbix_server (1).log    
Issue Links:
Duplicate
Team: Team A
Sprint: S24-W10/11, S24-W12/13, S24-W14/15, S24-W16/17, S24-W18/19, S24-W20/21, S24-W22/23
Story Points: 2

 Description   

Steps to reproduce:
Create/update maintenances 700+ times during 30 minutes.
An example of such a maintenance:

The host group is the parent group of all other groups and contains 28000+ hosts.
The tag holds the hostname. We cannot use maintenance on regular Zabbix hosts, because some of our events are not directly tied to Zabbix hosts: they are raised from "virtual" hosts. Events from such hosts can be associated with different virtual machines, e.g.
How it works:

An external system handles product deployment. When it's time to deploy a new version of a product to a virtual machine, it creates maintenance via the Zabbix API. When the product is updated and running, this system sends maintenance updates by changing the "Active till" and "Maintenance period length" to "now+5 min" for example.
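
The create-then-shorten workflow described above could be sketched roughly as follows. This only builds JSON-RPC request bodies for `maintenance.create` and `maintenance.update` and sends nothing. The helper names, the group ID, the maintenance ID and the timestamps are hypothetical; the parameter layout (`groups`, `tags`, `timeperiods`) is assumed to match the Zabbix 6.4 API, and authentication is left out.

```python
def maintenance_create_request(name, groupid, hostname, now, duration=3600, request_id=1):
    """Build a JSON-RPC body for maintenance.create (hypothetical helper, nothing is sent)."""
    return {
        "jsonrpc": "2.0",
        "method": "maintenance.create",
        "params": {
            "name": name,
            "maintenance_type": 0,  # assumed: 0 = with data collection
            "active_since": now,
            "active_till": now + duration,
            "groups": [{"groupid": groupid}],  # the single large parent host group
            # assumed: operator 0 = equals; one tag carrying the host name
            "tags": [{"tag": "hostname", "operator": 0, "value": hostname}],
            # assumed: timeperiod_type 0 = one-time period
            "timeperiods": [{"timeperiod_type": 0, "start_date": now, "period": duration}],
        },
        "id": request_id,
    }


def maintenance_shorten_request(maintenanceid, start, now, grace=300, request_id=2):
    """Build a maintenance.update body that ends the maintenance at now + grace seconds."""
    end = now + grace
    return {
        "jsonrpc": "2.0",
        "method": "maintenance.update",
        "params": {
            "maintenanceid": maintenanceid,
            "active_till": end,
            # the period length is recomputed so the period also ends at now + grace
            "timeperiods": [{"timeperiod_type": 0, "start_date": start, "period": end - start}],
        },
        "id": request_id,
    }
```

Each deployed VM produces one create and at least one update request, so several hundred VMs translate directly into several hundred writes to the maintenance tables.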

If we need to update several hundred VMs one by one in a short period of time (30-60 minutes), this leads to a history syncer problem. Zabbix then cannot even process new data from the proxies and is effectively down during the service and for some time after it ends.
During this time, there is no heavy load on the operating system, the database or the database disks.
The Zabbix server log shows only slow queries:
2771378:20231109:063027.178 slow query: 3.015999 sec, "lock table maintenances in share mode"
2771381:20231109:063709.894 slow query: 32.688270 sec, "lock table maintenances in share mode"
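
To see how much time is lost to this statement, the slow query entries can be pulled out of the server log with a small script. This is a minimal sketch that assumes every entry has the same shape as the two lines above (PID, date, time, duration, quoted SQL); the function names are illustrative.

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches lines such as:
# 2771378:20231109:063027.178 slow query: 3.015999 sec, "lock table ..."
SLOW_QUERY = re.compile(
    r'^\s*(?P<pid>\d+):(?P<date>\d{8}):(?P<time>\d{6}\.\d{3}) '
    r'slow query: (?P<seconds>\d+\.\d+) sec, "(?P<sql>.*)"\s*$'
)


def parse_slow_queries(lines):
    """Return (timestamp, pid, seconds, sql) tuples for every slow query line."""
    entries = []
    for line in lines:
        match = SLOW_QUERY.match(line)
        if match is None:
            continue  # not a slow query entry
        stamp = datetime.strptime(match["date"] + match["time"], "%Y%m%d%H%M%S.%f")
        entries.append((stamp, int(match["pid"]), float(match["seconds"]), match["sql"]))
    return entries


def seconds_per_statement(entries):
    """Sum the reported duration per SQL statement."""
    totals = defaultdict(float)
    for _stamp, _pid, seconds, sql in entries:
        totals[sql] += seconds
    return dict(totals)
```

Feeding the whole zabbix_server.log through `parse_slow_queries` and then `seconds_per_statement` shows whether the table lock dominates the slow queries.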

Of course, we try to do multi-host maintenance when possible, so as not to spam Zabbix with a huge number of separate maintenances, but unfortunately, this is not always possible.

StartDBSyncers=64
Result:
[screenshots omitted]
Expected:
Creating and updating maintenance should not greatly affect the history syncer and other processes.
~23 maintenance changes per minute doesn't seem like that many.



 Comments   
Comment by Andrei Gushchin (Inactive) [ 2023 Nov 13 ]

Hello Yurii,

How often do you update maintenances?
Is the slowness related to slow queries in the log?
How can it be reproduced accurately? How many hosts and items do you have, which of them go into maintenance, and is the maintenance with or without data collection?

Best regards,
Andrei

Comment by Yurii Polenok [ 2023 Nov 13 ]

Hello,

Such maintenances occur almost every weekday, but each time they apply to a different number of hosts.
Each maintenance is created for one or more VMs. The product is usually deployed to the VM in 2-10 minutes, and then the maintenance is updated with a new end time, because by then everything should be up and running and we should monitor the services as usual. The end time is now+5 minutes, to be sure all Zabbix events are resolved.

Only slow queries in Zabbix server logs:
2771378:20231109:063027.178 slow query: 3.015999 sec, "lock table maintenances in share mode"
2771381:20231109:063709.894 slow query: 32.688270 sec, "lock table maintenances in share mode"

Number of hosts (enabled/disabled): 28925 (28463 / 462)
Number of items (enabled/disabled/not supported): 754250 (723110 / 31019 / 121)

An example of the maintenance can be found in the screenshot attached to this issue.
The maintenance type is "With data collection". "No data collection" doesn't allow tags and doesn't work for us, because we create maintenances based on one parent host group.
You can try to reproduce this by creating and then updating 350+ maintenances within 30 minutes for different hosts. Use one large parent host group that contains all hosts and one or two tags associated with a unique Zabbix host. For example, if you have hosts test1.com and test2.com, then the tags will be:
tag name "hostname", tag value "test1.com" and
tag name "hostname", tag value "test2.com"

Comment by Vladislavs Sokurenko [ 2023 Dec 15 ]

Could you please provide the prof_enable output with ZBX-23690-6.4-1 1.diff applied?

Are there many events with tags?

yuriip ?

Comment by Vladislavs Sokurenko [ 2023 Dec 15 ]

Most likely the problem is in matching each event against many maintenances using tags for each host. It should be checked whether this can be optimised.
For example:
If we have simple maintenances with a single "equals" tag, as described in the issue, then put those maintenances into a hashset and do a hashset lookup for each event, instead of iterating through all hosts for each event.
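
A rough illustration of this idea (not the actual Zabbix server code, which is written in C) could look like the sketch below. It assumes every maintenance has exactly one "equals" tag condition; the function names and the data layout are hypothetical.

```python
from collections import defaultdict


def build_tag_index(maintenances):
    """Index maintenances by their single (tag, value) 'equals' condition.

    maintenances: iterable of (maintenanceid, tag, value) tuples.
    """
    index = defaultdict(set)
    for maintenanceid, tag, value in maintenances:
        index[(tag, value)].add(maintenanceid)
    return index


def matching_maintenances(index, event_tags):
    """Return IDs of maintenances matched by any of the event's (tag, value) pairs."""
    matched = set()
    for tag, value in event_tags:
        # one hash lookup per event tag instead of a scan over all maintenances
        matched |= index.get((tag, value), set())
    return matched
```

With such an index, the cost per event depends on the number of tags on the event rather than on the number of maintenances or hosts.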

Comment by Yurii Polenok [ 2024 Jan 03 ]

Test package installed in dev env. If it works without problems it will be installed in prod on Monday.

Each event has tags.
Unfortunately, we now have a lot of test/technical Not classified events (about 700) and about 700 real events.
Most events are correlated, but as I understand it, all events matter for maintenance, not only the cause events.

Comment by Andris Zeila [ 2024 Jan 08 ]

Could you please check the attached patch replace_shared_table_lock_with_row_lock_for_update.diff? It should help in the scenario with many host-specific maintenances.
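
The patch itself is not reproduced here, so the following is only an illustration of why such a change can help, based on the patch name and on the PostgreSQL table-level lock conflict rules. It assumes the server's "lock table maintenances in share mode" takes a SHARE lock, that writes made through the API (INSERT/UPDATE/DELETE) take ROW EXCLUSIVE, and that a "select ... for update" takes ROW SHARE on the table plus locks on the selected rows only. Only the three relevant lock modes are modelled.

```python
# Subset of PostgreSQL's table-level lock conflict matrix (three modes only).
CONFLICTS = {
    "ROW SHARE": {"EXCLUSIVE", "ACCESS EXCLUSIVE"},
    "ROW EXCLUSIVE": {"SHARE", "SHARE ROW EXCLUSIVE", "EXCLUSIVE", "ACCESS EXCLUSIVE"},
    "SHARE": {
        "ROW EXCLUSIVE",
        "SHARE UPDATE EXCLUSIVE",
        "SHARE ROW EXCLUSIVE",
        "EXCLUSIVE",
        "ACCESS EXCLUSIVE",
    },
}

# Assumed mapping from statement kind to the table-level lock it acquires.
LOCK_TAKEN = {
    "lock table ... in share mode": "SHARE",
    "select ... for update": "ROW SHARE",
    "insert/update/delete": "ROW EXCLUSIVE",
}


def conflicts(held, requested):
    """True if a table lock in mode `requested` must wait for one held in mode `held`."""
    return requested in CONFLICTS[held]


def must_wait(running_statement, new_statement):
    """True if new_statement has to wait for running_statement at the table level."""
    return conflicts(LOCK_TAKEN[running_statement], LOCK_TAKEN[new_statement])
```

Under these assumptions, a share-mode table lock and any concurrent maintenance write block each other, whereas a row lock only collides with writers touching the same rows.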

Comment by Andris Zeila [ 2024 Jan 15 ]

The optimize_host_maintenance_index.diff patch reduces configuration cache locking time when processing maintenances (there need to be 100+ maintenances over thousands of hosts matched by tags for it to have any real effect).

Comment by Vladislavs Sokurenko [ 2024 Jan 26 ]

Could you please provide profiling information (prof_enable) for the history syncers with optimize_host_maintenance_index_prof-2.diff applied? It contains additional profiling that is needed to identify the bottleneck.

Comment by Yurii Polenok [ 2024 Mar 15 ]

Hello,
We can state that after installing the last provided package, the load situation improved and no threshold was exceeded. The other day there was quite a large number of maintenances and the load was much higher than usual, but it still did not reach the threshold value. I hope it was not just a coincidence and the problem is really solved.
We hope to see this fix in version 6.4.13.
Thank you!

Comment by Andris Zeila [ 2024 May 24 ]

Released ZBX-23690 in:

  • pre-6.4.16rc1 be05dc3d77e
  • pre-7.0.1rc1 320bb5bee26
  • pre-7.2.0alpha1 38ed95090b3
Generated at Mon Jun 02 17:32:15 EEST 2025 using Jira 9.12.4#9120004-sha1:625303b708afdb767e17cb2838290c41888e9ff0.