-
Problem report
-
Resolution: Fixed
-
Major
-
6.4.0
-
DB: (PostgreSQL) 14.5 (Ubuntu 14.5-2.pgdg20.04+2) + timescaledb 2.7.1
32 CPU, 80 GB RAM
Server: HA Zabbix two-node cluster - 6.4.0, 8 CPU, 80 GB RAM
Number of hosts (enabled) - 29266
Number of items (enabled/disabled/not supported) - 11286986 / 40027 / 13672
Number of triggers (enabled/disabled) - 10272797 / 258826
Required server performance, new values per second - 22523.16
Everything is monitored by 14 active zabbix proxies.
95% of items have logrt.count key
-
Sprint 99 (Apr 2023), Sprint 100 (May 2023)
-
0.5
Steps to reproduce:
- We have a very large environment with a lot of problems. Most of them are created and resolved within a short period of time. Example: number of active triggers:
SELECT count(*) FROM problem p WHERE p.source='0' AND p.object='0' AND NOT EXISTS (SELECT NULL FROM event_suppress es WHERE es.eventid=p.eventid) AND p.cause_eventid IS NULL AND p.r_eventid IS NULL;
 count
-------
 22627
Number of rows in 'problem' table:
SELECT count(*) FROM problem;
 count
--------
 469844
The size of this table on disk is around 85 MB.
- Once per hour we see the housekeeper at ~100% utilization for 10-15 minutes. After investigation we noticed something strange:
housekeeper [deleted 0 hist/trends, 0 items/triggers, 0 events, 0 sessions, 0 alarms, 0 audit items, 0 records in 713.206790 sec, idle for 1 hour(s)]
For now there is nothing to delete, because it is a new environment and has not reached the storage period yet. Still, each run takes quite a long time.
- In config:
HousekeepingFrequency=1
MaxHousekeeperDelete=10000 (our test value; changing it to a larger or smaller value makes no difference to this problem)
Result:
- We increased the log level on the housekeeper; output is in the attachment. The interesting line:
1044522:20230413:124551.911 slow query: 337.193469 sec, "select p1.eventid from problem p1 where p1.r_clock<>0 and p1.r_clock<1681296014 and p1.eventid not in ( select cause_eventid from problem p2 where p1.eventid=p2.cause_eventid) limit 10000"
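The correlated NOT IN subquery in the slow query above is logically equivalent to a plain NOT EXISTS: the subquery can only return values equal to p1.eventid (never NULL), so the usual NOT IN/NULL pitfall cannot change the result. A minimal sketch in Python/sqlite3 (the cut-down 'problem' table and sample rows are invented for illustration) showing the two forms select the same rows:

```python
import sqlite3

# Tiny stand-in for the Zabbix 'problem' table; columns cut down, data invented.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE problem (eventid INTEGER PRIMARY KEY,"
    " r_clock INTEGER, cause_eventid INTEGER)"
)
con.executemany(
    "INSERT INTO problem VALUES (?, ?, ?)",
    [
        (1, 100, None),  # resolved, no cause -> candidate for deletion
        (2, 200, None),  # resolved, but is the cause of event 3 -> kept
        (3, 300, 2),     # symptom of event 2, resolved -> candidate
        (4, 0,   None),  # still open (r_clock = 0) -> skipped
    ],
)

# Housekeeper form: correlated NOT IN subquery.
not_in = con.execute("""
    SELECT p1.eventid FROM problem p1
    WHERE p1.r_clock <> 0 AND p1.r_clock < 1681296014
      AND p1.eventid NOT IN (SELECT cause_eventid FROM problem p2
                             WHERE p1.eventid = p2.cause_eventid)
""").fetchall()

# Equivalent NOT EXISTS form: the subquery only ever yields p1.eventid
# itself, so NULL handling cannot make the two forms diverge.
not_exists = con.execute("""
    SELECT p1.eventid FROM problem p1
    WHERE p1.r_clock <> 0 AND p1.r_clock < 1681296014
      AND NOT EXISTS (SELECT 1 FROM problem p2
                      WHERE p2.cause_eventid = p1.eventid)
""").fetchall()

print(not_in, not_exists)
```

Both forms keep resolved problems that are not the cause of any other problem; the performance difference on PostgreSQL comes from how the correlated subquery is executed, not from the semantics.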
- Check explain on database:
zabbix=# explain analyze select p1.eventid from problem p1 where p1.r_clock<>0 and p1.r_clock<1681396014 and p1.eventid not in ( select cause_eventid from problem p2 where p1.eventid=p2.cause_eventid) limit 10000;
                                                                    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.42..174369641.30 rows=10000 width=8) (actual time=185.701..323153.930 rows=10000 loops=1)
   ->  Index Scan using problem_2 on problem p1  (cost=0.42..1507512730.64 rows=86455 width=8) (actual time=99.502..323057.884 rows=10000 loops=1)
         Index Cond: (r_clock < 1681396014)
         Filter: ((r_clock <> 0) AND (NOT (SubPlan 1)))
         Rows Removed by Filter: 21924
         SubPlan 1
           ->  Seq Scan on problem p2  (cost=0.00..16617.11 rows=1 width=8) (actual time=32.290..32.290 rows=0 loops=10000)
                 Filter: (p1.eventid = cause_eventid)
                 Rows Removed by Filter: 464175
 Planning Time: 0.314 ms
 JIT:
   Functions: 15
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 2.016 ms, Inlining 9.124 ms, Optimization 50.737 ms, Emission 26.105 ms, Total 87.981 ms
 Execution Time: 323162.612 ms
(15 rows)
- After more research we found something interesting. The same query without (we think) the cause-and-symptom check runs fast:
explain analyze select p1.eventid from problem p1 where p1.r_clock<>0 and p1.r_clock<1681396014 limit 10000;
                                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.42..920.69 rows=10000 width=8) (actual time=61.017..67.360 rows=10000 loops=1)
   ->  Index Scan using problem_2 on problem p1  (cost=0.42..15912.80 rows=172911 width=8) (actual time=61.015..66.868 rows=10000 loops=1)
         Index Cond: (r_clock < 1681396014)
         Filter: (r_clock <> 0)
         Rows Removed by Filter: 21914
 Planning Time: 0.155 ms
 Execution Time: 67.626 ms
(7 rows)
We don't use this feature yet, so the 'cause_eventid' column is empty.
Expected:
Perhaps the query could be changed, or the 'problem' table structure adjusted, so that this query runs faster.
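One plausible structural fix, which the linked ZBX-22889 points at, is an index on problem.cause_eventid: the correlated subquery then becomes a per-row index lookup instead of a full table scan per outer row. A hedged sketch in Python/sqlite3 (schema cut down, data invented; PostgreSQL plan output differs, but the scan-vs-index-lookup effect is the same):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE problem (eventid INTEGER PRIMARY KEY,"
    " r_clock INTEGER, cause_eventid INTEGER)"
)
# Index Zabbix already has (used for the r_clock condition).
con.execute("CREATE INDEX problem_2 ON problem (r_clock)")

QUERY = """
    EXPLAIN QUERY PLAN
    SELECT p1.eventid FROM problem p1
    WHERE p1.r_clock <> 0 AND p1.r_clock < 1681396014
      AND p1.eventid NOT IN (SELECT cause_eventid FROM problem p2
                             WHERE p1.eventid = p2.cause_eventid)
"""

def plan():
    # EXPLAIN QUERY PLAN rows carry the human-readable detail last.
    return " | ".join(row[-1] for row in con.execute(QUERY))

before = plan()  # correlated subquery scans 'problem' for every outer row

# Hypothetical fix: index the column the correlated subquery filters on
# (index name invented for illustration).
con.execute("CREATE INDEX problem_cause_eventid ON problem (cause_eventid)")
after = plan()   # subquery now resolved via the new index

print("before:", before)
print("after: ", after)
```

On PostgreSQL the analogue would be `CREATE INDEX ... ON problem (cause_eventid);`, which should turn the Seq Scan in SubPlan 1 of the EXPLAIN above into an index scan; rewriting the housekeeper query as NOT EXISTS would give the planner a similar opportunity.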
- caused by
-
ZBXNEXT-7964 Cause and symptom events
- Closed
- depends on
-
ZBX-22889 Performance problems due to missing problem.cause_eventid index on PostgreSQL
- Closed