-
Problem report
-
Resolution: Fixed
-
Major
-
6.4.0
-
DB: (PostgreSQL) 14.5 (Ubuntu 14.5-2.pgdg20.04+2) + timescaledb 2.7.1
32 CPU, 80 GB RAM
Server: HA Zabbix two-node cluster - 6.4.0, 8 CPU, 80 GB RAM
Number of hosts (enabled) - 29266
Number of items (enabled/disabled/not supported) - 11286986 / 40027 / 13672
Number of triggers (enabled/disabled) - 10272797 / 258826
Required server performance, new values per second - 22523.16
Everything is monitored by 14 active zabbix proxies.
95% of items have logrt.count key
-
Sprint 99 (Apr 2023), Sprint 100 (May 2023)
-
0.5
Steps to reproduce:
- We have a very large environment with a lot of problems. Most of them are created and resolved within a short period of time. Example: number of active triggers:
SELECT count(*) FROM problem p WHERE p.source='0' AND p.object='0' AND NOT EXISTS (SELECT NULL FROM event_suppress es WHERE es.eventid=p.eventid) AND p.cause_eventid IS NULL AND p.r_eventid IS NULL;
 count
-------
 22627
Number of rows in 'problem' table:
SELECT count(*) FROM problem;
 count
--------
 469844
The size of this table on disk is around 85 MB.
- Once per hour we see the housekeeper at ~100% utilization for 10-15 minutes. After investigation we noticed something strange:
housekeeper [deleted 0 hist/trends, 0 items/triggers, 0 events, 0 sessions, 0 alarms, 0 audit items, 0 records in 713.206790 sec, idle for 1 hour(s)]
For now there is nothing to delete, because it is a new environment and has not reached the storage period yet. Still, each run takes quite a long time.
- In config:
HousekeepingFrequency=1
MaxHousekeeperDelete=10000 (our test value; changing it to a larger or smaller value makes no difference to this problem)
Result:
- We increased the log level on the housekeeper; output is in the attachment. The interesting line:
1044522:20230413:124551.911 slow query: 337.193469 sec, "select p1.eventid from problem p1 where p1.r_clock<>0 and p1.r_clock<1681296014 and p1.eventid not in ( select cause_eventid from problem p2 where p1.eventid=p2.cause_eventid) limit 10000"
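The correlated NOT IN subquery in the slow query above is logically equivalent to a plain NOT EXISTS: the subquery can only return values equal to p1.eventid (never NULL), so the usual NOT IN/NULL pitfall cannot change the result. A minimal sketch in Python/sqlite3 (the cut-down 'problem' table and sample rows are invented for illustration) showing the two forms select the same rows:

```python
import sqlite3

# Tiny stand-in for the Zabbix 'problem' table; columns cut down, data invented.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE problem (eventid INTEGER PRIMARY KEY,"
    " r_clock INTEGER, cause_eventid INTEGER)"
)
con.executemany(
    "INSERT INTO problem VALUES (?, ?, ?)",
    [
        (1, 100, None),  # resolved, no cause -> candidate for deletion
        (2, 200, None),  # resolved, but is the cause of event 3 -> kept
        (3, 300, 2),     # symptom of event 2, resolved -> candidate
        (4, 0,   None),  # still open (r_clock = 0) -> skipped
    ],
)

# Housekeeper form: correlated NOT IN subquery.
not_in = con.execute("""
    SELECT p1.eventid FROM problem p1
    WHERE p1.r_clock <> 0 AND p1.r_clock < 1681296014
      AND p1.eventid NOT IN (SELECT cause_eventid FROM problem p2
                             WHERE p1.eventid = p2.cause_eventid)
""").fetchall()

# Equivalent NOT EXISTS form: the subquery only ever yields p1.eventid
# itself, so NULL handling cannot make the two forms diverge.
not_exists = con.execute("""
    SELECT p1.eventid FROM problem p1
    WHERE p1.r_clock <> 0 AND p1.r_clock < 1681296014
      AND NOT EXISTS (SELECT 1 FROM problem p2
                      WHERE p2.cause_eventid = p1.eventid)
""").fetchall()

print(not_in, not_exists)
```

Both forms keep resolved problems that are not the cause of any other problem; the performance difference on PostgreSQL comes from how the correlated subquery is executed, not from the semantics.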
- Check explain on database:
zabbix=# explain analyze select p1.eventid from problem p1 where p1.r_clock<>0 and p1.r_clock<1681396014 and p1.eventid not in ( select cause_eventid from problem p2 where p1.eventid=p2.cause_eventid) limit 10000;
                                                                    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.42..174369641.30 rows=10000 width=8) (actual time=185.701..323153.930 rows=10000 loops=1)
   ->  Index Scan using problem_2 on problem p1  (cost=0.42..1507512730.64 rows=86455 width=8) (actual time=99.502..323057.884 rows=10000 loops=1)
         Index Cond: (r_clock < 1681396014)
         Filter: ((r_clock <> 0) AND (NOT (SubPlan 1)))
         Rows Removed by Filter: 21924
         SubPlan 1
           ->  Seq Scan on problem p2  (cost=0.00..16617.11 rows=1 width=8) (actual time=32.290..32.290 rows=0 loops=10000)
                 Filter: (p1.eventid = cause_eventid)
                 Rows Removed by Filter: 464175
 Planning Time: 0.314 ms
 JIT:
   Functions: 15
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 2.016 ms, Inlining 9.124 ms, Optimization 50.737 ms, Emission 26.105 ms, Total 87.981 ms
 Execution Time: 323162.612 ms
(15 rows)
- After more research we found something interesting. The same query without (we think) the cause-and-symptom check runs fast:
explain analyze select p1.eventid from problem p1 where p1.r_clock<>0 and p1.r_clock<1681396014 limit 10000;
                                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.42..920.69 rows=10000 width=8) (actual time=61.017..67.360 rows=10000 loops=1)
   ->  Index Scan using problem_2 on problem p1  (cost=0.42..15912.80 rows=172911 width=8) (actual time=61.015..66.868 rows=10000 loops=1)
         Index Cond: (r_clock < 1681396014)
         Filter: (r_clock <> 0)
         Rows Removed by Filter: 21914
 Planning Time: 0.155 ms
 Execution Time: 67.626 ms
(7 rows)
We don't use this feature yet, so the 'cause_eventid' column is empty.
Expected:
Perhaps the query could be changed, or the 'problem' table structure adjusted, so that this query runs faster.
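One plausible structural fix, which the linked ZBX-22889 points at, is an index on problem.cause_eventid: the correlated subquery then becomes a per-row index lookup instead of a full table scan per outer row. A hedged sketch in Python/sqlite3 (schema cut down, data invented; PostgreSQL plan output differs, but the scan-vs-index-lookup effect is the same):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE problem (eventid INTEGER PRIMARY KEY,"
    " r_clock INTEGER, cause_eventid INTEGER)"
)
# Index Zabbix already has (used for the r_clock condition).
con.execute("CREATE INDEX problem_2 ON problem (r_clock)")

QUERY = """
    EXPLAIN QUERY PLAN
    SELECT p1.eventid FROM problem p1
    WHERE p1.r_clock <> 0 AND p1.r_clock < 1681396014
      AND p1.eventid NOT IN (SELECT cause_eventid FROM problem p2
                             WHERE p1.eventid = p2.cause_eventid)
"""

def plan():
    # EXPLAIN QUERY PLAN rows carry the human-readable detail last.
    return " | ".join(row[-1] for row in con.execute(QUERY))

before = plan()  # correlated subquery scans 'problem' for every outer row

# Hypothetical fix: index the column the correlated subquery filters on
# (index name invented for illustration).
con.execute("CREATE INDEX problem_cause_eventid ON problem (cause_eventid)")
after = plan()   # subquery now resolved via the new index

print("before:", before)
print("after: ", after)
```

On PostgreSQL the analogue would be `CREATE INDEX ... ON problem (cause_eventid);`, which should turn the Seq Scan in SubPlan 1 of the EXPLAIN above into an index scan; rewriting the housekeeper query as NOT EXISTS would give the planner a similar opportunity.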
- caused by
-
ZBXNEXT-7964 Cause and symptom events
- Closed
- depends on
-
ZBX-22889 Performance problems due to missing problem.cause_eventid index on PostgreSQL
- Closed