Uploaded image for project: 'ZABBIX BUGS AND ISSUES'
  1. ZABBIX BUGS AND ISSUES
  2. ZBX-22675

Problem with housekeeper - table 'problem'

XMLWordPrintable

    • Icon: Problem report Problem report
    • Resolution: Fixed
    • Icon: Major Major
    • 6.4.3rc1, 7.0.0alpha1, 7.0 (plan)
    • 6.4.0
    • Server (S)
    • Sprint 99 (Apr 2023), Sprint 100 (May 2023)
    • 0.5

      Steps to reproduce:

      1. We have very large environment with lot of problems. Most of them was created and solved in small period of time. Example: number of active triggers:
      SELECT count(*) FROM problem p WHERE p.source='0' AND p.object='0' AND NOT EXISTS (SELECT NULL FROM event_suppress es WHERE es.eventid=p.eventid) AND p.cause_eventid IS  NULL AND p.r_eventid IS NULL;                           count
      -------
       22627 

      Number of rows in 'problem' table:

      SELECT count(*) FROM problem ;
       count
      --------
       469844

      Size this table on disk is around 85MB.

      1. Once per hour we see utilization of housekeeper like 100% for 10-15 minutes. After investigation we see something strange:
       housekeeper [deleted 0 hist/trends, 0 items/triggers, 0 events, 0 sessions, 0 alarms, 0 audit items, 0 records in 713.206790 sec, idle for
       1 hour(s)] 

      For now there is nothing to delete, because it's new environment and not meet storage period yet. But is still quite bit long.

      1. In config: 
        HousekeepingFrequency=1
        MaxHousekeeperDelete=10000 (our test value; change that to larger/smaller value nothing change in this kind of problem)

      Result:

      1. Increased log_level on housekeeper. Output in attachment. Interesting line:
      1044522:20230413:124551.911 slow query: 337.193469 sec, "select p1.eventid from problem p1 where p1.r_clock<>0 and p1.r_clock<1681296014 and p1.eventid not in ( select cause_eventid from problem p2 where p1.eventid=p2.cause_eventid) limit 10000" 
      1. Check explain on database:
      zabbix=# explain analyze select p1.eventid from problem p1 where p1.r_clock<>0 and p1.r_clock<1681396014 and p1.eventid not in ( select cause_eventid from problem p2 where p1.eventid=p2.cause_eventid) limit 10000;
                                                                          QUERY PLAN
      ---------------------------------------------------------------------------------------------------------------------------------------------------
       Limit  (cost=0.42..174369641.30 rows=10000 width=8) (actual time=185.701..323153.930 rows=10000 loops=1)
         ->  Index Scan using problem_2 on problem p1  (cost=0.42..1507512730.64 rows=86455 width=8) (actual time=99.502..323057.884 rows=10000 loops=1)
               Index Cond: (r_clock < 1681396014)
               Filter: ((r_clock <> 0) AND (NOT (SubPlan 1)))
               Rows Removed by Filter: 21924
               SubPlan 1
                 ->  Seq Scan on problem p2  (cost=0.00..16617.11 rows=1 width=8) (actual time=32.290..32.290 rows=0 loops=10000)
                       Filter: (p1.eventid = cause_eventid)
                       Rows Removed by Filter: 464175
       Planning Time: 0.314 ms
       JIT:
         Functions: 15
         Options: Inlining true, Optimization true, Expressions true, Deforming true
         Timing: Generation 2.016 ms, Inlining 9.124 ms, Optimization 50.737 ms, Emission 26.105 ms, Total 87.981 ms
       Execution Time: 323162.612 ms
      (15 rows)
      
      1. After more research We check something interesting. The same query without (i thing) checking cause and symptom
        explain analyze select p1.eventid from problem p1 where p1.r_clock<>0 and p1.r_clock<1681396014 limit 10000;
                                                                        QUERY PLAN
        -------------------------------------------------------------------------------------------------------------------------------------------
         Limit  (cost=0.42..920.69 rows=10000 width=8) (actual time=61.017..67.360 rows=10000 loops=1)
           ->  Index Scan using problem_2 on problem p1  (cost=0.42..15912.80 rows=172911 width=8) (actual time=61.015..66.868 rows=10000 loops=1)
                 Index Cond: (r_clock < 1681396014)
                 Filter: (r_clock <> 0)
                 Rows Removed by Filter: 21914
         Planning Time: 0.155 ms
         Execution Time: 67.626 ms
        (7 rows)
        

      We don't use this feature yet, so column 'cause_eventid' is empty.
      Expected:
      Maybe some change this query or something is 'problem' tables structure, so this query will run faster.

        1. housekeeper_log.txt
          25 kB
        2. slow_query.log
          487 kB
        3. ZBX-22675.diff
          0.6 kB

            vso Vladislavs Sokurenko
            aprzybylski Albert Przybylski
            Team A
            Votes:
            6 Vote for this issue
            Watchers:
            11 Start watching this issue

              Created:
              Updated:
              Resolved: