ZABBIX BUGS AND ISSUES / ZBX-19897

Bug in template PostgreSQL by agent2 dependent item



      Master task for PostgreSQL issues:

      (1) from https://support.zabbix.com/browse/ZBX-19873
      In database discovery there is an item "Checksum failures".
      It has two preprocessing steps: JSONPath and Change per second.
      It is possible that the DB returns NULL for checksum failures; in this scenario item preprocessing fails and the item becomes unsupported.
      In addition, because other parameters in the master item keep changing, the error text changes every time, which ends up filling the log files with long error messages.

      Please add another preprocessing step to validate the data or even replace NULL with 0 to avoid this.
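
      One way to avoid the NULL at the source would be to coalesce the value in the query that feeds the master item. A minimal sketch, assuming the data comes from pg_stat_database (checksum_failures is NULL there when data checksums are disabled); the template's actual query may differ:

      -- Sketch: coalesce checksum_failures to 0 so the dependent item's
      -- JSONPath and Change-per-second preprocessing always get a number.
      SELECT datname, coalesce(checksum_failures, 0) AS checksum_failures
      FROM pg_stat_database
      WHERE datname IS NOT NULL;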

      (2) from https://support.zabbix.com/browse/ZBX-19735

      Hi!
      I have recovered a pgsql server from backup (point-in-time recovery).

      postgres=# select pg_last_xact_replay_timestamp();
       pg_last_xact_replay_timestamp 
      -------------------------------
       2021-06-26 22:11:23.259069+03
      

      Item "Replication: lag in seconds":

      pgsql.replication.lag.sec["{$PG.HOST}","{$PG.PORT}","{$PG.USER}","{$PG.DB}"]
      

      And it looks like the trigger is not working correctly for me.
      I got error events like

      PostgreSQL: Streaming lag with {#MASTER} is too high (over {$PG.REPL_LAG.MAX.WARN} in 5m)
      

      In my case I changed the user parameter SQL from

      CASE WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()
      

      to

      CASE WHEN pg_last_wal_receive_lsn() is NULL or pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()
      

      And it corrects the issue.
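
      For illustration, a minimal sketch of the corrected condition in context; the exact query shipped with the template may differ:

      -- Sketch: on a server restored via point-in-time recovery,
      -- pg_last_wal_receive_lsn() returns NULL, so guard against that
      -- before comparing it with the replay position.
      SELECT CASE
               WHEN pg_last_wal_receive_lsn() IS NULL
                    OR pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
               ELSE extract(epoch FROM (now() - pg_last_xact_replay_timestamp()))
             END AS lag_seconds;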

      Hope you can update this script.
      Thank you!

      (3) from https://support.zabbix.com/browse/ZBX-19734 

      In PostgreSQL Agent 2 template documentation:

      https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/db/postgresql_agent2?at=release/5.0

      In section

      1. Create PostgreSQL user for monitoring (<password> at your discretion):

      An important GRANT statement is missing:

      GRANT EXECUTE ON FUNCTION pg_ls_waldir() TO zbx_monitor;

      Otherwise the following error appears in the frontend:
      Cannot fetch data: ERROR: permission denied for function pg_ls_waldir (SQLSTATE 42501).
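
      For reference, a sketch of the monitoring-user setup with the missing grant added; the exact statements in the documentation may differ, and <password> stays a placeholder:

      -- Sketch: the pg_monitor role covers the statistics views, but
      -- pg_ls_waldir() still needs an explicit EXECUTE grant here.
      CREATE USER zbx_monitor WITH PASSWORD '<password>' INHERIT;
      GRANT pg_monitor TO zbx_monitor;
      GRANT EXECUTE ON FUNCTION pg_ls_waldir() TO zbx_monitor;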

      (4) from https://support.zabbix.com/browse/ZBX-19672

      When using the template PostgreSQL by Zabbix agent 2, the following error appears in the database (slave and master) log:

      ERROR: column "received_lsn" does not exist at character 36
      HINT: Perhaps you meant to reference the column "pg_stat_wal_receiver.received_tli".
      STATEMENT: SELECT pg_catalog.pg_wal_lsn_diff (received_lsn, pg_last_wal_replay_lsn())
      FROM pg_stat_wal_receiver;
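
      One possible way to compute the same difference without referencing the renamed column (received_lsn became written_lsn in PostgreSQL 13) is sketched below; the template may solve it differently, e.g. by branching on the server version:

      -- Sketch: use pg_last_wal_receive_lsn() instead of the
      -- pg_stat_wal_receiver column whose name changed between versions.
      SELECT pg_catalog.pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                                        pg_last_wal_replay_lsn()) AS lag_bytes;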

      (5) from https://support.zabbix.com/browse/ZBX-18889
      Steps to reproduce:

      1. Install PostgreSQL 13 primary and standby instances and add them with the agent2 template
      2. Navigate to items of standby host
      3. Change filter to not supported
      4. You will see that the item "Template DB PostgreSQL Agent 2: Replication: Lag in bytes" is not supported, because the received_lsn column was renamed to written_lsn in PostgreSQL 13.
      5. There is also a problem with the item "Template DB PostgreSQL Agent 2: PostgreSQL: Get bgwriter: Checkpoint: Checkpoint sync time": in the returned JSON the key sync_time is actually called checkpoint_sync_time. It can be changed in the template; just letting you know.

      Result:
      See screenshot...

      Expected:
      Replication lag works with PostgreSQL 13
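
      For reference, a minimal sketch (not the template's actual query) of the bgwriter statistics rendered as JSON, showing the column names the dependent items' JSONPath expressions have to match:

      -- Sketch: pg_stat_bgwriter exposes the checkpoint timings as
      -- checkpoint_write_time and checkpoint_sync_time, so the dependent
      -- item needs a JSONPath such as $.checkpoint_sync_time.
      SELECT row_to_json(t)
      FROM (SELECT checkpoints_timed, checkpoints_req,
                   checkpoint_write_time, checkpoint_sync_time
            FROM pg_stat_bgwriter) t;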

      (6)

      Bug in the template PostgreSQL by agent2, dependent item "WAL: Segments count": the preprocessing JSONPath $.write has to be JSONPath $.count.
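
      A sketch of the kind of JSON the master WAL item is assumed to return (key names inferred from the JSONPath expressions above; the shipped query may differ), which is why the segments-count item must read $.count rather than $.write:

      -- Hypothetical sketch: one query producing both keys, "write" for the
      -- WAL write position in bytes and "count" for the number of segments.
      SELECT row_to_json(t)
      FROM (SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0') AS write,
                   count(*) AS count
            FROM pg_ls_waldir()
            WHERE name ~ '^[0-9A-F]{24}$') t;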


      (7)

      Archiving metrics

      Count of files in archive_status that need to be archived (pgsql.archive.count_files_to_archive)
      Size of files that need to be archived (pgsql.archive.size_files_to_archive)

      These metrics are calculated by directly listing the filesystem directory, which relies on functions that are by default available only to the superuser. Information from system views and functions could be used instead.
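
      For example, the archiver progress is available from a system view readable without superuser rights; this is a sketch, not an exact drop-in replacement for the files-to-archive counts:

      -- Sketch: archiver statistics from pg_stat_archiver; a ready-to-archive
      -- count would still have to be approximated, e.g. from the current WAL
      -- position versus last_archived_wal.
      SELECT archived_count,
             failed_count,
             last_archived_wal,
             last_failed_wal
      FROM pg_stat_archiver;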

      Connection metrics

      Connections sum: Total (pgsql.connections.total)
      Connections sum: Total% (pgsql.connections.total_pct)
      Connections sum: Waiting (pgsql.connections.waiting)

      The calculation does not exclude system processes.
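
      A sketch of how the connection counts could exclude system processes, assuming PostgreSQL 10+ where pg_stat_activity has a backend_type column (the exact definition of "waiting" here is an assumption):

      -- Sketch: count only client backends, so autovacuum workers, background
      -- writers and other system processes do not inflate the totals.
      SELECT count(*)                                       AS total,
             count(*) FILTER (WHERE wait_event IS NOT NULL) AS waiting
      FROM pg_stat_activity
      WHERE backend_type = 'client backend';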

      Cache hit (pgsql.cache.hit["{$PG.URI}"])
      The metric is calculated over the entire running time of the cluster. Mamonsu got around this by creating a calculated item that takes deltas of other metrics (pgsql.dbstat.sum.blks_hit.rate, pgsql.dbstat.sum.blks_read.rate).
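
      For context, a sketch of the cumulative ratio the current item effectively reports; an interval-based ratio would instead be built on the Zabbix side from the deltas (rates) of the two sums:

      -- Sketch: buffer-cache hit ratio accumulated since the last stats reset.
      -- Per-interval deltas of blks_hit and blks_read give a ratio for the
      -- recent interval instead.
      SELECT round(100.0 * sum(blks_hit) / nullif(sum(blks_hit) + sum(blks_read), 0), 2)
             AS cache_hit_pct
      FROM pg_stat_database;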

      Flush lag, write lag and replay lag per sender process (pgsql.replication.process[uri,username,password])
      The item is keyed by the PID of the WAL sender, which constantly changes. As a result, over time a huge number of lag metrics accumulate for the host.
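
      A sketch of how the per-sender lags could be keyed by a more stable identifier than the walsender PID (using application_name and client_addr here is an assumption; any stable attribute of the standby would do):

      -- Sketch: report lags per standby using application_name and client_addr
      -- instead of the walsender PID, which changes on every reconnect.
      SELECT application_name,
             client_addr,
             extract(epoch FROM write_lag)  AS write_lag_sec,
             extract(epoch FROM flush_lag)  AS flush_lag_sec,
             extract(epoch FROM replay_lag) AS replay_lag_sec
      FROM pg_stat_replication;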
      There are also a couple of metrics that raised questions. They are not erroneous as such, but they are not the most convenient or understandable. Metrics:

      Number of standby servers (pgsql.replication.count[uri,username,password])
      The meaning of the metric is not entirely correct. It is stated that standby servers are counted; however, if there are multiple subscriptions to a logical replica, they are all counted as different servers.
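
      One possible refinement, sketched under the assumption that each standby host connects from a distinct address (so several logical-replication subscriptions from the same subscriber are no longer counted separately):

      -- Sketch: count distinct replication client addresses rather than
      -- walsender processes; NULL addresses (local socket connections) are
      -- ignored by count(DISTINCT ...), a known limitation of this approach.
      SELECT count(DISTINCT client_addr) AS standby_count
      FROM pg_stat_replication;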

      PostgreSQL uptime, in milliseconds (pgsql.uptime[<commonParams>])
      It would be more convenient to measure it as unixtime and display the launch date and time.
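
      A sketch of the alternative: expose the server start time as a Unix timestamp, so uptime can be derived on the monitoring side and the launch date and time displayed directly:

      -- Sketch: postmaster start time as a Unix timestamp.
      SELECT extract(epoch FROM pg_postmaster_start_time())::bigint AS start_time;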

            vkhaliev Vyacheslav Khaliev
            alarionovs Aleksandrs Larionovs (Inactive)
            Team INT