On 14/05/2021 at 13:06, Marcin Gozdalik wrote:
Hi
I am trying to use `pgmetrics` on a big (10TB+), busy (1GB/s RW) database. It takes around 5 minutes for `pgmetrics` to run. I traced the slowness to the "bloat query" (a version of https://wiki.postgresql.org/wiki/Show_database_bloat) spinning on CPU and doing no I/O.
The underlying cause is a bloated `pg_class` (the irony: `pgmetrics` does not collect bloat statistics for `pg_catalog`):
`vacuum (full, analyze, verbose) pg_class;`
```
INFO: vacuuming "pg_catalog.pg_class"
INFO: "pg_class": found 1 removable, 7430805 nonremovable row versions in 158870 pages
DETAIL: 7429943 dead row versions cannot be removed yet.
CPU 1.36s/6.40u sec elapsed 9.85 sec.
INFO: analyzing "pg_catalog.pg_class"
INFO: "pg_class": scanned 60000 of 158869 pages, containing 295 live rows and 2806547 dead rows; 295 rows in sample, 781 estimated total rows
VACUUM
```
`pg_class` has so many dead rows because the workload is temp-table heavy (creating and destroying 1M+ temporary tables per day) and also includes long-running analytics queries that run for 24h+.
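For what it's worth, the backends whose open snapshots keep those dead rows "nonremovable" can be spotted with something like the following (an illustrative query, not part of `pgmetrics`):
```
-- Sketch: list backends whose open transactions hold back the xmin horizon
-- and therefore keep VACUUM from removing dead pg_class rows.
SELECT pid, state, xact_start, now() - xact_start AS xact_age,
       backend_xmin, left(query, 60) AS query
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY xact_start
LIMIT 10;
```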
The PG query planner assumes that an index scan on `pg_class` will be very quick and plans a Nested Loop with an Index Scan. However, the index scan has 7M dead tuples to filter out, and the query takes more than 200 seconds (https://explain.depesz.com/s/bw2G).
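The misestimate is visible in the statistics the planner works from; a quick check like the one below (illustrative only) shows a few hundred estimated rows next to millions of dead tuples:
```
-- Sketch: compare the planner's view of pg_class with the dead tuples
-- the index scan actually has to filter out.
SELECT c.relname, c.reltuples, c.relpages, s.n_live_tup, s.n_dead_tup
FROM pg_class c
JOIN pg_stat_sys_tables s ON s.relid = c.oid
WHERE c.relname = 'pg_class';
```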
If I create a temp table from `pg_class` to contain only the live tuples:
```
CREATE TEMPORARY TABLE pg_class_alive AS SELECT oid,* from pg_class;
CREATE UNIQUE INDEX pg_class_alive_oid_index ON pg_class_alive(oid);
CREATE UNIQUE INDEX pg_class_alive_relname_nsp_index ON pg_class_alive(relname, relnamespace);
CREATE INDEX pg_class_tblspc_relfilenode_index ON pg_class_alive(reltablespace, relfilenode);
ANALYZE pg_class_alive;
```
and run the bloat query on `pg_class_alive` instead of `pg_class`:
```
SELECT
  nn.nspname AS schemaname,
  cc.relname AS tablename,
  COALESCE(cc.reltuples,0) AS reltuples,
  COALESCE(cc.relpages,0) AS relpages,
  COALESCE(CEIL((cc.reltuples*((datahdr+8-
    (CASE WHEN datahdr%8=0 THEN 8 ELSE datahdr%8 END))+nullhdr2+4))/(8192-20::float)),0) AS otta
FROM
  pg_class_alive cc
  JOIN pg_namespace nn ON cc.relnamespace = nn.oid AND nn.nspname <> 'information_schema'
  LEFT JOIN (
    SELECT
      foo.nspname, foo.relname,
      (datawidth+32)::numeric AS datahdr,
      (maxfracsum*(nullhdr+8-(CASE WHEN nullhdr%8=0 THEN 8 ELSE nullhdr%8 END))) AS nullhdr2
    FROM (
      SELECT
        ns.nspname, tbl.relname,
        SUM((1-coalesce(null_frac,0))*coalesce(avg_width, 2048)) AS datawidth,
        MAX(coalesce(null_frac,0)) AS maxfracsum,
        23+(
          SELECT 1+count(*)/8
          FROM pg_stats s2
          WHERE null_frac<>0 AND s2.schemaname = ns.nspname AND s2.tablename = tbl.relname
        ) AS nullhdr
      FROM pg_attribute att
      JOIN pg_class_alive tbl ON att.attrelid = tbl.oid
      JOIN pg_namespace ns ON ns.oid = tbl.relnamespace
      LEFT JOIN pg_stats s ON s.schemaname = ns.nspname
        AND s.tablename = tbl.relname
        AND s.inherited = false
        AND s.attname = att.attname
      WHERE att.attnum > 0 AND tbl.relkind = 'r'
      GROUP BY 1,2
    ) AS foo
  ) AS rs ON cc.relname = rs.relname AND nn.nspname = rs.nspname
  LEFT JOIN pg_index i ON indrelid = cc.oid
  LEFT JOIN pg_class_alive c2 ON c2.oid = i.indexrelid
```
it runs in 10s, 20x faster (https://explain.depesz.com/s/K4SH).
The rabbit hole probably goes deeper (e.g. the same should probably be done for pg_statistic and pg_attribute, plus creating a new pg_stats view on top of them).
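As a sketch of what that next step might look like (same pattern as `pg_class_alive` above, not something I have tried at scale):
```
-- Sketch: a live-tuples-only copy of pg_attribute, mirroring the
-- pg_class_alive hack; a pg_stats-like view would then join these copies.
CREATE TEMPORARY TABLE pg_attribute_alive AS SELECT * FROM pg_attribute;
CREATE UNIQUE INDEX pg_attribute_alive_relid_attnum_index
    ON pg_attribute_alive(attrelid, attnum);
ANALYZE pg_attribute_alive;
```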
I am not able (at least not quickly) to change the number of temporary tables created or to make the analytics queries finish faster. Apart from the above hack of copying the live tuples into a separate table, is there anything I could do?
Hi,
To avoid bloating your catalog with temporary tables you can try using https://github.com/darold/pgtt-rsl. I don't know if it will meet your performance requirements, but at least you will not bloat the catalog anymore.
About your hack, I don't see any other solution except running VACUUM on the catalog tables more often, but I guess that this is already done or not possible. But not bloating the catalog to such a level is the right solution.
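If running VACUUM on the catalogs more often is an option, it could look something like this (just a sketch; a plain VACUUM takes no exclusive lock, but it still cannot remove rows that remain visible to a long-running transaction):
```
-- Sketch: vacuum the catalog tables that suffer most from temp-table churn.
VACUUM (ANALYZE) pg_catalog.pg_class;
VACUUM (ANALYZE) pg_catalog.pg_attribute;
VACUUM (ANALYZE) pg_catalog.pg_type;
VACUUM (ANALYZE) pg_catalog.pg_depend;
```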
--
Gilles Darold
http://www.darold.net/