Finding out if 'vacuum --analyze-in-stages' has generated enough stats

Vineet Naik <naikvin@xxxxxxxxx> · Wed, 31 Mar 2021 12:38:14 +0530

Hello,

Let me provide some context. I am evaluating `pg_upgrade` for
upgrading a pg 9.3 server to 9.6. There are multiple dbs, only one of
which is large (~3.5 TB with indexes) and the others are pretty small.

For a trial run and to understand the approximate duration of
downtime, I restored a basebackup on a standalone node and ran
`pg_upgrade` with `--link` flag. This command completed in only ~30s.
But after that the `analyze_new_cluster.sh` script took 130 hours to
complete. This much downtime is definitely not acceptable for us.

Looking inside the `analyze_new_cluster` script, it runs

"/usr/lib/postgresql/9.6/bin/vacuumdb" --all --analyze-in-stages

to generate statistics which are used for determining efficient query
execution plans. So I'm assuming that resuming reads to the server
without running ANALYZE would result in considerable performance
degradation.

Now, only 2 of the tables across all dbs contribute to most of the
size (738 GB + 916 GB). If `vacuumdb` is run separately for smaller
tables then the stats for those will be generated quickly rather than
waiting for each stage to complete for the 2 large tables. So I'm
planning to use this approach.

If I run `vacuumdb --analyze-in-stages` for largest table, then it prints,

vacuumdb: processing database "<dbname>": Generating minimal optimizer
statistics (1 target)

And then nothing gets printed for several hours. It's not clear if
it's taking that long to process 1 target or if it has started
processing for 10 targets. Since it says "processing database" I'm
assuming it has not yet completed processing for 1 target.

As per documentation about --analyze-in-stages,

> This option will try to create some statistics as fast as possible, to make the database usable, and then produce full statistics in the subsequent stages.

So my question is, after which stage is it ok to send traffic to the
server without much performance impact? Also, is it possible to query
the stats view in pg and determine if enough stats have been
generated?

Thanks,
Vineet