On Fri, Mar 20, 2020 at 12:03:17PM -0700, Andres Freund wrote:
> Hi,
>
> On 2020-03-20 12:42:31 -0500, Justin King wrote:
> > When we get into this state again, is there some other information
> > (other than what is in pg_stat_statement or pg_stat_activity) that
> > would be useful for folks here to help understand what is going on?
>
> If it's actually stuck on a single table, and that table is not large,
> it would be useful to get a backtrace with gdb.

FTR, we're facing a very similar issue at work (adding Michael and Kevin
in Cc) during performance tests since a recent upgrade to pg12.

What seems to be happening is that after reaching 200M transactions, a
first pass of autovacuum freeze is run, bumping pg_database.datfrozenxid
by ~ 800k (age(datfrozenxid) still being more than
autovacuum_freeze_max_age afterwards).

After that point, all available information seems to indicate that no
autovacuum worker is scheduled anymore:

- log_autovacuum_min_duration is set to 0 and no activity is logged
  (while there were thousands of such entries per hour before that)

- 15 min interval snapshots of pg_database and pg_class show that the
  age of datfrozenxid/relfrozenxid keeps increasing at a consistent rate
  and never goes down

- 15 min interval snapshots of pg_stat_activity don't show any
  autovacuum worker

- the autovacuum launcher is up and running and doesn't show any sign of
  problem

- n_mod_since_analyze keeps growing at a consistent rate, never going
  down

- 15 min deltas of tup_updated and tup_deleted show that the global
  write activity doesn't change before and after the autovacuum problem

(A sketch of the kind of queries behind those snapshots is at the end of
this mail.)

The situation continues for ~2h, at which point the bloat is so heavy
that the main filesystem becomes full, and postgres panics after a
failed write in the pg_logical directory or similar.

The same bench was run against pg11 many times and never triggered this
issue.  So far our best guess is a side effect of 2aa6e331ead7.

Michael and I have been trying to reproduce this issue locally
(drastically reducing the various freeze_age parameters) for hours, but
no luck for now.

This is using a vanilla pg 12.1, with some OLTP workload.  The only
possibly relevant configuration changes are quite aggressive autovacuum
settings on some tables (no cost delay, analyze/vacuum threshold set to
1k and analyze/vacuum scale factor set to 0, for both the heap and its
toast table).
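
For reference, the snapshots mentioned above boil down to queries of
roughly this shape (a simplified sketch, not the exact statements we run;
the LIMIT and column list are just for illustration):

    -- per-database and per-table freeze age, plus pending analyze work
    SELECT datname, datfrozenxid, age(datfrozenxid) FROM pg_database;

    SELECT c.relname, c.relfrozenxid, age(c.relfrozenxid),
           s.n_mod_since_analyze
    FROM pg_class c
    JOIN pg_stat_all_tables s ON s.relid = c.oid
    WHERE c.relkind IN ('r', 'm', 't')
    ORDER BY age(c.relfrozenxid) DESC
    LIMIT 20;

    -- any autovacuum launcher/worker currently running?
    SELECT pid, backend_type, state, query
    FROM pg_stat_activity
    WHERE backend_type IN ('autovacuum launcher', 'autovacuum worker');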
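
And for completeness, the aggressive per-table settings are essentially
of this shape (the table name here is made up, only the values match
what is described above):

    ALTER TABLE some_hot_table SET (
        autovacuum_vacuum_cost_delay = 0,
        autovacuum_vacuum_threshold = 1000,
        autovacuum_analyze_threshold = 1000,
        autovacuum_vacuum_scale_factor = 0,
        autovacuum_analyze_scale_factor = 0,
        toast.autovacuum_vacuum_cost_delay = 0,
        toast.autovacuum_vacuum_threshold = 1000,
        toast.autovacuum_vacuum_scale_factor = 0
    );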