On Fri, Mar 20, 2020 at 12:03:17PM -0700, Andres Freund wrote:
> Hi,
>
> On 2020-03-20 12:42:31 -0500, Justin King wrote:
> > When we get into this state again, is there some other information
> > (other than what is in pg_stat_statement or pg_stat_activity) that
> > would be useful for folks here to help understand what is going on?
>
> If it's actually stuck on a single table, and that table is not large,
> it would be useful to get a backtrace with gdb.

FTR, we're facing a very similar issue at work (adding Michael and Kevin
in Cc) during performance tests since a recent upgrade to pg12.

What seems to be happening is that after reaching 200M transactions, a
first pass of autovacuum freeze is run, bumping pg_database.datfrozenxid
by ~ 800k (age(datfrozenxid) still being more than
autovacuum_freeze_max_age afterwards).

After that point, all available information seems to indicate that no
autovacuum worker is scheduled anymore:

- log_autovacuum_min_duration is set to 0 and no activity is logged
  (while there were thousands of such entries per hour before that)

- 15 min interval snapshots of pg_database and pg_class show that the
  age of datfrozenxid/relfrozenxid keeps increasing at a consistent rate
  and never goes down

- 15 min interval snapshots of pg_stat_activity don't show any
  autovacuum worker

- the autovacuum launcher is up and running and doesn't show any sign of
  problem

- n_mod_since_analyze keeps growing at a consistent rate, never going
  down

- 15 min deltas of tup_updated and tup_deleted show that the global
  write activity doesn't change before and after the autovacuum problem

(A sketch of the kind of queries behind those snapshots is at the end of
this mail.)

The situation continues for ~2h, at which point the bloat is so heavy
that the main filesystem becomes full, and postgres panics after a
failed write in the pg_logical directory or similar.

The same bench was run against pg11 many times and never triggered this
issue.  So far our best guess is a side effect of 2aa6e331ead7.

Michael and I have been trying to reproduce this issue locally
(drastically reducing the various freeze_age parameters) for hours, but
no luck for now.

This is using a vanilla pg 12.1, with some OLTP workload.  The only
possibly relevant configuration changes are quite aggressive autovacuum
settings on some tables (no cost delay, analyze/vacuum threshold set to
1k and analyze/vacuum scale factor set to 0, for both the heap and its
toast table).
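
For reference, the snapshots mentioned above boil down to queries of
roughly this shape (a simplified sketch, not the exact statements we run;
the LIMIT and column list are just for illustration):

    -- per-database and per-table freeze age, plus pending analyze work
    SELECT datname, datfrozenxid, age(datfrozenxid) FROM pg_database;

    SELECT c.relname, c.relfrozenxid, age(c.relfrozenxid),
           s.n_mod_since_analyze
    FROM pg_class c
    JOIN pg_stat_all_tables s ON s.relid = c.oid
    WHERE c.relkind IN ('r', 'm', 't')
    ORDER BY age(c.relfrozenxid) DESC
    LIMIT 20;

    -- any autovacuum launcher/worker currently running?
    SELECT pid, backend_type, state, query
    FROM pg_stat_activity
    WHERE backend_type IN ('autovacuum launcher', 'autovacuum worker');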
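
And for completeness, the aggressive per-table settings are essentially
of this shape (the table name here is made up, only the values match
what is described above):

    ALTER TABLE some_hot_table SET (
        autovacuum_vacuum_cost_delay = 0,
        autovacuum_vacuum_threshold = 1000,
        autovacuum_analyze_threshold = 1000,
        autovacuum_vacuum_scale_factor = 0,
        autovacuum_analyze_scale_factor = 0,
        toast.autovacuum_vacuum_cost_delay = 0,
        toast.autovacuum_vacuum_threshold = 1000,
        toast.autovacuum_vacuum_scale_factor = 0
    );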