Hi,

On Mon, Mar 23, 2020 at 09:23:03AM -0700, Andres Freund wrote:
> Hi,
>
> On 2020-03-23 16:22:47 +0100, Julien Rouhaud wrote:
> > On Fri, Mar 20, 2020 at 12:03:17PM -0700, Andres Freund wrote:
> > > Hi,
> > >
> > > On 2020-03-20 12:42:31 -0500, Justin King wrote:
> > > > When we get into this state again, is there some other information
> > > > (other than what is in pg_stat_statement or pg_stat_activity) that
> > > > would be useful for folks here to help understand what is going on?
> > >
> > > If it's actually stuck on a single table, and that table is not large,
> > > it would be useful to get a backtrace with gdb.
> >
> > FTR, we're facing a very similar issue at work (adding Michael and Kevin
> > in Cc) during performance tests since a recent upgrade to pg12.
> >
> > What seems to be happening is that after reaching 200M transactions a
> > first pass of autovacuum freeze is being run, bumping
> > pg_database.datfrozenxid by ~ 800k (age(datfrozenxid) still being more
> > than autovacuum_freeze_max_age afterwards).
>
> If you have older transactions around that'd not be surprising. Do you
> have autovacuum logging output for this case?

There's an hourly long-running query that can hold back xmin by up to a few
million xids, but definitely nothing close to 200M.

When I said a first pass, I meant a batch of vacuums. Here's an overview of
the per-hour counts of "to prevent wraparound" log entries (UTC timezone):

   442 2020-03-14 07
  1686 2020-03-14 08
    14 2020-03-14 10

and an overview of all autovacuum activity (still in UTC):

  1366 2020-03-14 00
  1457 2020-03-14 01
  1387 2020-03-14 02
  1440 2020-03-14 03
  1349 2020-03-14 04
  7383 2020-03-14 05
 13909 2020-03-14 06
 14240 2020-03-14 07
  2094 2020-03-14 08
     0 2020-03-14 09
    16 2020-03-14 10
     3 2020-03-14 11
     4 2020-03-14 12
     3 2020-03-14 13

The final outage being:

2020-03-14 10:27:23.280 UTC [...] ERROR:  could not extend file "base/16386/20245.4": No space left on device
536222:2020-03-14 10:40:00.089 UTC [...]
PANIC:  could not write to file "pg_logical/replorigin_checkpoint.tmp": No space left on device

the last autovacuum evidence before that being:

274177:2020-03-14 08:54:11.797 UTC 5e6c8ed0.d665 0 LOG:  automatic vacuum of table "pg_toast.pg_toast_20237": index scans: 1
274178-	pages: 0 removed, 273666 remain, 0 skipped due to pins, 251545 skipped frozen
274179-	tuples: 83585 removed, 749 remain, 209 are dead but not yet removable, oldest xmin: 210363848
274180-	buffer usage: 50096 hits, 23521 misses, 19996 dirtied
274181-	avg read rate: 99.195 MB/s, avg write rate: 84.329 MB/s
274182-	system usage: CPU: user: 0.28 s, system: 0.21 s, elapsed: 1.85 s

After the crash-and-restart, autovacuum is working again (as seen in the
10AM - 1PM logs), although the bench stays stopped.

What I can see in pg_database is (GMT+1 here):

           timestamp           | datfrozenxid | age(datfrozenxid)
-------------------------------+--------------+-------------------
 [...]
 2020-03-14 09:12:11.279515+01 |          480 |         202554804
 2020-03-14 09:27:12.723617+01 |          480 |         205408276
 2020-03-14 09:42:13.868701+01 |          480 |         208239967
 2020-03-14 09:57:15.685827+01 |       827585 |         210143294
 2020-03-14 10:12:17.488993+01 |       827585 |         213143797
 2020-03-14 10:27:18.899525+01 |       827585 |         216104516
 2020-03-14 10:42:19.926601+01 |       827585 |         219075040
 2020-03-14 10:57:21.023513+01 |       827585 |         222085423
 2020-03-14 11:12:22.85198+01  |       827585 |         225057731

> > After that point, all available information seems to indicate that no
> > autovacuum worker is scheduled anymore:
>
> Do you mean that this table doesn't get autovac'ed at all anymore, that
> no table is getting autovac'd, or just that there's nothing further
> increasing relfrozenxid for that table?
>
> It sounds like:
>
> > - log_autovacuum_min_duration is set to 0 and no activity is logged (while
> >   having thousands of those per hour before that)
>
> no table at all?

Correct: no table at all is being autovacuumed, and no sign of autovacuum
being scheduled or anything.
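As an aside, the pattern in the pg_database snapshots above (datfrozenxid frozen at 827585 while its age keeps climbing ~3M per 15 minutes) is exactly what you'd expect from how XID age is computed on the 32-bit transaction-id circle. A minimal sketch of that arithmetic (my own toy model, not code from the PostgreSQL source):

```python
# Sketch (mine, not PostgreSQL source) of 32-bit XID age arithmetic: xid_age()
# is the distance from an old XID to the current "next" XID, modulo 2^32.
# The real server-side age() also special-cases frozen/bootstrap XIDs.

XID_MASK = 0xFFFFFFFF  # transaction ids are 32 bits and wrap around

def xid_age(next_xid: int, frozen_xid: int) -> int:
    """How far frozen_xid lies behind next_xid on the 2^32 XID circle."""
    return (next_xid - frozen_xid) & XID_MASK

# With datfrozenxid stuck at 827585 (as in the snapshots above), the age
# simply tracks XID consumption; only a vacuum that actually advances
# datfrozenxid can bring it back down.
assert xid_age(210_970_879, 827_585) == 210_143_294   # matches the 09:57 row
# and the subtraction wraps correctly around the 2^32 boundary:
assert xid_age(5, 0xFFFFFFF0) == 21
```

So the growing age by itself only tells us XIDs are being consumed; the stuck datfrozenxid is the actual symptom.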
> > - 15 min interval snapshots of pg_database and pg_class show that
> >   datfrozenxid/relfrozenxid keeps increasing at a consistent rate and
> >   never goes down
>
> I assume you mean their age?

Yes, sorry.

> What is:
> - datfrozenxid, age(datfrozenxid) for the oldest database
>   SELECT datname, age(datfrozenxid), datfrozenxid FROM pg_database ORDER BY age(datfrozenxid) DESC LIMIT 1;

See above. FTR there's only one database being used.

> - relfrozenxid, age(relfrozenxid) for the oldest table in the oldest database
>   SELECT oid::regclass, age(relfrozenxid), relfrozenxid FROM pg_class WHERE relfrozenxid <> 0 ORDER BY age(relfrozenxid) DESC LIMIT 1;

The VM has been trashed since, and I don't have that level of detail
available in the gathered stats, unfortunately (and the available information
I have is a little bit messy, sorry about that).

> - Oldest backend xmin
>   SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_activity WHERE backend_xmin <> 0 ORDER BY age(backend_xmin) DESC LIMIT 3;
> - oldest replication xmin:
>   SELECT pid, backend_xmin, age(backend_xmin) FROM pg_stat_replication WHERE backend_xmin <> 0 ORDER BY age(backend_xmin) DESC LIMIT 3;
> - oldest slot xmin:
>   SELECT slot_name, xmin, age(xmin), catalog_xmin, age(catalog_xmin) FROM pg_replication_slots WHERE xmin <> 0 OR catalog_xmin <> 0 ORDER BY greatest(age(xmin), age(catalog_xmin)) DESC LIMIT 3;
> - oldest prepared transaction (unfortunately xmin not available)
>   SELECT gid, database, transaction FROM pg_prepared_xacts ORDER BY age(transaction) LIMIT 3;

I have output from this query:
https://github.com/OPMDG/check_pgactivity/blob/oldest_xmin/check_pgactivity#L5669-L5732
(which may be dumb).
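Conceptually, the queries above are probing the same thing: the effective xmin horizon is the oldest xmin across all the sources that can hold it back (backends, walsenders, replication slots, prepared transactions). A toy sketch of that aggregation, ignoring 32-bit wraparound for brevity (names are illustrative, the real computation happens server-side):

```python
# Toy model of the xmin horizon: the minimum over every xmin-holding source
# the queries above inspect.  Ignores XID wraparound for simplicity.

def xmin_horizon(backend_xmins, replication_xmins, slot_xmins, prepared_xids):
    """Oldest xmin across all sources, or None if nothing holds one."""
    candidates = [*backend_xmins, *replication_xmins, *slot_xmins, *prepared_xids]
    return min(candidates) if candidates else None

# E.g. the 08:12 snapshot below: a long-running query pins the horizon
# ~2.5M XIDs behind the newest transactions.
assert xmin_horizon([191022091, 188484737], [], [], []) == 188484737
```

Any one of those four sources being stuck is enough to prevent vacuum from removing dead tuples newer than it.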
Here are the non-null rows after 8AM GMT+1:

           timestamp           |    kind     |   age   |   x_min
-------------------------------+-------------+---------+-----------
 2020-03-14 08:12:04.082523+01 | active_xact |     968 | 191022091
 2020-03-14 08:12:04.082523+01 | idle_xact   |     968 | 191022091
 2020-03-14 08:12:04.082523+01 | query       | 2538322 | 188484737
 2020-03-14 08:27:07.046938+01 | active_xact |     606 | 193917749
 2020-03-14 08:27:07.046938+01 | idle_xact   |     606 | 193917749
 2020-03-14 08:27:07.046938+01 | query       |   39646 | 193878709
 2020-03-14 08:42:08.902248+01 | active_xact |     502 | 196764934
 2020-03-14 08:42:08.902248+01 | idle_xact   |     502 | 196764934
 2020-03-14 08:42:08.902248+01 | query       |   12787 | 196752649
 2020-03-14 08:57:10.089606+01 | active_xact |      91 | 199684930
 2020-03-14 08:57:10.089606+01 | idle_xact   |      91 | 199684930
 2020-03-14 08:57:10.089606+01 | query       |   25453 | 199659568
 2020-03-14 09:12:11.279515+01 | active_xact |     264 | 202555020
 2020-03-14 09:12:11.279515+01 | idle_xact   |     264 | 202555020
 2020-03-14 09:12:11.279515+01 | query       | 2316084 | 200239200
 2020-03-14 09:27:12.723617+01 | active_xact |      90 | 205408666
 2020-03-14 09:27:12.723617+01 | idle_xact   |      90 | 205408666
 2020-03-14 09:27:12.723617+01 | query       | 5169556 | 200239200
 2020-03-14 09:42:13.868701+01 | active_xact |     381 | 208240066
 2020-03-14 09:42:13.868701+01 | idle_xact   |     381 | 208240066
 2020-03-14 09:42:13.868701+01 | query       |  934819 | 207305628
 2020-03-14 09:57:15.685827+01 | active_xact |     791 | 210970088
 2020-03-14 09:57:15.685827+01 | idle_xact   |     791 | 210970088
 2020-03-14 09:57:15.685827+01 | query       |     791 | 210970088
 2020-03-14 10:12:17.488993+01 | active_xact |     243 | 213971139
 2020-03-14 10:12:17.488993+01 | idle_xact   |      61 | 213971321
 2020-03-14 10:27:18.899525+01 | active_xact |     105 | 216931996
 2020-03-14 10:27:18.899525+01 | idle_xact   |     105 | 216931996
 2020-03-14 10:27:18.899525+01 | query       |     109 | 216931992
 2020-03-14 10:42:19.926601+01 | active_xact | 1046073 | 218856552
 2020-03-14 10:42:19.926601+01 | idle_xact   |      14 | 219902611
 2020-03-14 10:42:19.926601+01 | query       |     234 | 219902391
 2020-03-14 10:57:21.023513+01 | active_xact |     707 | 222912301
 2020-03-14 10:57:21.023513+01 | idle_xact   |     707 | 222912301
 2020-03-14 10:57:21.023513+01 | query       |     707 | 222912301
 2020-03-14 11:12:22.85198+01  | active_xact |     302 | 225885014
 2020-03-14 11:12:22.85198+01  | idle_xact   |     302 | 225885014
 2020-03-14 11:12:22.85198+01  | query       |     302 | 225885014

> > The same bench was run against pg11 many times and never triggered this
> > issue. So far our best guess is a side effect of 2aa6e331ead7.
>
> In that case you'd likely see DEBUG1 output, right? Did you try running
> with that?

That's unfortunately not an option: while the issue is reproducible, it only
happens after roughly 16h with an average of 6.5ktps, and we definitely don't
have enough disk space for that amount of logs. We'll try next time with a
recompiled version that has the DEBUG1 promoted to WARNING, and maybe some
other elog calls too.

> > Michael and I have been trying to reproduce this issue locally (drastically
> > reducing the various freeze_age parameters) for hours, but no luck for now.
>
> Have you considered using gdb to investigate?

Not for now, because it's hard to be around for the ~ 90min interval the
system stays alive after autovacuum gets stuck, especially since it's a
weekend bench.

> > This is using a vanilla pg 12.1, with some OLTP workload. The only possibly
> > relevant configuration changes are quite aggressive autovacuum settings on
> > some tables (no delay, analyze/vacuum threshold to 1k and analyze/vacuum
> > scale factor to 0, for both heap and toast).
>
> That, uh, is not an insignificant set of changes for an autovac
> scheduling issue.
>
> It sounds like what might be happening is that you have something
> holding back the "oldest needed transaction" horizon.
> Before 2aa6e331ead7, if the xmin horizon hasn't increased, there'd be
> repeated vacuums unable to increase the horizon, but afterwards they'll all
> immediately exit without any messages.

The xmin horizon is increasing AFAICS. Grepping the autovacuum logs, I can
see for the 08:00 AM UTC - 08:59 AM UTC interval the oldest xmin going from
200069684 to 210363848. E.g.:

[...]
266603:2020-03-14 08:49:59.927 UTC LOG:  automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27683": index scans: 0
266604-	pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
266605-	tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635921
266606-	buffer usage: 25 hits, 1 misses, 1 dirtied
266607-	avg read rate: 15.440 MB/s, avg write rate: 15.440 MB/s
266608-	system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
266609:2020-03-14 08:49:59.929 UTC LOG:  automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27694": index scans: 0
266610-	pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
266611-	tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635934
266612-	buffer usage: 25 hits, 1 misses, 1 dirtied
266613-	avg read rate: 21.058 MB/s, avg write rate: 21.058 MB/s
266614-	system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
266615:2020-03-14 08:49:59.931 UTC LOG:  automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_24845": index scans: 0
266616-	pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
266617-	tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635934
266618-	buffer usage: 25 hits, 1 misses, 2 dirtied
266619-	avg read rate: 8.005 MB/s, avg write rate: 16.009 MB/s
266620-	system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
266621:2020-03-14 08:49:59.945 UTC LOG:  automatic aggressive vacuum to prevent wraparound of table "pg_toast.pg_toast_27712": index scans: 0
266622-	pages: 0 removed, 0 remain, 0 skipped due to pins, 0 skipped frozen
266623-	tuples: 0 removed, 0 remain, 0 are dead but not yet removable, oldest xmin: 209635953
[...]

> I wonder if what might be happening is that we've somehow missed/failed
> to update relfrozenxid and/or datfrozenxid. If you manually vacuum some
> table in the oldest database, but that is *NOT* the oldest table itself,
> does the problem "resolve" itself?

I'll also add a pg_class snapshot for the next time we run the bench. I'm
not sure if we'll be able to catch the 1h45 interval when the system stays
alive after the issue, though.
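As a footnote on the aggressive per-table settings mentioned upthread (threshold 1000, scale factor 0): the documented autovacuum trigger is dead tuples exceeding threshold + scale_factor * reltuples, so zeroing the scale factor makes the trigger independent of table size. A rough sketch of that check (variable names are mine, defaults are the stock GUC values):

```python
# Simplified form of autovacuum's per-table vacuum trigger:
#   n_dead_tup > autovacuum_vacuum_threshold
#                + autovacuum_vacuum_scale_factor * reltuples
# Anti-wraparound vacuums (autovacuum_freeze_max_age) fire independently
# of this test, which is why they still matter here.

def autovacuum_needed(dead_tuples: int, reltuples: float,
                      vac_threshold: int = 50,
                      vac_scale_factor: float = 0.2) -> bool:
    """True when the dead-tuple count exceeds the vacuum threshold."""
    return dead_tuples > vac_threshold + vac_scale_factor * reltuples

# Stock settings: a 1M-row table needs > 200050 dead tuples to qualify.
assert not autovacuum_needed(200_000, 1_000_000)
# The thread's settings (threshold 1k, scale factor 0): any table with
# > 1000 dead tuples qualifies, hence thousands of vacuums per hour.
assert autovacuum_needed(1_001, 1_000_000,
                         vac_threshold=1000, vac_scale_factor=0.0)
```

That explains the very high vacuum frequency in the logs, though not by itself why scheduling stopped entirely.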