We upgraded to 9.4.5 on 19 October, and there was a successful automatic vacuum over pg_toast_376621 just 3 days later, on 22 October:

    Oct 22 08:16:49 db-master postgres[10589]: [3-1] []: LOG: automatic vacuum of table "db.pg_toast.pg_toast_376621": index scans: 1
    Oct 22 08:16:49 db-master postgres[10589]: [3-2] pages: 0 removed, 784361 remain
    Oct 22 08:16:49 db-master postgres[10589]: [3-3] tuples: 110 removed, 3768496 remain, 0 are dead but not yet removable
    Oct 22 08:16:49 db-master postgres[10589]: [3-4] buffer usage: 37193 hits, 44891 misses, 32311 dirtied
    Oct 22 08:16:49 db-master postgres[10589]: [3-5] avg read rate: 0.954 MB/s, avg write rate: 0.686 MB/s
    Oct 22 08:16:49 db-master postgres[10589]: [3-6] system usage: CPU 1.10s/1.67u sec elapsed 367.73 sec

The next automatic vacuum came 8 days later, on 30 October. It failed, and it has been failing ever since:

    Oct 30 14:22:01 db-master postgres[16160]: [3-1] []: ERROR: MultiXactId 2915905228 does no longer exist -- apparent wraparound
    Oct 30 14:22:01 db-master postgres[16160]: [3-2] []: CONTEXT: automatic vacuum of table "db.pg_toast.pg_toast_376621"

So I guess something happened between 22 and 30 October, and there is no relation to the pg_upgrade we did on 19 October.

> It would be useful to debug this that you attached gdb to a backend, set [...]

I will try to obtain the page number and will then send you the results, thank you.

Could we do this on one of our replicas instead (after detaching it)? That is, is the corrupted record propagated through the replication channel, so that we could debug it there and, in the meantime, fix the table on the master?

Thanks!

--
Kouber Saparev
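
P.S. For completeness, this is the kind of check I intend to run on both the master and the detached replica to compare the multixact horizons of the affected relations against the failing MultiXactId. It is just a sketch; pg_toast_376621 and the database come from the log above, and 376621 is the owning table's OID implied by the toast table's name:

    -- Freeze/multixact horizons of the toast table and its owning table.
    SELECT oid::regclass AS relation, relfrozenxid, relminmxid
    FROM pg_class
    WHERE relname = 'pg_toast_376621'
       OR oid = 376621;

    -- Per-database horizons, for comparison with MultiXactId 2915905228.
    SELECT datname, datfrozenxid, datminmxid
    FROM pg_database;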