Kouber Saparev wrote:

> > The state of your data is probably caused by some weird corner case of
> > the upgrade.  Can you see in the log files that the toast table has been
> > failing vacuum since the upgrade, or is it more recent than that?  (In
> > other words, is there any working vacuum after the upgrade?)
>
> We upgraded to 9.4.5 on 19 October, and there was a successful automatic
> vacuum over pg_toast_376621 just 3 days later, on 22 October:
>
> Oct 22 08:16:49 db-master postgres[10589]: [3-1] []: LOG:  automatic vacuum of table "db.pg_toast.pg_toast_376621": index scans: 1
> Oct 22 08:16:49 db-master postgres[10589]: [3-2]     pages: 0 removed, 784361 remain
> Oct 22 08:16:49 db-master postgres[10589]: [3-3]     tuples: 110 removed, 3768496 remain, 0 are dead but not yet removable
> Oct 22 08:16:49 db-master postgres[10589]: [3-4]     buffer usage: 37193 hits, 44891 misses, 32311 dirtied
> Oct 22 08:16:49 db-master postgres[10589]: [3-5]     avg read rate: 0.954 MB/s, avg write rate: 0.686 MB/s
> Oct 22 08:16:49 db-master postgres[10589]: [3-6]     system usage: CPU 1.10s/1.67u sec elapsed 367.73 sec
>
> The next automatic vacuum came 8 days later, on 30 October. It failed, and
> it has been failing ever since:
>
> Oct 30 14:22:01 db-master postgres[16160]: [3-1] []: ERROR:  MultiXactId 2915905228 does no longer exist -- apparent wraparound
> Oct 30 14:22:01 db-master postgres[16160]: [3-2] []: CONTEXT:  automatic vacuum of table "db.pg_toast.pg_toast_376621"
>
> So I guess something happened between 22 and 30 October, and there is no
> relation to the pg_upgrade we did on 19 October.

OK, so it wasn't pg_upgrade.  Good to know.

> > To debug this, it would be useful to attach gdb to a backend, set a
> > breakpoint on errfinish, and then run vacuum on that table.  You can then
> > extract the page number from the backtrace.  With the page number we can
> > try pageinspect and heap_page_items until we find the culprit and
> > perhaps identify how it got into that state.
>
> I will try to obtain the page number, and will then send you the results,
> thank you.
>
> Can we somehow do it on one of our replicas (after detaching it), i.e. is
> the corrupted record propagated through the replication channel, and in
> the meantime fix the table on the master?

I suppose the corrupted data is present on the replicas as well, but I have
no way to be sure.  Also, I don't know how to get the data back to a valid
state, other than dropping the table and reloading it.  I suppose you can
take a complete pg_dump of the data in that table?

If you want to reset it to a known-good state, you could just save the
pg_toast file aside (and its corresponding main table file, just in case).
Doing the forensics on a copy is better practice anyway, you know.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
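
A minimal sketch of the pageinspect step described above, assuming the block
number reported in the gdb backtrace turns out to be, say, 1234567 (that value
is only a placeholder; the pageinspect extension must be installed first):

    -- pageinspect ships as a contrib extension
    CREATE EXTENSION IF NOT EXISTS pageinspect;

    -- Dump the line pointers and tuple headers of the suspect block of the
    -- toast table.  A tuple whose t_infomask has the HEAP_XMAX_IS_MULTI bit
    -- (0x1000) set and whose t_xmax equals the failing MultiXactId
    -- (2915905228) is the likely culprit.
    SELECT lp, lp_len, t_xmin, t_xmax, t_infomask, t_infomask2
    FROM heap_page_items(get_raw_page('pg_toast.pg_toast_376621', 1234567));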
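
Likewise, if it comes to saving the underlying files aside, their paths
(relative to the data directory) can be looked up beforehand; the index name
below assumes the usual pg_toast_<oid>_index naming convention:

    -- Locate the on-disk files of the toast table and its index so they can
    -- be copied aside while the server is shut down.
    SELECT pg_relation_filepath('pg_toast.pg_toast_376621');
    SELECT pg_relation_filepath('pg_toast.pg_toast_376621_index');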