On Sat, May 30, 2015 at 1:46 PM, Andres Freund <andres@xxxxxxxxxxx> wrote:
> On 2015-05-29 15:08:11 -0400, Robert Haas wrote:
>> It seems pretty clear that we can't effectively determine anything
>> about member wraparound until the cluster is consistent.
>
> I wonder if this doesn't actually hint at a bigger problem. Currently,
> to determine where we need to truncate, SlruScanDirectory() is used.
> That, afaics, could actually be a problem during recovery when we're
> not consistent.
>
> Consider the scenario where a very large database is copied while
> running. At the start of the backup we'll determine at which checkpoint
> recovery will start and store it in the label. After that the copy will
> start, copying everything slowly. That works because we expect recovery
> to fix things up. The problem I see WRT multixacts is that the copied
> state of pg_multixact could be wildly different from the one at the
> label's checkpoint. During recovery, before reaching the first
> checkpoint, we'll create multixact files as used at the time of the
> checkpoint. But the rest of pg_multixact may be ahead by 2**31 xacts.

Yes, I think the code in TruncateMultiXact that scans for the earliest
multixact only works when the segment files span at most 2^31 of multixact
space.  If they span more than that, MultiXactIdPrecedes is no longer able
to provide a total ordering, so the result of the scan may be wrong,
depending on the order in which it encounters the files (see the small
standalone sketch after the steps below for why the pairwise comparison
stops being transitive).

Incidentally, your description of that scenario gave me an idea for how to
reproduce a base backup that 9.4.2 or master can't start.  I tried this
first:

1. Set up with max_wal_senders = 1, wal_level = hot_standby, initdb

2. Create enough multixacts to fill a couple of segments in
   pg_multixact/offsets using "explode_mxact_members 99 1000" (create foo
   table first)

3. Start a base backup with logs, but break in
   src/backend/replication/basebackup.c after
   sendFileWithContent(BACKUP_LABEL_FILE, labelfile); and before sending
   the contents of the data dir (including pg_multixact)... (or just put a
   big sleep in there)

4. UPDATE pg_database SET datallowconn = true; vacuumdb --freeze --all;
   CHECKPOINT; then see that offsets/0000 is now gone and oldestMultiXid
   is 98001 in pg_control

5. ... allow the server backend to continue; the basebackup completes.

Inspecting the new data directory, I see that offsets/0000 is not present,
as expected, and pg_control contains the oldestMultiXid 98001.
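To make the total-ordering point concrete, here is a tiny standalone
program (my own sketch, not the backend code -- the real test is
MultiXactIdPrecedes() in multixact.c, which compares modulo 2^32) showing
that the comparison stops being transitive once the IDs span more than
2^31, so a directory scan that keeps the "earliest" segment seen so far
can give different answers depending on the order it visits the files:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdbool.h>

    typedef uint32_t MultiXactId;

    /*
     * Same shape as MultiXactIdPrecedes(): "a precedes b" iff b is less
     * than 2^31 ahead of a in modulo-2^32 arithmetic.
     */
    static bool
    precedes(MultiXactId a, MultiXactId b)
    {
        return (int32_t) (a - b) < 0;
    }

    int
    main(void)
    {
        /* Three multixact IDs spanning more than 2^31 of the space. */
        MultiXactId x = 100;
        MultiXactId y = x + UINT32_C(0x7FFFFFFF);  /* almost 2^31 ahead of x */
        MultiXactId z = y + 100;                   /* a bit further along    */

        printf("precedes(x, y) = %d\n", precedes(x, y));  /* 1 */
        printf("precedes(y, z) = %d\n", precedes(y, z));  /* 1 */
        printf("precedes(x, z) = %d\n", precedes(x, z));  /* 0 -- not transitive */
        return 0;
    }

With a set of files like that there is no well-defined "earliest" segment
at all, which is the state your scenario can leave pg_multixact in.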
Since pg_control was copied after pg_multixact and my database didn't move
between those copies, it points to a valid multixact (unlike the pg_upgrade
scenario) and is able to start up, but it does something different again,
which may or may not be good, I'm not sure:

LOG: database system was interrupted; last known up at 2015-05-30 14:30:23 NZST
LOG: file "pg_multixact/offsets/0000" doesn't exist, reading as zeroes
LOG: redo starts at 0/7000028
LOG: consistent recovery state reached at 0/70C8898
LOG: redo done at 0/70C8898
LOG: last completed transaction was at log time 2015-05-30 14:30:17.261436+12
LOG: database system is ready to accept connections

My next theory about how to get a FATAL during startup is something like
this:  Break in basebackup.c in between copying pg_multixact and copying
pg_control (simulating a very large/slow file copy, perhaps if 'base'
happens to get copied after 'pg_multixact', though I don't know if that's
possible), and while it's stopped, generate some offsets segments, run
vacuumdb --freeze --all, checkpoint, then create a few more multixacts,
then checkpoint again (so that oldestMultiXact is not equal to
nextMultiXact).  Continue.  Now pg_control's oldestMultiXactId points at a
segment file that didn't exist when pg_multixact was copied.

I haven't managed to get this to work (i.e. produce a FATAL) and I'm out of
time for a little while, but wanted to share this idea in case it helps
someone.

--
Thomas Munro
http://www.enterprisedb.com
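PS: For anyone following along, here is a little standalone sketch (mine,
assuming the default BLCKSZ of 8192 and 4-byte offsets entries; the real
arithmetic is MultiXactIdToOffsetPage() and friends in multixact.c plus the
SLRU segment naming in slru.c) of how a MultiXactId maps to a
pg_multixact/offsets segment file.  It shows why oldestMultiXid 98001 lives
in offsets/0001, so offsets/0000 can be truncated away, and more generally
how an advanced oldestMultiXactId in pg_control can end up pointing at a
segment file that wasn't there when pg_multixact was copied:

    #include <stdio.h>
    #include <stdint.h>

    /*
     * Assumed defaults: BLCKSZ = 8192 and 4-byte MultiXactOffset entries,
     * giving 2048 offsets per page, and 32 SLRU pages per segment file.
     */
    #define MULTIXACT_OFFSETS_PER_PAGE 2048u  /* BLCKSZ / sizeof(MultiXactOffset) */
    #define SLRU_PAGES_PER_SEGMENT     32u

    int
    main(void)
    {
        uint32_t ids[] = {1, 65535, 65536, 98001, 200000};

        for (int i = 0; i < 5; i++)
        {
            uint32_t page = ids[i] / MULTIXACT_OFFSETS_PER_PAGE;
            uint32_t seg  = page / SLRU_PAGES_PER_SEGMENT;

            /* SLRU segment files are named with (at least) four hex digits. */
            printf("multixact %6u lives in pg_multixact/offsets/%04X\n",
                   (unsigned) ids[i], (unsigned) seg);
        }
        return 0;
    }

So each offsets segment covers 65536 multixacts, and anything that pushes
oldestMultiXid past a 64k boundary (like the vacuumdb --freeze above) lets
the preceding segment be unlinked.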