On 2015-06-01 14:22:32 -0400, Robert Haas wrote:
> On Mon, Jun 1, 2015 at 4:58 AM, Andres Freund <andres@xxxxxxxxxxx> wrote:
> > The lack of WAL logging actually has caused problems in the 9.3.3 (?)
> > era, where we didn't do any truncation during recovery...
>
> Right, but now we're piggybacking on the checkpoint records, and I
> don't have any evidence that this approach can't be made robust. It's
> possible that it can't be made robust, but that's not currently clear.

Well, it's possible that it can be made to work without problems. But I
think robust is something different. Having to look at slrus, during
recovery, to find out what to truncate puts more intelligence/complexity
in the recovery path than I'm comfortable with.

> >> By the time we've reached the minimum recovery point, they will have
> >> been recreated by the same WAL records that created them in the first
> >> place.
> >
> > I'm not sure that's true. I think we could end up erroneously removing
> > files that were included in the base backup. Anyway, let's focus on your
> > patch for now.
>
> OK, but I am interested in discussing the other thing too. I just
> can't piece together the scenario myself - there may well be one. The
> base backup will begin replay from the checkpoint caused by
> pg_start_backup() and remove anything that wasn't there at the start
> of the backup. But all of that stuff should get recreated by the time
> we reach the minimum recovery point (end of backup).

I'm not sure if it's reproducibly borked. What about this scenario:
1) pg_start_backup() is called, creates a checkpoint.
2) 2**31 multixacts are created, possibly with several checkpoints
   in between
3) pg_multixact is copied
4) basebackup finishes

Unless I'm missing something this will lead to a crash recovery startup
where the first call to TruncateMultiXact() will have
MultiXactState->lastCheckpointedOldest wildly inconsistent with
GetOldestMultiXactOnDisk()'s return value.
Possibly leading to truncation being skipped erroneously. Whether
that's a problem I'm not yet entirely sure.

But what *definitely* looks wrong to me is that a TruncateMultiXact() in
this scenario now (since a couple weeks ago) does a
SimpleLruReadPage_ReadOnly() in the members slru via
find_multixact_start(). That just won't work acceptably when we're not
yet consistent. There very well might not be a valid members segment at
that point. Am I missing something?

> > I'm more worried about the cases where we didn't ever actually "badly
> > wrap around" (i.e. overwrite needed data); but where that's not clear on
> > the standby because the base backup isn't in a consistent state.
>
> I agree. The current patch tries to make it so that we never call
> find_multixact_start() while in recovery, but it doesn't quite
> succeed: the call in TruncateMultiXact still happens during recovery,
> but only once we're sure that the mxact we plan to call it on actually
> exists on disk. That won't be called until we replay the first
> checkpoint, but that might still be prior to consistency.

It'll pretty much *always* be before we reach consistency, right? It'll
be called on the checkpoint created by pg_start_backup()?

I don't think the presence check (that's GetOldestMultiXactOnDisk() in
TruncateMultiXact(), right?) helps very much. There's no guarantee at all
that offsets and members are in any way consistent with each other. Or
in themselves for that matter; the copy could very well have been taken
in the middle of a write of the slru page.

I think at the very least we'll have to skip this step while not yet
consistent. That really sucks, because we'll possibly end up with
multixacts that are completely filled by the time we've reached
consistency.

Greetings,

Andres Freund

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general