On Fri, May 29, 2015 at 11:24 AM, Robert Haas <robertmhaas@xxxxxxxxx> wrote:
> A. Most obviously, we should fix pg_upgrade so that it installs
> chkpnt_oldstMulti instead of chkpnt_nxtmulti into datfrozenxid, so
> that we stop creating new instances of this problem. That won't get
> us out of the hole we've dug for ourselves, but we can at least try to
> stop digging. (This is assuming I'm right that chkpnt_nxtmulti is the
> wrong thing - anyone want to double-check me on that one?)

Yes, it seems like this could lead to truncation of multixacts still
referenced by tuples, leading to errors when updating, locking, or
vacuuming. Why don't we have reports of that?

> B. We need to change find_multixact_start() to fail softly. This is
> important because it's legitimate for it to fail in recovery, as
> discussed upthread, and also because we probably want to eliminate the
> fail-to-start hazard introduced in 9.4.2 and 9.3.7.
> find_multixact_start() is used in three places, and they each require
> separate handling:

Here is an experimental WIP patch that changes StartupMultiXact and
SetMultiXactIdLimit to find the oldest multixact that exists on disk
(by scanning the directory), and to use that if it is more recent than
the oldestMultiXactId from shmem when calling
DetermineSafeOldestOffset. I'm not all that happy with it (see below),
but let me know what you think.

Using unpatched master, I reproduced the startup error with a bit of a
short cut:

1. initdb, generate enough multixacts to get more than one offsets file
2. ALTER DATABASE template0 ALLOW_CONNECTIONS = true; then
   vacuumdb --freeze --all, then CHECKPOINT
3. verify that pg_control now holds a large oldestMultiXactId, and
   note NextMultiXactId
4. shutdown, pg_resetxlog -m (NextMultiXactId from 3),1 pg_data
5. start up: fails

Apply this patch, and it starts up successfully.

What are the repro steps for the replay problem? Is a basebackup of a
large database undergoing truncation and some good timing needed?
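To make the idea in the patch description above concrete, here is a minimal Python model (not the patch itself, and not PostgreSQL code) of "use the oldest multixact found on disk if it is more recent than the value from shmem". The function names are hypothetical. It assumes a default build (8 kB pages, 4-byte offsets, 32 pages per SLRU segment, so 65536 multixacts per offsets segment file) and segment files named with four hex digits; the comparison is done modulo 2^32 so that wraparound is handled.

```python
import re

# Assumption: default build, 8192-byte pages / 4-byte offsets = 2048 per page,
# and 32 pages per SLRU segment file.
MULTIXACTS_PER_SEGMENT = 2048 * 32

# Assumption: offsets segment files are named with exactly four hex digits.
SEGMENT_NAME = re.compile(r"[0-9A-Fa-f]{4}")


def multixact_precedes(a, b):
    """Return True if multixact a is logically older than b, modulo 2**32."""
    return ((a - b) & 0xFFFFFFFF) >= 0x80000000


def earliest_multixact_on_disk(segment_names):
    """Return the first multixact ID covered by the oldest segment file,
    or None if no segment files are present."""
    earliest = None
    for name in segment_names:
        if not SEGMENT_NAME.fullmatch(name):
            continue  # ignore anything that is not a segment file
        first = (int(name, 16) * MULTIXACTS_PER_SEGMENT) & 0xFFFFFFFF
        if earliest is None or multixact_precedes(first, earliest):
            earliest = first
    return earliest


def effective_oldest_multixact(shmem_oldest, segment_names):
    """Pick the more recent of the shmem value and the earliest on disk."""
    on_disk = earliest_multixact_on_disk(segment_names)
    if on_disk is not None and multixact_precedes(shmem_oldest, on_disk):
        return on_disk
    return shmem_oldest
```

In a real scan the names would come from listing pg_multixact/offsets; passing them in as a list keeps the sketch independent of any data directory.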
> - In SetMultiXactIdLimit, find_multixact_start() is used to set
> MultiXactState->oldestOffset, which is used to determine how
> aggressively to vacuum. If find_multixact_start() fails, we don't
> know how aggressively we need to vacuum to prevent members wraparound;
> it's probably best to decide to vacuum as aggressively as possible.
> Of course, if we're in recovery, we won't vacuum either way; the fact
> that it fails softly is good enough.

Isn't it enough to use the start offset of whichever is more recent:
the oldest multixact ID, or the oldest multixact found by scanning
pg_multixact/offsets? The patch does that, but I'm not happy with when
the work is done; it just doesn't seem right for SetMultiXactIdLimit to
be scanning that directory. The result of that operation should only
change when files have been truncated anyway, and the truncate code was
already doing a filesystem scan. Maybe the truncate code should store
the earliest multixact ID found on disk in shared memory, so that
SetMultiXactIdLimit can use it for free. I tried to get that working
but couldn't figure out where it should be initialised --
StartupMultiXact is too late (StartupXLOG calls SetMultiXactIdLimit
before that), but BootstrapMultiXact and MultiXactShmemInit didn't seem
like the right places either.

> - In DetermineSafeOldestOffset, find_multixact_start() is used to set
> MultiXactState->offsetStopLimit. If it fails here, we don't know when
> to refuse multixact creation to prevent wraparound. Again, in
> recovery, that's fine. If it happens in normal running, it's not
> clear what to do. Refusing multixact creation is an awfully blunt
> instrument. Maybe we can scan pg_multixact/offsets to determine a
> workable stop limit: the first file greater than the current file that
> exists, minus two segments, is a good stop point. Perhaps we ought to
> use this mechanism here categorically, not just when
> find_multixact_start() fails. It might be more robust than what we
> have now.

Done in this patch -- the truncate code calls DetermineSafeOldestOffset
with the earliest SLRU segment found by scanning if that's more recent
than the shmem value, and then DetermineSafeOldestOffset applies the
step-back-one-whole-segment logic to that as before.

> - In TruncateMultiXact, find_multixact_start() is used to set the
> truncation point for the members SLRU. If it fails here, I'm guessing
> the right solution is not to truncate anything - instead, rely on
> intense vacuuming to eventually advance oldestMXact to a value whose
> member data still exists; truncate then.

TruncateMultiXact already contained logic to do nothing at all if
oldestMXact is older than the earliest it can find on disk. In this
patch I moved that code into find_earliest_multixact_on_disk() so that
it can be used elsewhere too.

> C. I think we should also change TruncateMultiXact() to truncate
> offsets first, and then members. As things stand, if we truncate
> members first, we increase the risk of seeing an offset that will fail
> when passed to find_multixact_start(), because TruncateMultiXact()
> might get interrupted before it finishes. That seems like an
> unnecessary risk.

I don't see why the order matters. find_multixact_start() doesn't read
the members, only the offsets SLRU (i.e. the index into members, not
the contents of members). As I understand it, the only time we need to
access the members themselves is when we encounter multixacts in tuple
headers (updating, locking or vacuuming). If you have truncated
multixacts referenced in your tuples then you have a different form of
corruption than the pg_upgrade-tramples-on-oldestMultiXactId case we're
trying to handle gracefully here.

-- 
Thomas Munro
http://www.enterprisedb.com
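As an illustration of the step-back-one-whole-segment logic mentioned in the reply about DetermineSafeOldestOffset, the following Python sketch models the arithmetic only. It is a sketch under stated assumptions, not the server's implementation: the function name is hypothetical, and the constant assumes a default 8 kB build in which a members page holds 1636 members and a segment holds 32 pages.

```python
# Assumption: default build, 1636 members per 8 kB page, 32 pages per segment.
MEMBERS_PER_SEGMENT = 1636 * 32


def offset_stop_limit(oldest_offset):
    """Model of the stop limit: round the oldest member offset down to the
    start of its segment, then leave one whole segment before that point.
    Offsets are 32-bit and wrap around, hence the mask."""
    segment_start = oldest_offset - oldest_offset % MEMBERS_PER_SEGMENT
    return (segment_start - MEMBERS_PER_SEGMENT) & 0xFFFFFFFF
```

For example, an oldest offset in the second members segment gives a stop limit of 0, while an oldest offset in the first segment wraps to one segment below 2^32.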
Attachment: tolerate-missing-offset-segments-wip.patch
-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general