On Mon, Jun 01, 2015 at 02:06:05PM -0400, Robert Haas wrote: > On Mon, Jun 1, 2015 at 12:46 AM, Noah Misch <noah@xxxxxxxxxxxx> wrote: > > On Fri, May 29, 2015 at 03:08:11PM -0400, Robert Haas wrote: > >> SetMultiXactIdLimit() bracketed certain parts of its > >> logic with if (!InRecovery), but those guards were ineffective because > >> it gets called before InRecovery is set in the first place. > > > > SetTransactionIdLimit() checks InRecovery for the same things, and it is > > called at nearly the same moments as SetMultiXactIdLimit(). Do you have a > > sense of whether it is subject to similar problems as a result? > > Well, I think it's pretty weird that those things will get done before > beginning recovery, even on an inconsistent cluster, but not during > recovery. That is pretty strange. I don't see that it can actually > do any worse than emit a few log messages at the start of recovery > that won't show up again until the end of recovery, though. Granted. Would it be better to update both functions at the same time, and perhaps to make that a master-only change? Does the status quo cause more practical harm via SetMultiXactIdLimit() than via SetTransactionIdLimit()? > >> 1. Moves the call to DetermineSafeOldestOffset() that appears in > >> StartupMultiXact() to TrimMultiXact(), so that we don't try to do this > >> until we're consistent. Also, instead of passing > >> MultiXactState->oldestMultiXactId, pass the newer of that value and > >> the earliest offset that exists on disk. That way, it won't try to > >> read data that's not there. > > > > Perhaps emit a LOG message when we do that, since it's our last opportunity to > > point to the potential data loss? > > If the problem is just that somebody minmxid got set to 1 instead of > the real value, I think that there is no data loss, because none of > those older values are actually present there. But we could add a LOG > message anyway. How do you suggest that we phrase that? There's no data loss if 1 <= true_minmxid <= 2^31 at the time minmxid got set to 1. Otherwise, data loss is possible. I don't hope for an actionable message, but we might want a reporter to grep logs for it when we diagnose future reports. Perhaps this: "missing pg_multixact/members files; begins at MultiXactId %u, expected %u" For good measure, perhaps emit this when lastCheckpointedOldest > earliest by more than one segment: "excess pg_multixact/members files; begins at MultiXactId %u, expected %u" > >> @@ -2859,6 +2947,14 @@ TruncateMultiXact(void) > >> SimpleLruTruncate(MultiXactOffsetCtl, > >> MultiXactIdToOffsetPage(oldestMXact)); > >> > >> + /* Update oldest-on-disk value in shared memory. */ > >> + earliest = range.rangeStart * MULTIXACT_OFFSETS_PER_PAGE; > >> + if (earliest < FirstMultiXactId) > >> + earliest = FirstMultiXactId; > >> + LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE); > >> + Assert(MultiXactState->oldestMultiXactOnDiskValid); > >> + MultiXactState->oldestMultiXactOnDiskValid = earliest; > > > > That last line needs s/Valid//, I presume. Is it okay that > > oldestMultiXactOnDisk becomes too-old during TruncateMultiXact(), despite its > > Valid indicator remaining true? > > Ay yai yay. Yes, s/Valid//. > > I'm not sure what you mean about it becoming too old. At least with > that fix, it should get updated to exactly the first file that we > didn't remove. Isn't that right? Consider a function raw_GOMXOD() that differs from GetOldestMultiXactOnDisk() only in that it never reads or writes the cache. I might expect oldestMultiXactOnDisk==raw_GOMXOD() if oldestMultiXactOnDiskValid, and that does hold most of the time. It does not always hold between the start of the quoted code's SimpleLruTruncate() and its oldestMultiXactOnDisk assignment. That's because raw_GOMXOD() changes at the instant we unlink the oldest segment, but oldestMultiXactOnDisk changes later. Any backend may call GetOldestMultiXactOnDisk() via SetMultiXactIdLimit(). If a backend does that concurrent with the checkpointer running TruncateMultiXact() and sees a stale oldestMultiXactOnDisk, is that harmless? > >> +static MultiXactOffset > >> +GetOldestMultiXactOnDisk(void) > >> +{ > > > >> + SlruScanDirectory(MultiXactOffsetCtl, SlruScanDirCbFindEarliest, &trunc); > >> + earliest = trunc.earliestExistingPage * MULTIXACT_OFFSETS_PER_PAGE; > >> + if (earliest < FirstMultiXactId) > >> + earliest = FirstMultiXactId; > > > > SlruScanDirCbFindEarliest() is only meaningful if the files on disk do not > > represent a wrapped state. When the files do represent a wrapped state, > > MultiXactIdPrecedes() is not transitive, and the SlruScanDirCbFindEarliest() > > result is sensitive to readdir() order. This code exists specifically to help > > us make do in the face of wrong catalog and pg_control entries. We may have > > wrapped as a result of those of same catalog/pg_control entries, so I think > > this function ought to account for wrap. I haven't given enough thought to > > what exactly it should do. > > I can see that there might be an issue there, but I can't quite put my > finger on it well enough to say that it definitely is an issue. This > code is examining the offsets space rather than the members space, and > the protections against offsets wraparound have been there since the > original commit of this feature > (0ac5ad5134f2769ccbaefec73844f8504c4d6182). To my knowledge we have > no concrete evidence that there's ever been a problem in this area. > It seems like it might be safer to rejigger that code so that it > considers distance-behind-current rather than using the wrapped > comparison logic, but I'm reluctant to start rejiggering more things > without knowing what specifically I'm fixing. Anything that could cause the pg_multixact/offsets tail to rotate from being in the past to being in the future poses this risk. (This is the tail from the perspective of files on disk; pg_control, datminmxid, and MultiXactState notions of the tail play no part here.) I had in mind that the pg_upgrade datminmxid=1 bug could be a tool for achieving that, but I've been unsuccessful so far at building a credible thought experiment around it. Near the beginning of your reply, you surmised that this could happen between a VACUUM's SetMultiXactIdLimit() and the next checkpoint's TruncateMultiXact(). Another vector is unlink() failure on a segment file. SlruDeleteSegment() currently ignores the unlink() return value; the only harm has been some disk usage. With GetOldestMultiXactOnDisk() as-proposed, successful unlink() is mandatory to forestall the appearance of a wrapped state. Thanks, nm -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general