Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

Noah Misch <noah@xxxxxxxxxxxx> · Tue, 2 Jun 2015 01:21:21 -0400

On Mon, Jun 01, 2015 at 02:06:05PM -0400, Robert Haas wrote:
> On Mon, Jun 1, 2015 at 12:46 AM, Noah Misch <noah@xxxxxxxxxxxx> wrote:
> > On Fri, May 29, 2015 at 03:08:11PM -0400, Robert Haas wrote:
> >> SetMultiXactIdLimit() bracketed certain parts of its
> >> logic with if (!InRecovery), but those guards were ineffective because
> >> it gets called before InRecovery is set in the first place.
> >
> > SetTransactionIdLimit() checks InRecovery for the same things, and it is
> > called at nearly the same moments as SetMultiXactIdLimit().  Do you have a
> > sense of whether it is subject to similar problems as a result?
> 
> Well, I think it's pretty weird that those things will get done before
> beginning recovery, even on an inconsistent cluster, but not during
> recovery.  That is pretty strange.  I don't see that it can actually
> do any worse than emit a few log messages at the start of recovery
> that won't show up again until the end of recovery, though.

Granted.  Would it be better to update both functions at the same time, and
perhaps to make that a master-only change?  Does the status quo cause more
practical harm via SetMultiXactIdLimit() than via SetTransactionIdLimit()?

> >> 1. Moves the call to DetermineSafeOldestOffset() that appears in
> >> StartupMultiXact() to TrimMultiXact(), so that we don't try to do this
> >> until we're consistent.  Also, instead of passing
> >> MultiXactState->oldestMultiXactId, pass the newer of that value and
> >> the earliest offset that exists on disk.  That way, it won't try to
> >> read data that's not there.
> >
> > Perhaps emit a LOG message when we do that, since it's our last opportunity to
> > point to the potential data loss?
> 
> If the problem is just that somebody minmxid got set to 1 instead of
> the real value, I think that there is no data loss, because none of
> those older values are actually present there.  But we could add a LOG
> message anyway.  How do you suggest that we phrase that?

There's no data loss if 1 <= true_minmxid <= 2^31 at the time minmxid got set
to 1.  Otherwise, data loss is possible.  I don't hope for an actionable
message, but we might want a reporter to grep logs for it when we diagnose
future reports.  Perhaps this:

  "missing pg_multixact/members files; begins at MultiXactId %u, expected %u"

For good measure, perhaps emit this when lastCheckpointedOldest > earliest by
more than one segment:

  "excess pg_multixact/members files; begins at MultiXactId %u, expected %u"

> >> @@ -2859,6 +2947,14 @@ TruncateMultiXact(void)
> >>       SimpleLruTruncate(MultiXactOffsetCtl,
> >>                                         MultiXactIdToOffsetPage(oldestMXact));
> >>
> >> +     /* Update oldest-on-disk value in shared memory. */
> >> +     earliest = range.rangeStart * MULTIXACT_OFFSETS_PER_PAGE;
> >> +     if (earliest < FirstMultiXactId)
> >> +             earliest = FirstMultiXactId;
> >> +     LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
> >> +     Assert(MultiXactState->oldestMultiXactOnDiskValid);
> >> +     MultiXactState->oldestMultiXactOnDiskValid = earliest;
> >
> > That last line needs s/Valid//, I presume.  Is it okay that
> > oldestMultiXactOnDisk becomes too-old during TruncateMultiXact(), despite its
> > Valid indicator remaining true?
> 
> Ay yai yay.  Yes, s/Valid//.
> 
> I'm not sure what you mean about it becoming too old.  At least with
> that fix, it should get updated to exactly the first file that we
> didn't remove.  Isn't that right?

Consider a function raw_GOMXOD() that differs from GetOldestMultiXactOnDisk()
only in that it never reads or writes the cache.  I might expect
oldestMultiXactOnDisk==raw_GOMXOD() if oldestMultiXactOnDiskValid, and that
does hold most of the time.  It does not always hold between the start of the
quoted code's SimpleLruTruncate() and its oldestMultiXactOnDisk assignment.
That's because raw_GOMXOD() changes at the instant we unlink the oldest
segment, but oldestMultiXactOnDisk changes later.  Any backend may call
GetOldestMultiXactOnDisk() via SetMultiXactIdLimit().  If a backend does that
concurrent with the checkpointer running TruncateMultiXact() and sees a stale
oldestMultiXactOnDisk, is that harmless?

> >> +static MultiXactOffset
> >> +GetOldestMultiXactOnDisk(void)
> >> +{
> >
> >> +     SlruScanDirectory(MultiXactOffsetCtl, SlruScanDirCbFindEarliest, &trunc);
> >> +     earliest = trunc.earliestExistingPage * MULTIXACT_OFFSETS_PER_PAGE;
> >> +     if (earliest < FirstMultiXactId)
> >> +             earliest = FirstMultiXactId;
> >
> > SlruScanDirCbFindEarliest() is only meaningful if the files on disk do not
> > represent a wrapped state.  When the files do represent a wrapped state,
> > MultiXactIdPrecedes() is not transitive, and the SlruScanDirCbFindEarliest()
> > result is sensitive to readdir() order.  This code exists specifically to help
> > us make do in the face of wrong catalog and pg_control entries.  We may have
> > wrapped as a result of those of same catalog/pg_control entries, so I think
> > this function ought to account for wrap.  I haven't given enough thought to
> > what exactly it should do.
> 
> I can see that there might be an issue there, but I can't quite put my
> finger on it well enough to say that it definitely is an issue.  This
> code is examining the offsets space rather than the members space, and
> the protections against offsets wraparound have been there since the
> original commit of this feature
> (0ac5ad5134f2769ccbaefec73844f8504c4d6182).  To my knowledge we have
> no concrete evidence that there's ever been a problem in this area.
> It seems like it might be safer to rejigger that code so that it
> considers distance-behind-current rather than using the wrapped
> comparison logic, but I'm reluctant to start rejiggering more things
> without knowing what specifically I'm fixing.

Anything that could cause the pg_multixact/offsets tail to rotate from being
in the past to being in the future poses this risk.  (This is the tail from
the perspective of files on disk; pg_control, datminmxid, and MultiXactState
notions of the tail play no part here.)  I had in mind that the pg_upgrade
datminmxid=1 bug could be a tool for achieving that, but I've been
unsuccessful so far at building a credible thought experiment around it.  Near
the beginning of your reply, you surmised that this could happen between a
VACUUM's SetMultiXactIdLimit() and the next checkpoint's TruncateMultiXact().
Another vector is unlink() failure on a segment file.  SlruDeleteSegment()
currently ignores the unlink() return value; the only harm has been some disk
usage.  With GetOldestMultiXactOnDisk() as-proposed, successful unlink() is
mandatory to forestall the appearance of a wrapped state.

Thanks,
nm

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general