Re: [HACKERS] Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

On Fri, May 29, 2015 at 12:43 PM, Robert Haas <robertmhaas@xxxxxxxxx> wrote:
> Working on that now.

OK, here's a patch.  Actually two patches, differing only in
whitespace, for 9.3 and for master (ha!).  I now think that the root
of the problem here is that DetermineSafeOldestOffset() and
SetMultiXactIdLimit() were largely ignorant of the possibility that
they might be called at points in time when the cluster was
inconsistent.  SetMultiXactIdLimit() bracketed certain parts of its
logic with if (!InRecovery), but those guards were ineffective because
it gets called before InRecovery is set in the first place.
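
To make the ordering problem concrete, here is a toy model in C.  It is
only a sketch: the Phase enum, the ToyFlags struct and both guard
functions are hypothetical names invented for illustration and are not
part of PostgreSQL.  It assumes just what is described above, namely
that the "in recovery" flag is still false when the function is first
called.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical startup phases; the real startup sequence has more steps. */
typedef enum { EARLY_STARTUP, REPLAYING_WAL, NORMAL_RUNNING } Phase;

/* Toy stand-ins for the two flags discussed above (not the real globals). */
typedef struct
{
	bool		in_recovery;	/* becomes true only once replay has begun */
	bool		did_trim;		/* becomes true only after end-of-startup work */
} ToyFlags;

/* Flag values as they would look in each phase of this toy model. */
static ToyFlags
flags_for_phase(Phase phase)
{
	ToyFlags	f = {false, false};

	assert(phase >= EARLY_STARTUP && phase <= NORMAL_RUNNING);
	if (phase == REPLAYING_WAL)
		f.in_recovery = true;
	else if (phase == NORMAL_RUNNING)
		f.did_trim = true;
	return f;
}

/* Old-style guard: it also passes before in_recovery has ever been set. */
static bool
old_guard_allows(ToyFlags f)
{
	return !f.in_recovery;
}

/* New-style guard: it passes only once initialization is known to be done. */
static bool
new_guard_allows(ToyFlags f)
{
	return f.did_trim;
}
```

In this model the old guard lets the work through during EARLY_STARTUP,
which is exactly the window where the cluster may be inconsistent, while
the new guard holds it back until NORMAL_RUNNING.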

It seems pretty clear that we can't effectively determine anything
about member wraparound until the cluster is consistent.  Before then,
there might be files missing from the offsets or members SLRUs which
get put back during replay.  There could even be gaps in the sequence
of files, with some things having made it to disk before the crash (or
having made it into the backup) and others not.  So all the work of
determining what the safe stop points and vacuum thresholds for
members are needs to be postponed until TrimMultiXact() time.  And
that's fine, because we don't need this information in recovery anyway
- it only affects behavior in normal running.
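
As a rough illustration of the "gaps" case, the following C sketch
checks a list of segment numbers for the first one missing from a
range.  Everything here is an assumption made for the example: the
function names are invented, segments are plain ints, and wraparound of
segment numbers is ignored.  It is not how the SLRU code scans its
directory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Is segment "segno" among the n segment numbers found on disk? */
static bool
segment_present(const int *present, size_t n, int segno)
{
	for (size_t i = 0; i < n; i++)
	{
		if (present[i] == segno)
			return true;
	}
	return false;
}

/*
 * Return the first segment number in [lo, hi] that is absent from the
 * list, or -1 if every segment in the range is present.
 */
static int
first_missing_segment(const int *present, size_t n, int lo, int hi)
{
	assert(lo <= hi);
	for (int segno = lo; segno <= hi; segno++)
	{
		if (!segment_present(present, n, segno))
			return segno;
	}
	return -1;
}
```

With segments 3, 4, 6 and 7 on disk, the range 3..7 has a hole at 5, so
anything that needs to read segment 5 before replay restores it would
fail.  Once replay has put the file back, the same check finds no hole.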

So this patch does the following:

1. Moves the call to DetermineSafeOldestOffset() that appears in
StartupMultiXact() to TrimMultiXact(), so that we don't try to do this
until we're consistent.  Also, instead of passing
MultiXactState->oldestMultiXactId, pass the newer of that value and
the earliest offset that exists on disk.  That way, it won't try to
read data that's not there.  Note that the second call to
DetermineSafeOldestOffset() in TruncateMultiXact() doesn't need a
similar guard, because we already bail out of that function early if
the multixacts we're going to truncate away don't exist.

2. Adds a new flag MultiXactState->didTrimMultiXact to indicate whether
we've finished TrimMultiXact(), and arranges for SetMultiXactIdLimit()
to use that rather than InRecovery to test whether it's safe to do
complicated things that might require that the cluster is consistent.
This is a slight behavior change, since formerly we would have tried
to do that stuff very early in the startup process, and now it won't
happen until somebody completes a vacuum operation.  If that's a
problem, we could consider doing it in TrimMultiXact(), but I don't
think it's safe the way it was.  The new flag also prevents
oldestOffset from being set while in recovery; I think it would be
safe to do that in recovery once we've reached consistency, but I
don't believe it's necessary.

3. Arranges for TrimMultiXact() to set oldestOffset.  This is
necessary because, previously, we relied on SetMultiXactIdLimit doing
that during early startup or during recovery, and that's no longer
true.  Here too we set oldestOffset keeping in mind that our notion of
the oldest multixact may point to something that doesn't exist; if so,
we use the oldest MXID that does.

4. Modifies TruncateMultiXact() so that it doesn't re-scan the SLRU
directory on every call to find the oldest file that exists.  Instead,
it arranges to remember the value from the first scan and then updates
it thereafter to reflect its own truncation activity.  This isn't
absolutely necessary, but because this oldest-file logic is used in
multiple places (TrimMultiXact, SetMultiXactIdLimit, and
TruncateMultiXact all need it directly or indirectly) caching the
value seems like a better idea than recomputing it frequently.
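
The clamping in points 1 and 3 and the caching in point 4 can be
sketched in isolation roughly as follows.  This is a simplified
single-process model, not the patch itself: ToyState, toy_scan_directory
and the other function names are hypothetical, there is no locking, and
the directory scan is faked with a fixed value.  The wraparound-aware
comparison is the usual signed-difference test on 32-bit IDs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t MultiXactId;	/* 32-bit ID that wraps around */

#define FIRST_MULTIXACT_ID 1u	/* assumed lowest valid ID in this model */

/* Toy cache of the oldest multixact whose offsets file exists on disk. */
typedef struct
{
	MultiXactId oldest_on_disk;
	bool		oldest_on_disk_valid;
	int			scans;			/* how many times we "scanned the directory" */
} ToyState;

/* Stand-in for a directory scan; the result is made up for the example. */
static MultiXactId toy_earliest_on_disk = 4096;

static MultiXactId
toy_scan_directory(void)
{
	return toy_earliest_on_disk;
}

/* Wraparound-aware "a is older than b". */
static bool
mxid_precedes(MultiXactId a, MultiXactId b)
{
	return (int32_t) (a - b) < 0;
}

/* Scan on first use only; later calls return the remembered value. */
static MultiXactId
get_oldest_on_disk(ToyState *st)
{
	if (!st->oldest_on_disk_valid)
	{
		MultiXactId earliest = toy_scan_directory();

		if (earliest < FIRST_MULTIXACT_ID)
			earliest = FIRST_MULTIXACT_ID;
		st->oldest_on_disk = earliest;
		st->oldest_on_disk_valid = true;
		st->scans++;
	}
	return st->oldest_on_disk;
}

/* After a truncation, update the cached value (not the validity flag). */
static void
note_truncation(ToyState *st, MultiXactId new_earliest)
{
	assert(st->oldest_on_disk_valid);
	st->oldest_on_disk = new_earliest;
}

/* If the claimed oldest ID is older than anything on disk, use the disk value. */
static MultiXactId
clamp_to_disk(MultiXactId claimed_oldest, MultiXactId earliest)
{
	return mxid_precedes(claimed_oldest, earliest) ? earliest : claimed_oldest;
}
```

In this model a bogus claimed-oldest value of 1 is replaced by the
earliest ID on disk, a claimed value newer than the earliest ID is left
alone, and repeated calls to get_oldest_on_disk() scan only once.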

I have tested that this patch fixes Steve Kehlet's problem, or at
least what I believe to be Steve Kehlet's problem based on the
reproduction scenario I described upthread.  I believe it will also
fix the problems with starting up from a base backup that Alvaro
mentioned upthread.  It won't fix the fact that pg_upgrade is putting
a wrong value into everybody's datminmxid field, which should really
be addressed too, but I've been working on this for about three days
virtually non-stop and I don't have the energy to tackle it right now.
If anyone feels the urge to step into that breach, I think what it
needs to do is: when upgrading from a 9.3-or-later instance, copy over
each database's datminmxid into the corresponding database in the new
cluster.

Aside from that, it's very possible that despite my best efforts this
has serious bugs.  Review and testing would be very much appreciated.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 699497c..8d28a5c 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -197,8 +197,9 @@ typedef struct MultiXactStateData
 	MultiXactOffset nextOffset;
 
 	/*
-	 * Oldest multixact that is still on disk.  Anything older than this
-	 * should not be consulted.  These values are updated by vacuum.
+	 * Oldest multixact that may still be referenced from a relation.
+	 * Anything older than this should not be consulted.  These values are
+	 * updated by vacuum.
 	 */
 	MultiXactId oldestMultiXactId;
 	Oid			oldestMultiXactDB;
@@ -211,6 +212,18 @@ typedef struct MultiXactStateData
 	 */
 	MultiXactId lastCheckpointedOldest;
 
+	/*
+	 * This is the oldest file that actually exists on the disk.  This value
+	 * is initialized by scanning pg_multixact/offsets, and subsequently
+	 * updated each time we complete a truncation.  We need a flag to
+	 * indicate whether this has been initialized yet.
+	 */
+	MultiXactId oldestMultiXactOnDisk;
+	bool		oldestMultiXactOnDiskValid;
+
+	/* Has TrimMultiXact been called yet? */
+	bool		didTrimMultiXact;
+
 	/* support for anti-wraparound measures */
 	MultiXactId multiVacLimit;
 	MultiXactId multiWarnLimit;
@@ -342,6 +355,8 @@ static char *mxstatus_to_string(MultiXactStatus status);
 /* management of SLRU infrastructure */
 static int	ZeroMultiXactOffsetPage(int pageno, bool writeXlog);
 static int	ZeroMultiXactMemberPage(int pageno, bool writeXlog);
+static MultiXactId GetOldestMultiXactOnDisk(void);
+static MultiXactOffset GetOldestReferencedOffset(MultiXactId oldestMXact);
 static bool MultiXactOffsetPagePrecedes(int page1, int page2);
 static bool MultiXactMemberPagePrecedes(int page1, int page2);
 static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
@@ -1975,12 +1990,6 @@ StartupMultiXact(void)
 	 */
 	pageno = MXOffsetToMemberPage(offset);
 	MultiXactMemberCtl->shared->latest_page_number = pageno;
-
-	/*
-	 * compute the oldest member we need to keep around to avoid old member
-	 * data overrun.
-	 */
-	DetermineSafeOldestOffset(MultiXactState->oldestMultiXactId);
 }
 
 /*
@@ -1997,7 +2006,10 @@ TrimMultiXact(void)
 	int			pageno;
 	int			entryno;
 	int			flagsoff;
-
+	MultiXactId lastCheckpointedOldest;
+	MultiXactId oldestMultiXactId;
+	MultiXactId earliest;
+	MultiXactOffset oldestOffset;
 
 	/* Clean up offsets state */
 	LWLockAcquire(MultiXactOffsetControlLock, LW_EXCLUSIVE);
@@ -2066,6 +2078,51 @@ TrimMultiXact(void)
 	}
 
 	LWLockRelease(MultiXactMemberControlLock);
+
+	/*
+	 * Read values from shared memory so that we can establish member
+	 * wraparound defenses.
+	 */
+	LWLockAcquire(MultiXactGenLock, LW_SHARED);
+	lastCheckpointedOldest = MultiXactState->lastCheckpointedOldest;
+	oldestMultiXactId = MultiXactState->oldestMultiXactId;
+	LWLockRelease(MultiXactGenLock);
+
+	/*
+	 * Determine an initial safe stop point for multixact member wraparound.
+	 * If we are starting up without entering recovery, none of this work has
+	 * been done yet.  Even if we did recovery, the stop point might not be
+	 * set yet if the checks in TruncateMultiXact skipped truncation every
+	 * time.
+	 *
+	 * PostgreSQL 9.3.0 through 9.3.6 and PostgreSQL 9.4.0 through 9.4.1
+	 * had bugs that could allow users who reached those releases through
+	 * pg_upgrade from an earlier release to end up with the bogus value of 1
+	 * for datminmxid, and that value could sometimes propagate itself back
+	 * into pg_control.  To defend against that, if the oldest multixact
+	 * value that we have doesn't actually exist on disk, use the oldest one
+	 * that does to set the safe stop point.
+	 */
+	earliest = GetOldestMultiXactOnDisk();
+	Assert(MultiXactIdIsValid(lastCheckpointedOldest));
+	if (MultiXactIdPrecedes(lastCheckpointedOldest, earliest))
+		DetermineSafeOldestOffset(earliest);
+	else
+		DetermineSafeOldestOffset(lastCheckpointedOldest);
+
+	/*
+	 * Determine the oldest offset that we still need to worry about, for
+	 * purposes of establishing how aggressively to vacuum.  This is based
+	 * on the oldest value that we believe to be present in any table, rather
+	 * than the oldest value we believe to be present on disk.  That's so that
+	 * vacuum doesn't go crazy trying to remove multixacts that it's already
+	 * cleaned out but which have not yet been removed by a checkpoint.
+	 */
+	oldestOffset = GetOldestReferencedOffset(oldestMultiXactId);
+	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
+	MultiXactState->oldestOffset = oldestOffset;
+	MultiXactState->didTrimMultiXact = true;		/* now fully initialized */
+	LWLockRelease(MultiXactGenLock);
 }
 
 /*
@@ -2165,12 +2222,19 @@ SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid)
 	MultiXactId multiStopLimit;
 	MultiXactId multiWrapLimit;
 	MultiXactId curMulti;
-	MultiXactOffset oldestOffset;
+	MultiXactOffset oldestOffset = 0;	/* keep compiler happy */
 	MultiXactOffset nextOffset;
+	bool		did_trim;
 
 	Assert(MultiXactIdIsValid(oldest_datminmxid));
 
 	/*
+	 * We can read this without a lock, because it only changes when nothing
+	 * else is running.
+	 */
+	did_trim = MultiXactState->didTrimMultiXact;
+
+	/*
 	 * We pretend that a wrap will happen halfway through the multixact ID
 	 * space, but that's not really true, because multixacts wrap differently
 	 * from transaction IDs.  Note that, separately from any concern about
@@ -2221,32 +2285,13 @@ SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid)
 
 	/*
 	 * Determine the offset of the oldest multixact that might still be
-	 * referenced.  Normally, we can read the offset from the multixact itself,
-	 * but there's an important special case: if there are no multixacts in
-	 * existence at all, oldest_datminmxid obviously can't point to one.  It
-	 * will instead point to the multixact ID that will be assigned the next
-	 * time one is needed.
-	 *
-	 * NB: oldest_dataminmxid is the oldest multixact that might still be
-	 * referenced from a table, unlike in DetermineSafeOldestOffset, where we
-	 * do this same computation based on the oldest value that might still
-	 * exist in the SLRU.  This is because here we're trying to compute a
-	 * threshold for activating autovacuum, which can only remove references
-	 * to multixacts, whereas there we are computing a threshold for creating
-	 * new multixacts, which requires the old ones to have first been
-	 * truncated away by a checkpoint.
+	 * referenced, if we're done with recovery.  It isn't safe to do this
+	 * any earlier, because the database might be inconsistent.
+	 * Fortunately, we don't need it then anyway, because this only controls
+	 * the behavior of vacuum.
 	 */
-	LWLockAcquire(MultiXactGenLock, LW_SHARED);
-	if (MultiXactState->nextMXact == oldest_datminmxid)
-	{
-		oldestOffset = MultiXactState->nextOffset;
-		LWLockRelease(MultiXactGenLock);
-	}
-	else
-	{
-		LWLockRelease(MultiXactGenLock);
-		oldestOffset = find_multixact_start(oldest_datminmxid);
-	}
+	if (did_trim)
+		oldestOffset = GetOldestReferencedOffset(oldest_datminmxid);
 
 	/* Grab lock for just long enough to set the new limit values */
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
@@ -2275,11 +2320,11 @@ SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid)
 	 */
 	if ((MultiXactIdPrecedes(multiVacLimit, curMulti) ||
 		 (nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD)) &&
-		IsUnderPostmaster && !InRecovery)
+		IsUnderPostmaster && did_trim)
 		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
 
 	/* Give an immediate warning if past the wrap warn point */
-	if (MultiXactIdPrecedes(multiWarnLimit, curMulti) && !InRecovery)
+	if (MultiXactIdPrecedes(multiWarnLimit, curMulti) && did_trim)
 	{
 		char	   *oldest_datname;
 
@@ -2319,6 +2364,49 @@ SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid)
 }
 
 /*
+ * Get the offset of the oldest MultiXact that might still be referenced from
+ * a table somewhere.
+ */
+static MultiXactOffset
+GetOldestReferencedOffset(MultiXactId oldestMXact)
+{
+	MultiXactId		earliest;
+	MultiXactOffset oldestOffset;
+
+	/*
+	 * Because of bugs in early 9.3.X and 9.4.X releases (see comments in
+	 * TrimMultiXact for details), oldestMXact might point to a
+	 * nonexistent multixact.  If so, use the oldest one that actually
+	 * exists.  Anything before this can't be successfully used anyway.
+	 */
+	earliest = GetOldestMultiXactOnDisk();
+	if (MultiXactIdPrecedes(oldestMXact, earliest))
+		oldestMXact = earliest;
+
+	LWLockAcquire(MultiXactGenLock, LW_SHARED);
+	if (MultiXactState->nextMXact == oldestMXact)
+	{
+		/*
+		 * If there are no multixacts in existence at all, oldestMXact
+		 * obviously can't point to one.  It will instead point to the
+		 * multixact ID that will be assigned the next time one is needed.
+		 * But that means we can't look up its members, because it doesn't
+		 * exist yet.  Instead, return the offset that will be assigned to
+		 * it when it gets created.
+		 */
+		oldestOffset = MultiXactState->nextOffset;
+		LWLockRelease(MultiXactGenLock);
+	}
+	else
+	{
+		LWLockRelease(MultiXactGenLock);
+		oldestOffset = find_multixact_start(oldestMXact);
+	}
+
+	return oldestOffset;
+}
+
+/*
  * Ensure the next-to-be-assigned MultiXactId is at least minMulti,
  * and similarly nextOffset is at least minMultiOffset.
  *
@@ -2821,8 +2909,7 @@ TruncateMultiXact(void)
 {
 	MultiXactId		oldestMXact;
 	MultiXactOffset oldestOffset;
-	MultiXactOffset	nextOffset;
-	mxtruncinfo trunc;
+	MultiXactOffset nextOffset;
 	MultiXactId earliest;
 	MembersLiveRange	range;
 
@@ -2835,19 +2922,20 @@ TruncateMultiXact(void)
 	Assert(MultiXactIdIsValid(oldestMXact));
 
 	/*
-	 * Note we can't just plow ahead with the truncation; it's possible that
-	 * there are no segments to truncate, which is a problem because we are
-	 * going to attempt to read the offsets page to determine where to
-	 * truncate the members SLRU.  So we first scan the directory to determine
-	 * the earliest offsets page number that we can read without error.
+	 * We must be careful here, because we may be in recovery.  If we're here,
+	 * we've replayed at least one checkpoint, but have not yet reached the
+	 * minimum recovery point, so the truncation may have already been done.
+	 *
+	 * But even if we're in normal running, we still need to be careful.
+	 * PostgreSQL 9.3.0 through 9.3.6 and PostgreSQL 9.4.0 through 9.4.1
+	 * had bugs that could allow users who reached those releases through
+	 * pg_upgrade from an earlier release to end up with the bogus value of 1
+	 * for datminmxid, and that value could sometimes propagate itself back
+	 * into pg_control.  So it's possible that oldestMXact precedes the
+	 * earliest value on disk even in normal running.  If that happens, we
+	 * skip truncation.
 	 */
-	trunc.earliestExistingPage = -1;
-	SlruScanDirectory(MultiXactOffsetCtl, SlruScanDirCbFindEarliest, &trunc);
-	earliest = trunc.earliestExistingPage * MULTIXACT_OFFSETS_PER_PAGE;
-	if (earliest < FirstMultiXactId)
-		earliest = FirstMultiXactId;
-
-	/* nothing to do */
+	earliest = GetOldestMultiXactOnDisk();
 	if (MultiXactIdPrecedes(oldestMXact, earliest))
 		return;
 
@@ -2879,7 +2967,15 @@ TruncateMultiXact(void)
 	SimpleLruTruncate(MultiXactOffsetCtl,
 					  MultiXactIdToOffsetPage(oldestMXact));
 
-	
+	/* Update oldest-on-disk value in shared memory. */
+	earliest = range.rangeStart * MULTIXACT_OFFSETS_PER_PAGE;
+	if (earliest < FirstMultiXactId)
+		earliest = FirstMultiXactId;
+	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
+	Assert(MultiXactState->oldestMultiXactOnDiskValid);
+	MultiXactState->oldestMultiXactOnDisk = earliest;
+	LWLockRelease(MultiXactGenLock);
+
 	/*
 	 * Now, and only now, we can advance the stop point for multixact members.
 	 * If we did it any sooner, the segments we deleted above might already
@@ -2889,6 +2985,47 @@ TruncateMultiXact(void)
 }
 
 /*
+ * Scan pg_multixact/offsets to determine the earliest offsets page number
+ * that we can read without error.
+ */
+static MultiXactId
+GetOldestMultiXactOnDisk(void)
+{
+	mxtruncinfo trunc;
+	MultiXactId earliest;
+	bool		valid;
+
+	/* Read values from shared memory. */
+	LWLockAcquire(MultiXactGenLock, LW_SHARED);
+	earliest = MultiXactState->oldestMultiXactOnDisk;
+	valid = MultiXactState->oldestMultiXactOnDiskValid;
+	LWLockRelease(MultiXactGenLock);
+
+	/* If the value we read is valid, just return it. */
+	if (valid)
+		return earliest;
+
+	/*
+	 * We haven't scanned the directory yet, so do that now.  This will
+	 * give us the earliest offsets page number that we can read without
+	 * error.
+	 */
+	trunc.earliestExistingPage = -1;
+	SlruScanDirectory(MultiXactOffsetCtl, SlruScanDirCbFindEarliest, &trunc);
+	earliest = trunc.earliestExistingPage * MULTIXACT_OFFSETS_PER_PAGE;
+	if (earliest < FirstMultiXactId)
+		earliest = FirstMultiXactId;
+
+	/* Update oldest-on-disk value in shared memory. */
+	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
+	MultiXactState->oldestMultiXactOnDisk = earliest;
+	MultiXactState->oldestMultiXactOnDiskValid = true;
+	LWLockRelease(MultiXactGenLock);
+
+	return earliest;
+}
+
+/*
  * Decide which of two MultiXactOffset page numbers is "older" for truncation
  * purposes.
  *
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 9568ff1..aeff510 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -199,8 +199,9 @@ typedef struct MultiXactStateData
 	MultiXactOffset nextOffset;
 
 	/*
-	 * Oldest multixact that is still on disk.  Anything older than this
-	 * should not be consulted.  These values are updated by vacuum.
+	 * Oldest multixact that may still be referenced from a relation.
+	 * Anything older than this should not be consulted.  These values are
+	 * updated by vacuum.
 	 */
 	MultiXactId oldestMultiXactId;
 	Oid			oldestMultiXactDB;
@@ -213,6 +214,18 @@ typedef struct MultiXactStateData
 	 */
 	MultiXactId lastCheckpointedOldest;
 
+	/*
+	 * This is the oldest file that actually exists on the disk.  This value
+	 * is initialized by scanning pg_multixact/offsets, and subsequently
+	 * updated each time we complete a truncation.  We need a flag to
+	 * indicate whether this has been initialized yet.
+	 */
+	MultiXactId oldestMultiXactOnDisk;
+	bool		oldestMultiXactOnDiskValid;
+
+	/* Has TrimMultiXact been called yet? */
+	bool		didTrimMultiXact;
+
 	/* support for anti-wraparound measures */
 	MultiXactId multiVacLimit;
 	MultiXactId multiWarnLimit;
@@ -344,6 +357,8 @@ static char *mxstatus_to_string(MultiXactStatus status);
 /* management of SLRU infrastructure */
 static int	ZeroMultiXactOffsetPage(int pageno, bool writeXlog);
 static int	ZeroMultiXactMemberPage(int pageno, bool writeXlog);
+static MultiXactId GetOldestMultiXactOnDisk(void);
+static MultiXactOffset GetOldestReferencedOffset(MultiXactId oldestMXact);
 static bool MultiXactOffsetPagePrecedes(int page1, int page2);
 static bool MultiXactMemberPagePrecedes(int page1, int page2);
 static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
@@ -1956,12 +1971,6 @@ StartupMultiXact(void)
 	 */
 	pageno = MXOffsetToMemberPage(offset);
 	MultiXactMemberCtl->shared->latest_page_number = pageno;
-
-	/*
-	 * compute the oldest member we need to keep around to avoid old member
-	 * data overrun.
-	 */
-	DetermineSafeOldestOffset(MultiXactState->oldestMultiXactId);
 }
 
 /*
@@ -1978,7 +1987,10 @@ TrimMultiXact(void)
 	int			pageno;
 	int			entryno;
 	int			flagsoff;
-
+	MultiXactId lastCheckpointedOldest;
+	MultiXactId oldestMultiXactId;
+	MultiXactId earliest;
+	MultiXactOffset oldestOffset;
 
 	/* Clean up offsets state */
 	LWLockAcquire(MultiXactOffsetControlLock, LW_EXCLUSIVE);
@@ -2047,6 +2059,51 @@ TrimMultiXact(void)
 	}
 
 	LWLockRelease(MultiXactMemberControlLock);
+
+	/*
+	 * Read values from shared memory so that we can establish member
+	 * wraparound defenses.
+	 */
+	LWLockAcquire(MultiXactGenLock, LW_SHARED);
+	lastCheckpointedOldest = MultiXactState->lastCheckpointedOldest;
+	oldestMultiXactId = MultiXactState->oldestMultiXactId;
+	LWLockRelease(MultiXactGenLock);
+
+	/*
+	 * Determine an initial safe stop point for multixact member wraparound.
+	 * If we are starting up without entering recovery, none of this work has
+	 * been done yet.  Even if we did recovery, the stop point might not be
+	 * set yet if the checks in TruncateMultiXact skipped truncation every
+	 * time.
+	 *
+	 * PostgreSQL 9.3.0 through 9.3.6 and PostgreSQL 9.4.0 through 9.4.1
+	 * had bugs that could allow users who reached those releases through
+	 * pg_upgrade from an earlier release to end up with the bogus value of 1
+	 * for datminmxid, and that value could sometimes propagate itself back
+	 * into pg_control.  To defend against that, if the oldest multixact
+	 * value that we have doesn't actually exist on disk, use the oldest one
+	 * that does to set the safe stop point.
+	 */
+	earliest = GetOldestMultiXactOnDisk();
+	Assert(MultiXactIdIsValid(lastCheckpointedOldest));
+	if (MultiXactIdPrecedes(lastCheckpointedOldest, earliest))
+		DetermineSafeOldestOffset(earliest);
+	else
+		DetermineSafeOldestOffset(lastCheckpointedOldest);
+
+	/*
+	 * Determine the oldest offset that we still need to worry about, for
+	 * purposes of establishing how aggressively to vacuum.  This is based
+	 * on the oldest value that we believe to be present in any table, rather
+	 * than the oldest value we believe to be present on disk.  That's so that
+	 * vacuum doesn't go crazy trying to remove multixacts that it's already
+	 * cleaned out but which have not yet been removed by a checkpoint.
+	 */
+	oldestOffset = GetOldestReferencedOffset(oldestMultiXactId);
+	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
+	MultiXactState->oldestOffset = oldestOffset;
+	MultiXactState->didTrimMultiXact = true;		/* now fully initialized */
+	LWLockRelease(MultiXactGenLock);
 }
 
 /*
@@ -2146,12 +2203,19 @@ SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid)
 	MultiXactId multiStopLimit;
 	MultiXactId multiWrapLimit;
 	MultiXactId curMulti;
-	MultiXactOffset oldestOffset;
+	MultiXactOffset oldestOffset = 0;	/* keep compiler happy */
 	MultiXactOffset nextOffset;
+	bool		did_trim;
 
 	Assert(MultiXactIdIsValid(oldest_datminmxid));
 
 	/*
+	 * We can read this without a lock, because it only changes when nothing
+	 * else is running.
+	 */
+	did_trim = MultiXactState->didTrimMultiXact;
+
+	/*
 	 * We pretend that a wrap will happen halfway through the multixact ID
 	 * space, but that's not really true, because multixacts wrap differently
 	 * from transaction IDs.  Note that, separately from any concern about
@@ -2202,32 +2266,13 @@ SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid)
 
 	/*
 	 * Determine the offset of the oldest multixact that might still be
-	 * referenced.  Normally, we can read the offset from the multixact
-	 * itself, but there's an important special case: if there are no
-	 * multixacts in existence at all, oldest_datminmxid obviously can't point
-	 * to one.  It will instead point to the multixact ID that will be
-	 * assigned the next time one is needed.
-	 *
-	 * NB: oldest_dataminmxid is the oldest multixact that might still be
-	 * referenced from a table, unlike in DetermineSafeOldestOffset, where we
-	 * do this same computation based on the oldest value that might still
-	 * exist in the SLRU.  This is because here we're trying to compute a
-	 * threshold for activating autovacuum, which can only remove references
-	 * to multixacts, whereas there we are computing a threshold for creating
-	 * new multixacts, which requires the old ones to have first been
-	 * truncated away by a checkpoint.
+	 * referenced, if we're done with recovery.  It isn't safe to do this
+	 * any earlier, because the database might be inconsistent.
+	 * Fortunately, we don't need it then anyway, because this only controls
+	 * the behavior of vacuum.
 	 */
-	LWLockAcquire(MultiXactGenLock, LW_SHARED);
-	if (MultiXactState->nextMXact == oldest_datminmxid)
-	{
-		oldestOffset = MultiXactState->nextOffset;
-		LWLockRelease(MultiXactGenLock);
-	}
-	else
-	{
-		LWLockRelease(MultiXactGenLock);
-		oldestOffset = find_multixact_start(oldest_datminmxid);
-	}
+	if (did_trim)
+		oldestOffset = GetOldestReferencedOffset(oldest_datminmxid);
 
 	/* Grab lock for just long enough to set the new limit values */
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
@@ -2256,11 +2301,11 @@ SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid)
 	 */
 	if ((MultiXactIdPrecedes(multiVacLimit, curMulti) ||
 		 (nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD)) &&
-		IsUnderPostmaster && !InRecovery)
+		IsUnderPostmaster && did_trim)
 		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
 
 	/* Give an immediate warning if past the wrap warn point */
-	if (MultiXactIdPrecedes(multiWarnLimit, curMulti) && !InRecovery)
+	if (MultiXactIdPrecedes(multiWarnLimit, curMulti) && did_trim)
 	{
 		char	   *oldest_datname;
 
@@ -2300,6 +2345,49 @@ SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid)
 }
 
 /*
+ * Get the offset of the oldest MultiXact that might still be referenced from
+ * a table somewhere.
+ */
+static MultiXactOffset
+GetOldestReferencedOffset(MultiXactId oldestMXact)
+{
+	MultiXactId		earliest;
+	MultiXactOffset oldestOffset;
+
+	/*
+	 * Because of bugs in early 9.3.X and 9.4.X releases (see comments in
+	 * TrimMultiXact for details), oldestMXact might point to a
+	 * nonexistent multixact.  If so, use the oldest one that actually
+	 * exists.  Anything before this can't be successfully used anyway.
+	 */
+	earliest = GetOldestMultiXactOnDisk();
+	if (MultiXactIdPrecedes(oldestMXact, earliest))
+		oldestMXact = earliest;
+
+	LWLockAcquire(MultiXactGenLock, LW_SHARED);
+	if (MultiXactState->nextMXact == oldestMXact)
+	{
+		/*
+		 * If there are no multixacts in existence at all, oldestMXact
+		 * obviously can't point to one.  It will instead point to the
+		 * multixact ID that will be assigned the next time one is needed.
+		 * But that means we can't look up its members, because it doesn't
+		 * exist yet.  Instead, return the offset that will be assigned to
+		 * it when it gets created.
+		 */
+		oldestOffset = MultiXactState->nextOffset;
+		LWLockRelease(MultiXactGenLock);
+	}
+	else
+	{
+		LWLockRelease(MultiXactGenLock);
+		oldestOffset = find_multixact_start(oldestMXact);
+	}
+
+	return oldestOffset;
+}
+
+/*
  * Ensure the next-to-be-assigned MultiXactId is at least minMulti,
  * and similarly nextOffset is at least minMultiOffset.
  *
@@ -2802,7 +2890,6 @@ TruncateMultiXact(void)
 	MultiXactId oldestMXact;
 	MultiXactOffset oldestOffset;
 	MultiXactOffset nextOffset;
-	mxtruncinfo trunc;
 	MultiXactId earliest;
 	MembersLiveRange range;
 
@@ -2815,19 +2902,20 @@ TruncateMultiXact(void)
 	Assert(MultiXactIdIsValid(oldestMXact));
 
 	/*
-	 * Note we can't just plow ahead with the truncation; it's possible that
-	 * there are no segments to truncate, which is a problem because we are
-	 * going to attempt to read the offsets page to determine where to
-	 * truncate the members SLRU.  So we first scan the directory to determine
-	 * the earliest offsets page number that we can read without error.
+	 * We must be careful here, because we may be in recovery.  If we're here,
+	 * we've replayed at least one checkpoint, but have not yet reached the
+	 * minimum recovery point, so the truncation may have already been done.
+	 *
+	 * But even if we're in normal running, we still need to be careful.
+	 * PostgreSQL 9.3.0 through 9.3.6 and PostgreSQL 9.4.0 through 9.4.1
+	 * had bugs that could allow users who reached those releases through
+	 * pg_upgrade from an earlier release to end up with the bogus value of 1
+	 * for datminmxid, and that value could sometimes propagate itself back
+	 * into pg_control.  So it's possible that oldestMXact precedes the
+	 * earliest value on disk even in normal running.  If that happens, we
+	 * skip truncation.
 	 */
-	trunc.earliestExistingPage = -1;
-	SlruScanDirectory(MultiXactOffsetCtl, SlruScanDirCbFindEarliest, &trunc);
-	earliest = trunc.earliestExistingPage * MULTIXACT_OFFSETS_PER_PAGE;
-	if (earliest < FirstMultiXactId)
-		earliest = FirstMultiXactId;
-
-	/* nothing to do */
+	earliest = GetOldestMultiXactOnDisk();
 	if (MultiXactIdPrecedes(oldestMXact, earliest))
 		return;
 
@@ -2859,6 +2947,14 @@ TruncateMultiXact(void)
 	SimpleLruTruncate(MultiXactOffsetCtl,
 					  MultiXactIdToOffsetPage(oldestMXact));
 
+	/* Update oldest-on-disk value in shared memory. */
+	earliest = range.rangeStart * MULTIXACT_OFFSETS_PER_PAGE;
+	if (earliest < FirstMultiXactId)
+		earliest = FirstMultiXactId;
+	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
+	Assert(MultiXactState->oldestMultiXactOnDiskValid);
+	MultiXactState->oldestMultiXactOnDisk = earliest;
+	LWLockRelease(MultiXactGenLock);
 
 	/*
 	 * Now, and only now, we can advance the stop point for multixact members.
@@ -2869,6 +2965,47 @@ TruncateMultiXact(void)
 }
 
 /*
+ * Scan pg_multixact/offsets to determine the earliest offsets page number
+ * that we can read without error.
+ */
+static MultiXactId
+GetOldestMultiXactOnDisk(void)
+{
+	mxtruncinfo trunc;
+	MultiXactId earliest;
+	bool		valid;
+
+	/* Read values from shared memory. */
+	LWLockAcquire(MultiXactGenLock, LW_SHARED);
+	earliest = MultiXactState->oldestMultiXactOnDisk;
+	valid = MultiXactState->oldestMultiXactOnDiskValid;
+	LWLockRelease(MultiXactGenLock);
+
+	/* If the value we read is valid, just return it. */
+	if (valid)
+		return earliest;
+
+	/*
+	 * We haven't scanned the directory yet, so do that now.  This will
+	 * give us the earliest offsets page number that we can read without
+	 * error.
+	 */
+	trunc.earliestExistingPage = -1;
+	SlruScanDirectory(MultiXactOffsetCtl, SlruScanDirCbFindEarliest, &trunc);
+	earliest = trunc.earliestExistingPage * MULTIXACT_OFFSETS_PER_PAGE;
+	if (earliest < FirstMultiXactId)
+		earliest = FirstMultiXactId;
+
+	/* Update oldest-on-disk value in shared memory. */
+	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
+	MultiXactState->oldestMultiXactOnDisk = earliest;
+	MultiXactState->oldestMultiXactOnDiskValid = true;
+	LWLockRelease(MultiXactGenLock);
+
+	return earliest;
+}
+
+/*
  * Decide which of two MultiXactOffset page numbers is "older" for truncation
  * purposes.
  *
-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)