Re: [HACKERS] Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

Alvaro Herrera wrote:
> Robert Haas wrote:

> > In the process of investigating this, we found a few other things that
> > seem like they may also be bugs:
> > 
> > - As noted upthread, replaying an older checkpoint after a newer
> > checkpoint has already happened may lead to similar problems.  This
> > may be possible when starting from an online base backup; or when
> > restarting a standby that did not perform a restartpoint when
> > replaying the last checkpoint before the shutdown.
> 
> I'm going through this one now, as it's closely related and caused
> issues for us.

FWIW I've spent a rather long while trying to reproduce the issue, but
haven't been able to figure it out.  Thomas already commented on it: the
server notices that a file is missing and, instead of failing, it "reads
the file as zeroes".  This is because of this hack in slru.c, which is
there to cover for a pg_clog replay consideration:

	/*
	 * In a crash-and-restart situation, it's possible for us to receive
	 * commands to set the commit status of transactions whose bits are in
	 * already-truncated segments of the commit log (see notes in
	 * SlruPhysicalWritePage).  Hence, if we are InRecovery, allow the case
	 * where the file doesn't exist, and return zeroes instead.
	 */
	fd = OpenTransientFile(path, O_RDWR | PG_BINARY, S_IRUSR | S_IWUSR);
	if (fd < 0)
	{
		if (errno != ENOENT || !InRecovery)
		{
			slru_errcause = SLRU_OPEN_FAILED;
			slru_errno = errno;
			return false;
		}

		ereport(LOG,
				(errmsg("file \"%s\" doesn't exist, reading as zeroes",
						path)));
		MemSet(shared->page_buffer[slotno], 0, BLCKSZ);
		return true;
	}

I was able to cause an actual problem by #ifdef'ing out the "if
errno/InRecovery" line, of course, but how can I be sure that fixing
that would also fix my customer's problem?  That's no good.
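To be concrete, the hack I tested amounts to something like this (a
sketch, not the exact diff): with the condition compiled out, the error
branch runs unconditionally, so replay fails as soon as a segment file
is missing instead of reading it as zeroes.

	fd = OpenTransientFile(path, O_RDWR | PG_BINARY, S_IRUSR | S_IWUSR);
	if (fd < 0)
	{
		/* condition disabled: a missing file is fatal even in recovery */
#if 0
		if (errno != ENOENT || !InRecovery)
#endif
		{
			slru_errcause = SLRU_OPEN_FAILED;
			slru_errno = errno;
			return false;
		}

		/* the "read as zeroes" fallback below is now unreachable */
		ereport(LOG,
				(errmsg("file \"%s\" doesn't exist, reading as zeroes",
						path)));
		MemSet(shared->page_buffer[slotno], 0, BLCKSZ);
		return true;
	}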

Anyway, here's a quick script to almost-reproduce the problem.  I
constructed many variations of this, trying to find the failing one, to
no avail.  Note I'm using the pg_burn_multixact() function to create
multixacts quickly (this is cheating in itself, but it seems to me that
if I'm not able to reproduce the problem with this, I'm not able to do
so without it either).  This script takes an exclusive backup
(cp -pr) excluding pg_multixact, then does some multixact stuff
(including truncation), then copies pg_multixact.  Evidently I'm missing
some critical element, but I can't see what it is.
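In outline, it does roughly this (just a sketch; pg_burn_multixact()
comes from a throwaway test module rather than core, and the paths,
labels and counts are placeholders for whatever the local setup uses):

	# primary is already running; create a pile of multixacts
	psql -c "SELECT pg_burn_multixact(100000)"

	# "exclusive" base backup: copy everything except pg_multixact
	psql -c "SELECT pg_start_backup('mxact-test')"
	cp -pr "$PGDATA" "$BACKUPDIR"
	rm -rf "$BACKUPDIR"/pg_multixact

	# more multixact stuff, plus freeze/checkpoint to get them truncated
	psql -c "SELECT pg_burn_multixact(100000)"
	vacuumdb --all --freeze
	psql -c "CHECKPOINT"

	# only now copy pg_multixact into the backup and finish the backup
	cp -pr "$PGDATA"/pg_multixact "$BACKUPDIR"/pg_multixact
	psql -c "SELECT pg_stop_backup()"

	# start recovery on $BACKUPDIR and see whether replay trips over
	# the truncated pg_multixact segments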

Another notable variant was to copy pg_multixact first, then do stuff
(create lots of multixacts, then mxact-freeze/checkpoint),
then copy the rest of the data dir.  No joy either: when replay starts,
the missing multixacts are created on the standby, and so by the time we
reach the checkpoint record the pg_multixact/offsets file has already
grown to the necessary size.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





