Re: [HACKERS] Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

Robert Haas <robertmhaas@xxxxxxxxx> · Wed, 27 May 2015 21:52:51 -0400

On Wed, May 27, 2015 at 6:21 PM, Alvaro Herrera
<alvherre@xxxxxxxxxxxxxxx> wrote:
> Steve Kehlet wrote:
>> I have a database that was upgraded from 9.4.1 to 9.4.2 (no pg_upgrade, we
>> just dropped new binaries in place) but it wouldn't start up. I found this
>> in the logs:
>>
>> waiting for server to start....2015-05-27 13:13:00 PDT [27341]: [1-1] LOG:
>>  database system was shut down at 2015-05-27 13:12:55 PDT
>> 2015-05-27 13:13:00 PDT [27342]: [1-1] FATAL:  the database system is
>> starting up
>> .2015-05-27 13:13:00 PDT [27341]: [2-1] FATAL:  could not access status of
>> transaction 1
>
> I am currently debugging a problem that looks very similar to
> this.  AFAICT the problem is that WAL replay of an online checkpoint in
> which multixact files are removed fails because replay tries to read a
> file that has already been removed.

Hmm, so what exactly is the sequence of events here?  It's possible
that I'm not thinking clearly just now, but it seems to me that if
we're replaying the same checkpoint we replayed previously, the offset
of the oldest multixact will be the first file that we didn't remove.
However, I can see that there could be a problem if we try to replay
an older checkpoint after having already replayed a newer one - for
example, if a standby replays checkpoint A truncating the members
multixact and performs a restartpoint, and then replays checkpoint B
truncating the members multixact again but without performing a
restartpoint, and then is shut down, it will resume replay from
checkpoint A, and trouble will ensue.  Is that the scenario, or is
there something else?
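
To make the two cases concrete, here is a toy model.  It is not
PostgreSQL code: segment_present, init_segments and replay_checkpoint
are names made up for illustration, and a "checkpoint" is reduced to
the number of the segment holding its oldest multixact.

```c
#include <stdbool.h>

#define NSEGMENTS 10

/* Hypothetical stand-in for the multixact segment files on disk. */
static bool segment_present[NSEGMENTS];

static void
init_segments(void)
{
    for (int i = 0; i < NSEGMENTS; i++)
        segment_present[i] = true;
}

/*
 * Toy replay of a checkpoint record: look up the segment holding the
 * checkpoint's oldest multixact, then remove everything older.
 * Returns false if that segment is already gone, which is where the
 * real server would fail to read the file.
 */
static bool
replay_checkpoint(int oldest_segment)
{
    if (!segment_present[oldest_segment])
        return false;

    for (int i = 0; i < oldest_segment; i++)
        segment_present[i] = false;

    return true;
}
```

In this model, replaying A (oldest segment 3) twice in a row is
harmless, because segment 3 is the first one that was not removed.
Replaying A, then B (oldest segment 6), then A again fails, because B's
truncation removed segment 3.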

> I think the fix to this is to verify whether the file exists on disk
> before reading it; if it doesn't, assume the truncation has already
> happened and that it's not necessary to remove it.

That might be an OK fix, but this implementation doesn't seem very
clean.  If we're going to remove the invariant that
MultiXactState->oldestOffset will always be valid after replaying a
checkpoint, then we should be explicit about that and add a flag
indicating whether or not it's currently valid.  Shoving nextOffset in
there and hoping that's good enough seems like a bad idea to me.

I think we should modify the API for find_multixact_start.  Let's have
it return a Boolean and return oldestOffset via an out parameter.  If
!InRecovery, it will always return true and set the out parameter; but
if in recovery, it is allowed to return false without setting the out
parameter.  Both values can get stored in MultiXactState, and we can
adjust the logic elsewhere to disregard oldestOffset when the
accompanying flag is false.
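
Roughly what I have in mind, as a standalone sketch rather than a
patch.  The struct is a cut-down stand-in for MultiXactState; the
oldestOffsetKnown flag, set_oldest_offset, and the helpers
offsets_segment_exists and read_offset (plus the segment_exists
switch) are assumptions invented here so that the example compiles on
its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t MultiXactId;
typedef uint32_t MultiXactOffset;

/* Cut-down stand-in for the shared state; the flag is hypothetical. */
typedef struct MultiXactStateData
{
    MultiXactOffset oldestOffset;       /* meaningful only if flag is set */
    bool            oldestOffsetKnown;  /* hypothetical validity flag */
} MultiXactStateData;

/* Stand-ins for server state and the SLRU lookup (illustration only). */
static bool InRecovery = false;
static bool segment_exists = true;

static bool
offsets_segment_exists(MultiXactId multi)
{
    (void) multi;
    return segment_exists;
}

static MultiXactOffset
read_offset(MultiXactId multi)
{
    return multi * 10;          /* dummy value, not a real lookup */
}

/*
 * Proposed shape: return true and set *result when the offset can be
 * determined.  Only during recovery may it return false, in which case
 * *result is left untouched.
 */
static bool
find_multixact_start(MultiXactId multi, MultiXactOffset *result)
{
    assert(result != NULL);

    if (InRecovery && !offsets_segment_exists(multi))
        return false;

    *result = read_offset(multi);
    return true;
}

/* Store both the value and its validity, never a guessed offset. */
static void
set_oldest_offset(MultiXactStateData *state, MultiXactId oldestMulti)
{
    MultiXactOffset offset;

    state->oldestOffsetKnown = find_multixact_start(oldestMulti, &offset);
    if (state->oldestOffsetKnown)
        state->oldestOffset = offset;
}
```

Callers would then test oldestOffsetKnown before trusting oldestOffset,
instead of being handed nextOffset as a substitute.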

This still leaves open an ugly possibility: can we reach normal
running without a valid oldestOffset?  If so, until the next
checkpoint happens, autovacuum has no clue whether it needs to worry.
There's got to be a fix for that, but it escapes me at the moment.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
