Re: [HACKERS] Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

Alvaro Herrera <alvherre@xxxxxxxxxxxxxxx> · Wed, 3 Jun 2015 16:04:47 -0300

Andres Freund wrote:
> On 2015-06-03 15:01:46 -0300, Alvaro Herrera wrote:

> > One idea I had was: what if the oldestMulti pointed to another multi
> > earlier in the same 0046 file, so that it is read-as-zeroes (and the
> > file is created), and then a subsequent multixact truncate tries to read
> > a later page in the file.  In SlruPhysicalReadPage() this would give a
> > chance for open() to not fail, and then read() can fail as above.
> > However, we would have an earlier LOG message about "reading as zeroes".
> > 
> > Really, the whole question of how this code goes past the open() failure
> > in SlruPhysicalReadPage baffles me.  I don't see any possible way for
> > the file to be created ...
> 
> Wouldn't a previous WAL record zeroing another part of that segment
> explain this? A zero sized segment pretty much would lead to this error,
> right? Or were you able to check how things look after the failure?

But why would there be a previous WAL record zeroing another part of
that segment?  Note that this segment is very old -- it hasn't been
written to in quite a while, so it's certainly not in the SLRU buffers
anymore.

> > 2015-05-27 16:15:17 UTC [4782]: [3-1] user=,db= LOG: entering standby mode
> > 2015-05-27 16:15:18 UTC [4782]: [4-1] user=,db= LOG: restored log file "00000001000073DD000000AD" from archive
> > 2015-05-27 16:15:18 UTC [4782]: [5-1] user=,db= FATAL: could not access status of transaction 4624559
> > 2015-05-27 16:15:18 UTC [4782]: [6-1] user=,db= DETAIL: Could not read from file "pg_multixact/offsets/0046" at offset 147456: Success.
> > 2015-05-27 16:15:18 UTC [4778]: [4-1] user=,db= LOG: startup process (PID 4782) exited with exit code 1
> > 2015-05-27 16:15:18 UTC [4778]: [5-1] user=,db= LOG: aborting startup due to startup process failure
> 
> From this it isn't entirely clear where this error was triggered from.

Well, reading the code, it seems reasonable to assume that replay of
the checkpoint record I mentioned leads to that error message when the
file exists but is not long enough to contain the given offset.  There
are no MultiXact WAL records in the segment.  Also note that there's no
other "restored log file" message after the "entering standby mode"
message.
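
To make the arithmetic concrete, here is a rough sketch (Python, not the
actual slru.c code) of how the multixact ID in the log maps to a segment
file and page offset, and of what a read past the end of a too-short file
returns.  The constants are assumptions for a default build (8 kB blocks,
4-byte offset entries, 32 pages per SLRU segment); the helper names and
the one-page file are made up for illustration.

```python
import os
import tempfile

# Assumed defaults, not stated in this thread: 8 kB blocks, 4-byte
# MultiXactOffset entries, 32 pages per SLRU segment.
BLCKSZ = 8192
ENTRY_SIZE = 4
PAGES_PER_SEGMENT = 32
ENTRIES_PER_PAGE = BLCKSZ // ENTRY_SIZE


def locate_multixact_offset(multi):
    """Return (segment file name, byte offset of the page in that segment)."""
    page = multi // ENTRIES_PER_PAGE
    segment = page // PAGES_PER_SEGMENT
    page_in_segment = page % PAGES_PER_SEGMENT
    return "%04X" % segment, page_in_segment * BLCKSZ


def read_page(path, offset):
    """Mimic an open/lseek/read sequence; return the number of bytes read."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.lseek(fd, offset, os.SEEK_SET)
        data = os.read(fd, BLCKSZ)
    finally:
        os.close(fd)
    return len(data)


def demo():
    """Locate the multixact from the log, then read from a too-short file."""
    name, offset = locate_multixact_offset(4624559)
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, name)
        # Hypothetical state: the segment exists but holds only one page.
        with open(path, "wb") as f:
            f.write(b"\0" * BLCKSZ)
        nread = read_page(path, offset)
    return name, offset, nread


if __name__ == "__main__":
    print(demo())
```

Under those assumptions the ID lands in file 0046 at offset 147456, which
matches the DETAIL line.  The read past end-of-file returns zero bytes
without raising an error, which would be consistent with the message
ending in "Success": a short read is not an OS-level failure, so there is
no errno to report.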

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
