Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

Thomas Munro <thomas.munro@xxxxxxxxxxxxxxxx> · Wed, 17 Jun 2015 09:41:32 +1200

On Wed, Jun 17, 2015 at 6:58 AM, Alvaro Herrera
<alvherre@xxxxxxxxxxxxxxx> wrote:
> Thomas Munro wrote:
>
>> Thanks.   As mentioned elsewhere in the thread, I discovered that the
>> same problem exists for page boundaries, with a different error
>> message.  I've tried the attached repro scripts on 9.3.0, 9.3.5, 9.4.1
>> and master with the same results:
>>
>> FATAL:  could not access status of transaction 2048
>> DETAIL:  Could not read from file "pg_multixact/offsets/0000" at
>> offset 8192: Undefined error: 0.
>>
>> FATAL:  could not access status of transaction 131072
>> DETAIL:  Could not open file "pg_multixact/offsets/0002": No such file
>> or directory.
>
> So I checked this bug against current master, because it's claimed to be
> closed.  The first script doesn't emit a message at all; the second
> script does emit a message:
>
> LOG:  could not truncate directory "pg_multixact/offsets": apparent wraparound
>
> If you start and stop again, there's no more noise in the logs.  That's
> pretty innocuous -- great.

Right, I included a fix for this in
https://commitfest.postgresql.org/5/265/ which handles both
pg_subtrans and pg_multixact, since it was lost in the noise in this
thread...  Hopefully someone can review that.

> But then I modified your script to do two segments instead of one.  Then
> after the second cycle is done, start the server and stop it again.  The
> end result is a bit surprising: you end up with no files in
> pg_multixact/offsets at all!

Ouch.  I see why: latest_page_number gets initialised to a different
value when you restart (computed from oldest multixact ID, whereas
during normal running it remembers the last created page number), so
in this case (next == oldest, next % 2048 == 0), restarting the server
moves latest_page_number forwards by one, so SimpleLruTruncate no
longer bails out with the above error message and it happily deletes
all files.  That is conceptually OK (there are no multixacts, so no
files should be OK), but see below...  Applying the patch linked above
prevents this problem (it always keeps at least one multixact and
therefore at least one page and therefore at least one segment,
because it steps back one multixact to avoid boundary problems when
oldest == next).
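
For reference, the page and segment arithmetic behind the boundary case
(next % 2048 == 0) and the file names in the error messages quoted above
can be sketched roughly as follows.  This is an illustration, not the
server's code: the helper names are made up, and the constants assume
the default 8192-byte block size (2048 offsets per page) and 32 pages
per SLRU segment file.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed values: 2048 offsets per 8192-byte page (the figure used in
 * this thread) and 32 pages per SLRU segment file. */
#define OFFSETS_PER_PAGE   2048u
#define PAGES_PER_SEGMENT  32u
#define PAGE_SIZE_BYTES    8192u

/* Page of the offsets SLRU that holds a given multixact ID. */
static uint32_t
mxid_to_page(uint32_t mxid)
{
    return mxid / OFFSETS_PER_PAGE;
}

/* Number of the segment file that holds that page. */
static uint32_t
mxid_to_segment(uint32_t mxid)
{
    return mxid_to_page(mxid) / PAGES_PER_SEGMENT;
}

/* Byte offset of that page within its segment file. */
static uint32_t
mxid_to_file_offset(uint32_t mxid)
{
    return (mxid_to_page(mxid) % PAGES_PER_SEGMENT) * PAGE_SIZE_BYTES;
}

/* Segment file name: the segment number as four hex digits. */
static void
segment_file_name(uint32_t mxid, char *buf, size_t len)
{
    snprintf(buf, len, "%04X", (unsigned) mxid_to_segment(mxid));
}
```

With these assumptions, multixact 2048 is the first entry of page 1,
which sits at byte offset 8192 of file "0000", and multixact 131072 is
the first entry of page 64, which is the first page of file "0002".
Both match the DETAIL lines in the two error messages.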

As for whether it's actually OK to have no files in
pg_multixact/offsets, it seems that if you restart *twice* after
running checkpoint-segment-boundary.sh, you finish up with earliest =
4294965248 in TruncateMultiXact, because this code assumes that there
was at least one file found and then proceeds to assign (-1 * 2048) to
earliest (which is unsigned).

        trunc.earliestExistingPage = -1;
        SlruScanDirectory(MultiXactOffsetCtl, SlruScanDirCbFindEarliest, &trunc);
        earliest = trunc.earliestExistingPage * MULTIXACT_OFFSETS_PER_PAGE;
        if (earliest < FirstMultiXactId)
                earliest = FirstMultiXactId;

I think this should bail out if earliestExistingPage is still -1 after
the call to SlruScanDirectory.
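
A standalone sketch of the wraparound and of the kind of guard I mean,
outside the server.  The function names are hypothetical, the constants
stand in for MULTIXACT_OFFSETS_PER_PAGE and FirstMultiXactId (assumed
to be 2048 and 1), and the actual fix might look different:

```c
#include <stdbool.h>
#include <stdint.h>

#define OFFSETS_PER_PAGE   2048u  /* stands in for MULTIXACT_OFFSETS_PER_PAGE */
#define FIRST_MULTIXACT_ID 1u     /* stands in for FirstMultiXactId */

/*
 * Mirrors the existing arithmetic: the page number is used even if it
 * is still the -1 sentinel, so the unsigned result wraps around.
 */
static uint32_t
earliest_unguarded(int earliest_existing_page)
{
    uint32_t earliest = (uint32_t) earliest_existing_page * OFFSETS_PER_PAGE;

    if (earliest < FIRST_MULTIXACT_ID)
        earliest = FIRST_MULTIXACT_ID;
    return earliest;
}

/*
 * Guarded variant: returns false (nothing to truncate) when the
 * directory scan found no segment file, and leaves *earliest untouched.
 */
static bool
earliest_guarded(int earliest_existing_page, uint32_t *earliest)
{
    if (earliest_existing_page < 0)
        return false;

    *earliest = (uint32_t) earliest_existing_page * OFFSETS_PER_PAGE;
    if (*earliest < FIRST_MULTIXACT_ID)
        *earliest = FIRST_MULTIXACT_ID;
    return true;
}
```

Here earliest_unguarded(-1) yields 4294965248, the value seen above,
while earliest_guarded(-1, &e) returns false so the caller can skip
truncation.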

-- 
Thomas Munro
http://www.enterprisedb.com
