Re: CLOG read problem after pg_basebackup

Adrian Klaver <adrian.klaver@xxxxxxxxxxx> · Sat, 24 Jan 2015 07:26:41 -0800

On 01/23/2015 05:18 PM, David G Johnston wrote:
Petr Novak wrote
Three of them failed to start after pg_basebackup completed with:

FATAL:  could not access status of transaction 923709700
DETAIL:  Could not read from file "pg_clog/0370" at offset 237568: Success.

(The clog file differed in each case, of course.)

As for PG versions, one is 9.1.14 (on both master and replica) and the
other two are 9.2.9 (also on both).

To clarify, the pg_basebackup against the master failed with the above
message?

You confirmed that the archive did not contain the named clog file?

But when you went and checked the running cluster's pg_clog directory the
file was present?

In the initial post the OP said that the pg_clog file was present on
both the master and replica, and that copying the presumably updated file
from the master to the replica, after the pg_basebackup, 'cured' the
problem. This would seem to explain the "could not read from file ... at
offset" error. Initially the replica was looking for data in the pg_clog
file at an offset that existed only in the version of the file on the
master. Once the replica was provided with the updated file it was
happy. The question is how it got into that state. Your observations
below are better than anything I could come up with.
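
For what it is worth, the file name and offset in the error follow
directly from the transaction ID. Below is a minimal sketch of that
arithmetic, assuming the default 8 kB block size, 2 status bits per
transaction (4 transactions per byte) and 32 pages per clog segment
file. The function names are made up for illustration and are not
anything Postgres provides.

```python
# Sketch only: assumes a default build (BLCKSZ = 8192) and the 9.x
# clog layout of 4 transactions per byte, 32 pages per segment file.
BLCKSZ = 8192
CLOG_XACTS_PER_BYTE = 4
CLOG_XACTS_PER_PAGE = BLCKSZ * CLOG_XACTS_PER_BYTE  # 32768
SLRU_PAGES_PER_SEGMENT = 32


def clog_location(xid):
    """Return (segment file name, byte offset of the page) for an xid."""
    page = xid // CLOG_XACTS_PER_PAGE
    segment = page // SLRU_PAGES_PER_SEGMENT
    page_in_segment = page % SLRU_PAGES_PER_SEGMENT
    return "%04X" % segment, page_in_segment * BLCKSZ


def page_is_readable(file_size, offset):
    """True if a file of file_size bytes holds a full page at offset."""
    return file_size >= offset + BLCKSZ


segment, offset = clog_location(923709700)
print(segment, offset)  # 0370 237568
```

That reproduces pg_clog/0370 at offset 237568 for transaction
923709700. The "Success" at the end of the DETAIL line suggests the
read came back short without an OS error, which would fit a replica
copy of the file smaller than offset + 8192 bytes. Comparing the size
of the file on the master and on the replica against
page_is_readable() would confirm that.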

What was the timestamp of the file in the running cluster relative to the
start of the pg_basebackup?

Did you attempt another pg_basebackup against any of the failing servers -
i.e., is the error now a constant for the server or was it transient?

I am somewhat at a loss to explain how pg_basebackup works with pg_clog
given this quote from the wiki:

https://wiki.postgresql.org/wiki/Hint_Bits

"CLOG pages don't make their way out to disk until the internal CLOG buffers
are filled, at which point the least recently used buffer there is evicted
to permanent storage."

Either pg_clog should be course-corrected by WAL, in which case you
shouldn't get a fatal error if an incomplete clog file is found to exist, or
there must be something being done to avoid a race condition in this area.  If
that isn't happening then your error could potentially be explained - though
damn bad luck getting it on three servers...

The last observation leads one to wonder if there is some kind of transaction
volume or I/O difference that makes the failing servers special (more prone
to getting hit by said race condition)?

I may be just blowing smoke here but maybe it will spark an idea in someone
more knowledgeable.

David J.

--
Adrian Klaver
adrian.klaver@xxxxxxxxxxx