Re: Invalid headers and xlog flush failures

Bricklen Anderson <BAnderson@xxxxxxxxxxxx> · Fri, 04 Feb 2005 07:14:14 -0800

Tom Lane wrote:
Bricklen Anderson <BAnderson@xxxxxxxxxxxx> writes:

Tom Lane wrote:

But anyway, the evidence seems pretty clear that in fact end of WAL is
in the 73 range, and so those page LSNs with 972 and 973 have to be
bogus.  I'm back to thinking about dropped bits in RAM or on disk.

memtest86+ ran for over 15 hours with no errors reported.
e2fsck -c completed with no errors reported.

Hmm ... that's not proof your hardware is ok, but it at least puts the
ball back in play.

Any ideas on what I should try next? Considering that this db is not
in production yet, I _do_ have the liberty to rebuild the database if
necessary. Do you have any further recommendations?

If the database isn't too large, I'd suggest saving aside a physical
copy (eg, cp or tar dump taken with postmaster stopped) for forensic
purposes, and then rebuilding so you can get on with your own work.

One bit of investigation that might be worth doing is to look at every
single 8K page in the database files and collect information about the
LSN fields, which are the first 8 bytes of each page.
Do you mean this line from pg_filedump's results:

LSN:  logid     56 recoff 0x3f4be440      Special  8176 (0x1ff0)

If so, I've set up a shell script that looped all of the files and emitted that line.

It's not particularly elegant, but it worked. Again, that's assuming that it was the correct line.

I'll write a perl script to parse out the LSN values to see if any are greater than 116 (which I 
believe is the hex of 74?).

In case anyone wants the script that I ran to get the LSN:
#!/bin/sh

for FILE in /var/postgres/data/base/17235/*; do
        i=0
        echo $FILE >> test_file;
        while [ 1==1 ]; do
                str=`pg_filedump -R $i $FILE | grep LSN`;
                if [ "$?" -eq "1" ]; then
                        break
                fi
                echo "$FILE: $str" >> LSN_out;
                i=$((i+1));
        done
done

In a non-broken database all of these should be less than or equal to the current ending
WAL offset (which you can get with pg_controldata if the postmaster is
stopped).  We know there are at least two bad pages, but are there more?
Is there any pattern to the bad LSN values?  Also it would be useful to
look at each bad page in some detail to see if there's any evidence of
corruption extending beyond the LSN value.

			regards, tom lane

NB. I've recreated the database, and saved off the old directory (all 350 gigs of it) so I can dig 
into it further.

Thanks again for you help, Tom.

Cheers,

Bricklen

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

              http://archives.postgresql.org