Re: Invalid headers and xlog flush failures

Bricklen Anderson wrote:
> Tom Lane wrote:
>> Bricklen Anderson <BAnderson@xxxxxxxxxxxx> writes:
>>> Tom Lane wrote:
>>>> I would have suggested that maybe this represented on-disk data
>>>> corruption, but the appearance of two different but not-too-far-apart
>>>> WAL offsets in two different pages suggests that indeed the end of WAL
>>>> was up around segment 972 or 973 at one time.
>>>
>>> Nope, never touched pg_resetxlog.
>>> My pg_xlog list ranges from 000000010000007300000041 to
>>> 0000000100000073000000FE, with no breaks. There are also these:
>>> 000000010000007400000000 to 00000001000000740000000B

>> That seems like rather a lot of files; do you have checkpoint_segments
>> set to a large value, like 100?  The pg_controldata dump shows that the
>> latest checkpoint record is in the 73/41 file, so presumably the active
>> end of WAL isn't exceedingly far past that.  You've got 200 segments
>> prepared for future activity, which is a bit over the top IMHO.
>>
>> But anyway, the evidence seems pretty clear that in fact end of WAL is
>> in the 73 range, and so those page LSNs with 972 and 973 have to be
>> bogus.  I'm back to thinking about dropped bits in RAM or on disk.
>> IIRC these numbers are all hex, so the extra "9" could come from just
>> two bits getting turned on that should not be.  Might be time to run
>> memtest86 and/or badblocks.
>>
>>                     regards, tom lane
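
(A quick psql sanity check of both numbers here, offered as an
illustrative sketch only: the pg_xlog listing above spans
(0xFE - 0x41 + 1) + (0x0B + 1) = 202 segment files, which matches Tom's
"200 segments", and 0x973 differs from 0x073 in exactly two bits.)

    -- Count the listed segment files: 73/41..73/FE plus 74/00..74/0B.
    SELECT (x'FE'::int - x'41'::int + 1)
         + (x'0B'::int - x'00'::int + 1) AS segment_files;
    -- 190 + 12 = 202

    -- XOR the bogus segment number against the real one:
    -- 0x973 # 0x073 = 0x900 = binary 100100000000, exactly two set bits.
    SELECT (x'973'::int # x'073'::int)::bit(12) AS differing_bits;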


> Yes, checkpoint_segments is set to 100, although I can set it lower if
> you feel that is more appropriate. Currently, the system receives
> around 5-8 million inserts per day (across 3 primary tables), so I was
> leaning towards the "more is better" philosophy.
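
(For scale, an illustrative back-of-the-envelope figure: pg_xlog should
normally hold no more than 2 * checkpoint_segments + 1 segment files of
16MB each, so a setting of 100 commits roughly 3.2GB of disk to WAL.)

    -- Approximate pg_xlog disk ceiling with checkpoint_segments = 100.
    SELECT (2 * 100 + 1) * 16 AS max_pg_xlog_mb;  -- 3216MB, about 3.2GB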

> We ran e2fsck with the badblocks option last week and it didn't turn
> anything up, nor did a couple of passes with memtest. I will run a
> full-scale memtest and post any interesting results.

> I've also read that kill -9 on the postmaster is "not a good thing". I
> honestly can't vouch for whether that happened around the time this
> database was first created. It's possible, since this db started its
> life as a development db at 8r3, was bumped to 8r5, then moved to 8
> final, where it has become a dev-final db.

> Assuming that the memtest passes cleanly, as does another run of
> badblocks, do you have any more suggestions on how I should proceed?
> Should I run for a while with zero_damaged_pages set to true and
> accept the data loss, or just recreate the whole db from scratch?
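
(If it comes to the zero_damaged_pages route, a minimal sketch of the
usual procedure; the table name below is hypothetical. The setting is
superuser-only, and a sequential scan with it enabled zeroes any page
whose header is damaged, destroying the rows on that page.)

    -- Turns fatal "invalid page header" errors into warnings, zeroing
    -- each damaged page so the scan can continue; rows on a zeroed page
    -- are lost for good.
    SET zero_damaged_pages = on;
    SELECT count(*) FROM damaged_table;  -- hypothetical table name
    SET zero_damaged_pages = off;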


memtest86+ ran for over 15 hours with no errors reported, and e2fsck -c
likewise completed without errors.

Any ideas on what I should try next? Considering that this db is not in
production yet, I _do_ have the liberty to rebuild the database if
necessary, so I'd welcome any further recommendations.

thanks again,

Bricklen
