Re: 9.0.4 Data corruption issue

Ken Caruso <ken@xxxxxxxxx> · Tue, 19 Jul 2011 12:27:22 -0700

On Sun, Jul 17, 2011 at 3:04 AM, Cédric Villemain <cedric.villemain.debian@xxxxxxxxx> wrote:

2011/7/17 Ken Caruso <ken@xxxxxxxxx>:

>

>

> On Sat, Jul 16, 2011 at 2:30 PM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:

>>

>> Ken Caruso <ken@xxxxxxxxx> writes:

>> > Sorry, the actual error reported by CLUSTER is:

>>

>> > gpup=> cluster verbose tablename;

>> > INFO:  clustering "dbname.tablename"

>> > WARNING:  could not write block 12125253 of base/2651908/652397108

>> > DETAIL:  Multiple failures --- write error might be permanent.

>> > ERROR:  could not open file "base/2651908/652397108.1" (target block

>> > 12125253): No such file or directory

>> > CONTEXT:  writing block 12125253 of relation base/2651908/652397108

>>

>> Hmm ... it looks like you've got a dirty buffer in shared memory that

>> corresponds to a block that no longer exists on disk; in fact, the whole

>> table segment it belonged to is gone.  Or maybe the block or file number

>> in the shared buffer header is corrupted somehow.

>>

>> I imagine you're seeing errors like this during each checkpoint attempt?

>

> Hi Tom,

> Thanks for the reply.

> Yes, I tried a pg_start_backup() to force a checkpoint and it failed due to

> a similar error.

>

>>

>> I can't think of any very good way to clean that up.  What I'd try here

>> is a forced database shutdown (immediate-mode stop) and see if it starts

>> up cleanly.  It might be that whatever caused this has also corrupted

>> the back WAL and so WAL replay will result in the same or similar error.

>> In that case you'll be forced to do a pg_resetxlog to get the DB to come

>> up again.  If so, a dump and reload and some manual consistency checking

>> would be indicated :-(

>

> Before seeing this message, I restarted Postgres and it was able to get to a

> consistent state at which point I reclustered the db without error and

> everything appears to be fine. Any idea what caused this? Was it something

> to do with the Vacuum Full?

Block number 12125253 is bigger than any block we can find in

base/2651908/652397108.1

Should the table size be in the 100 GB range or the 2-3 GB range?

This should help decide: in the former case, at least one segment has

probably disappeared; in the latter, the shared buffers have become

corrupted.
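
To make the size reasoning above concrete, here is a minimal sketch of the block-to-segment arithmetic. It assumes the constants of a stock build (BLCKSZ = 8192 bytes, RELSEG_SIZE = 131072 blocks, i.e. 1 GB segment files); the function name locate_block is made up for illustration and is not a PostgreSQL API.

```python
# Assumed defaults of a stock PostgreSQL build; a custom build may differ.
BLCKSZ = 8192          # bytes per block
RELSEG_SIZE = 131072   # blocks per segment file (1 GB with 8 kB blocks)


def locate_block(block_number, relseg_size=RELSEG_SIZE, blcksz=BLCKSZ):
    """Return (segment_number, block_within_segment, byte_offset_in_relation).

    Hypothetical helper: it only does the arithmetic and reads no files.
    """
    segment, block_in_segment = divmod(block_number, relseg_size)
    return segment, block_in_segment, block_number * blcksz


if __name__ == "__main__":
    # Block number taken from the CLUSTER error message in this thread.
    segment, block_in_segment, offset = locate_block(12125253)
    print(f"segment file suffix: .{segment}")
    print(f"block within segment: {block_in_segment}")
    print(f"relation must be at least {offset / 2**30:.1f} GiB")
```

Under those assumptions the block would live in segment file .92, roughly 92.5 GiB into the relation, so the block number is only plausible for a table in the 100 GB range.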

The DB was in the 200GB-300GB range when this happened. What would cause the segment to go missing? Just wondering if there is any further action I should take like filing a bug or if this is a known issue.  Thanks for everyone's help.

-Ken

Ken, you didn't change RELSEG_SIZE, right? (It has to be changed in the

source code before compiling PostgreSQL yourself.)
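
One way to answer the RELSEG_SIZE question without digging up the build configuration is to look at what pg_controldata reports for the data directory. The sketch below parses two lines of that output and compares them with the stock values. The helper names are hypothetical, the stock values (8192 and 131072) are assumed, and the sample text is illustrative, not taken from Ken's server.

```python
# Values a stock build is assumed to report.
DEFAULT_BLOCK_SIZE = 8192
DEFAULT_BLOCKS_PER_SEGMENT = 131072


def parse_build_sizes(controldata_text):
    """Pick the block size and blocks-per-segment lines out of pg_controldata output."""
    sizes = {}
    for line in controldata_text.splitlines():
        label, _, value = line.partition(":")
        label = label.strip()
        if label == "Database block size":
            sizes["block_size"] = int(value)
        elif label == "Blocks per segment of large relation":
            sizes["blocks_per_segment"] = int(value)
    return sizes


def is_stock_build(sizes):
    """True only if both values were found and match the assumed defaults."""
    return (sizes.get("block_size") == DEFAULT_BLOCK_SIZE
            and sizes.get("blocks_per_segment") == DEFAULT_BLOCKS_PER_SEGMENT)


if __name__ == "__main__":
    # Illustrative excerpt; on a real system, capture the output of
    # `pg_controldata <data directory>` instead.
    sample = (
        "Database block size:                  8192\n"
        "Blocks per segment of large relation: 131072\n"
    )
    sizes = parse_build_sizes(sample)
    print(sizes, "stock build" if is_stock_build(sizes) else "custom build")
```

If the reported blocks-per-segment value differs from 131072, the segment arithmetic for block 12125253 has to be redone with the reported value.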

In both cases a hardware check is welcome, I believe.

--

Cédric Villemain               2ndQuadrant

http://2ndQuadrant.fr/     PostgreSQL: Expertise, Training and Support