Re: corruption issue after server crash - ERROR: unexpected chunk number 0

Mike Broers <mbroers@xxxxxxxxx> · Thu, 21 Nov 2013 16:30:50 -0600

Thanks for the response. fsync and full_page_writes are both on.
Our database runs on a managed hosting provider's VM host and SAN. I can ask them to provide hardware test results; do you have any specific diagnostics in mind? The crash was apparently caused by our VM host suddenly losing power. The only row that has produced the chunk error also migrated to both standby servers and, as previously stated, was fixed by reindexing the parent table on one of the standby servers after taking it out of recovery. Running vacuumdb -avz on this test copy produced no errors or warnings. I am also going to run pg_dumpall on this host to see whether any other rows are problematic.
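
For reference, the chunk error names a toast value and a toast table, and the toast table can be mapped back to the table that owns it through pg_class.reltoastrelid. Below is a minimal sketch in Python that pulls the fields out of a log line and builds that catalog lookup. The function names parse_toast_error and owner_lookup_sql are made up for illustration, and the pattern assumes the message format matches the error quoted in this thread.

```python
import re

# Pattern for the TOAST chunk error text as it appears in this thread.
# Assumption: other PostgreSQL versions may word the message differently.
TOAST_ERROR = re.compile(
    r"unexpected chunk number (?P<got>\d+) \(expected (?P<expected>\d+)\) "
    r"for toast value (?P<value>\d+) in (?P<toast_table>pg_toast_\d+)"
)


def parse_toast_error(line):
    """Return the fields of a TOAST chunk error, or None if the line does not match."""
    m = TOAST_ERROR.search(line)
    if m is None:
        return None
    return {
        "got": int(m.group("got")),
        "expected": int(m.group("expected")),
        "toast_value": int(m.group("value")),
        "toast_table": m.group("toast_table"),
    }


def owner_lookup_sql(toast_table):
    """Build a catalog query that names the table owning the given TOAST table."""
    # Only accept names of the form pg_toast_<digits> before pasting into SQL.
    if not re.fullmatch(r"pg_toast_\d+", toast_table):
        raise ValueError("not a TOAST table name: %r" % toast_table)
    return (
        "SELECT oid::regclass FROM pg_class "
        "WHERE reltoastrelid = 'pg_toast.%s'::regclass;" % toast_table
    )
```

Feeding it the error line from our logs and running the generated query in psql should print the parent table, which is the one to reindex.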

Is there something else I can run after the pg_dumpall to confirm we are more or less OK at the database level, or is there no way to be sure, so that a fresh initdb is required?
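
One thing I am considering in addition: pg_dumpall stops at the first row it cannot read, so a per-row scan could list every unreadable row in one pass. A rough sketch of the idea in Python follows. find_unreadable_rows and read_row are hypothetical names; read_row stands in for whatever fetches a single row in full by key through a database driver, and is not shown here.

```python
def find_unreadable_rows(keys, read_row):
    """Call read_row(key) for every key and collect the keys that raise.

    read_row is a caller-supplied callable (hypothetical here). It is assumed
    to fetch the complete row, so that any toasted columns are actually read.
    Returns a list of (key, error message) tuples in scan order.
    """
    bad = []
    for key in keys:
        try:
            read_row(key)
        except Exception as exc:
            # Record the failure and keep scanning the remaining keys.
            bad.append((key, str(exc)))
    return bad
```

With a real driver, read_row would presumably need to roll back after each failure before the next query. A clean scan would only show that every row can be read; it would not rule out other damage, such as in indexes.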

I am planning to run the reindex in production tonight during our maintenance window, and was hoping that if it works we would be out of the woods.

On Thu, Nov 21, 2013 at 3:56 PM, Kevin Grittner <kgrittn@xxxxxxxxx> wrote:

Mike Broers <mbroers@xxxxxxxxx> wrote:

> Hello we are running postgres 9.2.5 on RHEL6, our production
> server crashed hard and when it came back up our logs were
> flooded with:

> ERROR:  unexpected chunk number 0 (expected 1) for toast value 117927127 in pg_toast_19122

Your database is corrupted.  Unless you were running with fsync =
off or full_page_writes = off, that should not happen.  It is
likely to be caused by a hardware problem (bad RAM, a bad disk
drive, or network problems if your storage is across a network).

If it were me, I would stop the database service and copy the full
data directory tree.

http://wiki.postgresql.org/wiki/Corruption

If fsync or full_page_writes were off, your best bet is probably to
go to your backup.  If you don't go to a backup, you should try to
get to a point where you can run pg_dump, and dump and load to a
freshly initdb'd cluster.

If fsync and full_page_writes were both on, you should run hardware
diagnostics at your earliest opportunity.  When hardware starts to
fail, the first episode is rarely the last or the most severe.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company