Corruption with PITR restore

Jorge Torralba <jorge.torralba@xxxxxxxxx> · Tue, 12 Mar 2013 18:28:39 -0700

Experiencing serious issues with PITR restore.

We are running PostgreSQL 8.3.9 on CentOS 5.6. Archiving is turned on and writes to an NFS share with a very simple archive command:

"cp -i %p /path/to/nfsshare/%f </dev/null"
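
A minimal sketch of a more defensive archive step, assuming a hypothetical wrapper script is called from archive_command in place of the bare cp. It refuses to overwrite an existing segment, copies under a temporary name, and compares the copy with the original before renaming it into place. The function name, the .tmp suffix, and the script path are assumptions and not part of the setup described here, and the sketch does not force the data to stable storage on the NFS server.

```shell
# Hypothetical wrapper for archive_command; names and paths are illustrative.
# Assumed use in postgresql.conf:
#   archive_command = '/path/to/archive_wal.sh %p /path/to/nfsshare/%f'

archive_wal() {
    src=$1    # %p: path of the WAL segment to archive
    dest=$2   # full destination path inside the archive directory

    # Never overwrite an already archived segment; report failure instead.
    if [ -e "$dest" ]; then
        echo "archive_wal: $dest already exists" >&2
        return 1
    fi

    # Copy under a temporary name so that a partial file never appears
    # under the final segment name.
    tmp="$dest.tmp.$$"
    cp "$src" "$tmp" || { rm -f "$tmp"; return 1; }

    # Compare the copy with the original before publishing it.
    cmp -s "$src" "$tmp" || { rm -f "$tmp"; return 1; }

    mv "$tmp" "$dest"
}
```

To use it as a script, the file would end with a call such as archive_wal "$@", so that a non-zero exit status tells postgres the segment was not archived and must be retried.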

To take the base backup, we execute

select pg_start_backup('mylabel');

Once that returns successfully, we tar up the cluster directory.

When the tar completes, we run

select pg_stop_backup()

and go on our merry way.
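
The sequence above can be sketched as a single shell function. This is only an illustration: the function name and its arguments are hypothetical, psql is assumed to pick up its connection settings from the environment, and the label is assumed to contain no single quotes. It checks that pg_start_backup succeeded, always calls pg_stop_backup, and treats a tar exit status of 1 as acceptable, since GNU tar uses that status when files changed while they were being read, which is expected on a live cluster.

```shell
# Sketch of the start/tar/stop sequence; all names here are hypothetical.
base_backup() {
    label=$1      # backup label passed to pg_start_backup
    pgdata=$2     # cluster data directory
    tarfile=$3    # output tar file

    # Abort if the server did not acknowledge the start of the backup.
    psql -At -c "SELECT pg_start_backup('$label');" || return 1

    # Record tar's exit status without aborting yet, so that backup mode
    # is always ended below.
    tar_rc=0
    tar -cf "$tarfile" -C "$pgdata" . || tar_rc=$?

    # End backup mode even when tar failed.
    psql -At -c "SELECT pg_stop_backup();" || return 1

    # GNU tar: 0 = success, 1 = some files changed while being archived,
    # anything higher = fatal error.
    [ "$tar_rc" -le 1 ]
}
```

A call would look like base_backup mylabel /path/to/cluster /path/to/backup.tar, with both paths being placeholders.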

The other night we migrated to a new environment. We copied the tar file to the new environment and extracted it there. We then shut down the existing postgres in the old environment and copied the archived WAL files and the files in pg_xlog to the new server. This took place 3 days after the initial tar was taken, and by then we had about 1200 WAL files. We replaced the pg_xlog files in the new environment with the ones we had just copied from the shut-down server. Our recovery.conf simply pointed to the archived WAL directory, with no target time, and we started postgres. Sure enough, all the WAL files were replayed and we got the "database is ready to accept connections" message. We tested and everything looked fine.
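
A restore like this depends on every copied WAL segment being complete. As a cheap sanity check before starting recovery, the sketch below lists archived segments whose size differs from the expected segment size. It is not a diagnosis of the problem described here. The function name is hypothetical, and the 16 MB default is an assumption about how the server was built.

```shell
# Hypothetical pre-restore check: report archived WAL segments whose size
# is not the expected segment size (assumed default: 16 MB).
check_wal_sizes() {
    archive_dir=$1
    expected=${2:-16777216}
    bad=0
    for f in "$archive_dir"/*; do
        [ -f "$f" ] || continue
        name=${f##*/}
        # WAL segment names are exactly 24 uppercase hexadecimal characters;
        # skip .backup and .history files and anything else.
        case $name in
            *[!0-9A-F]*) continue ;;
        esac
        [ ${#name} -eq 24 ] || continue
        # Arithmetic expansion strips any padding some wc versions print.
        size=$(($(wc -c < "$f")))
        if [ "$size" -ne "$expected" ]; then
            echo "$name: $size bytes"
            bad=1
        fi
    done
    return "$bad"
}
```

Running check_wal_sizes /path/to/nfsshare prints one line per suspicious segment and returns non-zero if any were found.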

All hell broke loose the next day: missing chunk 0, unexpected chunk, bad siblings, chunks in the toast table screwed up, etc. It has been a nightmare. We could not even execute a pg_dumpall, and had to spend days looking for rows and updating them until pg_dump eventually worked. I turned on the old server for validation and the corruption was not there.

What could have caused this? Our wal_sync_method is set to the default, fdatasync. The original server was on a Red Hat cluster with a GFS file system, and the server could never shut down gracefully when the sysadmins shut it down; it was always an immediate shutdown. This was because of the cluster config, which we have since moved off of.

Any help would be appreciated.

Thanks!!!

JT