Having trouble restoring our backups

Bryan Murphy <bmurphy1976@xxxxxxxxx> · Fri, 12 Jun 2009 10:24:34 -0500

Hey guys, I'm having difficulty restoring some of our backups.  Luckily, I'm only trying to do this to bring up a copy of our database for testing purposes, but this still has me freaked out because it means we currently have no valid backups and are only running with a single warm spare.

Our primary database is on a RAID-10 that can't take snapshots and is very overworked, so we ship our wal files to a warm standby server.  Every day or two I log in to the warm standby and run the following commands:

1. xfs_freeze -f /srv   (this is where the entire postgres tree is mounted, no funny business with symlinks)
2. * take file system snapshot, wait about 30 seconds for snapshot to start running *
3. xfs_freeze -u /srv

I don't exactly know how the snapshotting works (it's an Amazon EBS volume), so I don't know if I should wait until the snapshotting is 100% complete before I unfreeze the volume.  This whole process can easily take 30 minutes to an hour, so I am also concerned that if I wait that long to unfreeze the volume I may cause an excessive backlog of wal files that are not getting applied to the warm spare.

Now, when I try to restore one of these snapshots, I do the following:

1. create new share from snapshot
2. mount new share in new Linux instance
3. start postgres, verify that it's running and is still in recovery mode
4. touch my go live file and bring the database up

I've done this successfully in the past.  Today, however, I'm running into this problem when I try to run some queries:

ERROR:  could not access status of transaction 237546265
DETAIL:  Could not open file "pg_clog/00E2": No such file or directory.

I tried creating the missing files last night using dd, and I was able to get the database to a point where I was able to run queries against it, however it was missing data that should have been there.  I tried again this morning with a different snapshot and I've run into the same problem again.

What am I doing wrong?  FYI, we're running 8.3.7.

Thanks,
Bryan