Recovery from PITR corrupted

Brian Fehrle <brianf@xxxxxxxxxxxxxxxxxxx> · Tue, 02 Oct 2012 18:13:41 -0600

Hi All,

INFO:
postgresql 9.0.4

I have a backup setup that is having an issue when I restore a backup 
for a development environment. I received some great help on the 
postgresql IRC channel on freenode, but have been unable to track down 
the exact cause thus far (some of my logging and backup files are now 
gone), so I thought I'd throw it out here to see if anyone has some 
additional input.

The backup system is as follows:

1. Master cluster exports WAL files and ships them to a backup server

2. The backup server then ships those WAL files to a warm standby server 
after they are 1 hour old (this is so that if a horrific data deletion 
happens, we can bring the warm standby online within an hour and be 
online without the lengthy recovery time).

3. Once per night, we bring down the warm standby with a 'pg_ctl stop -m 
fast'. We then verify that the warm standby database is OFFLINE and 
perform a backup of the data directory, shipping that off to the backup 
server.
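
For reference, the "verify OFFLINE" check in step 3 amounts to something like the sketch below. This is not the actual script. The function names are made up for illustration, and it assumes pg_controldata prints a "Database cluster state" line whose value is "shut down in recovery" for a cleanly stopped standby (or "shut down" for a cleanly stopped primary); check those strings against your own version.

```python
import subprocess

# States treated as "cleanly stopped". These strings are an assumption about
# what pg_controldata prints; verify them against your own server version.
CLEAN_STATES = {"shut down", "shut down in recovery"}


def cluster_state(controldata_text):
    """Return the value of the 'Database cluster state' line, or None."""
    for line in controldata_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Database cluster state":
            return value.strip()
    return None


def is_cleanly_stopped(controldata_text):
    """True only if the control file reports a clean shutdown."""
    return cluster_state(controldata_text) in CLEAN_STATES


def read_controldata(data_dir):
    """Run pg_controldata against data_dir and return its output as text."""
    result = subprocess.run(
        ["pg_controldata", data_dir],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

The idea is that the backup script would call is_cleanly_stopped(read_controldata(data_dir)) after 'pg_ctl stop' returns, and refuse to tar up the data directory if it comes back False.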

With this backup process, we've recovered several databases without 
issue, but today I've got one that is causing some issues.

Restore process:
1. Uncompress the tarball of the data directory

2. Stage WAL files in a location that pg_standby is looking for

3. Bring the cluster online, verify it's ingesting WAL files in standby 
mode (via pg_standby)

4. Create trigger file to signal the standby to come out of recovery mode.

After uncompressing the data directory from the warm standby backup and 
staging all the WAL files that were also archived, the database reached 
a consistent recovery state and came online accepting connections. I can 
connect and issue queries. I then started to set up a hot standby for 
this cluster, and when I executed pg_start_backup, I received the 
following error in the postgres logs:

Oct  2 20:05:44 localhost postgres[14030]: [1-1] user=,db= ERROR:  xlog flush request 79D6/2DB52998 is not satisfied --- flushed only to 79D5/DC000020
Oct  2 20:05:44 localhost postgres[14030]: [1-2] user=,db= CONTEXT:  writing block 9018 of relation base/2651908/1059795387
Oct  2 20:05:44 localhost postgres[22850]: [2-1] user=postgres,db=postgres ERROR:  checkpoint request failed
Oct  2 20:05:44 localhost postgres[22850]: [2-2] user=postgres,db=postgres HINT:  Consult recent messages in the server log for details
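
To make the two positions in the first ERROR easier to compare, here is a small sketch that parses them. It assumes they are the usual pair of hexadecimal numbers (log file ID / byte offset) and simply compares them as tuples; the function names are mine, not anything from postgres.

```python
def parse_xlog_location(text):
    """Split 'XXXX/YYYYYYYY' into a (log_id, offset) tuple of ints."""
    log_id, offset = text.strip().split("/")
    return int(log_id, 16), int(offset, 16)


def is_satisfied(requested, flushed):
    """True if WAL has been flushed at least up to the requested location."""
    return parse_xlog_location(flushed) >= parse_xlog_location(requested)


requested = "79D6/2DB52998"   # "xlog flush request" value from the ERROR
flushed = "79D5/DC000020"     # "flushed only to" value from the same ERROR

print(is_satisfied(requested, flushed))  # prints False
```

The requested location is in a later log file ID (79D6) than anything this cluster has flushed (79D5), so block 9018 appears to reference WAL that the recovered cluster does not have.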

I tracked down the table itself that it's pointing to, and it's a 
relatively small table. I can query all data from it without error. Just 
guessing at options, I executed a reindex on the table, and got the 
following output over 100 times a second:
WARNING:  concurrent delete in progress within table "mytablename"

(I also got the following a bit later on)
ERROR:  index "pg_depend_depender_index" contains unexpected zero page at block 40087
autovacuum: found orphan temp table "pg_temp_28"."#DB_7716_INITIAL_SIZE_CHECK" in database "mydatabase"

I am currently performing a new restore from a more recent backup (last 
night's), and will be collecting stats and logs as I go so that I don't 
lose them.

Any thoughts on what this could be, or on other useful data to collect 
during my second recovery attempt here?

Currently I'll be collecting:
* postgres logs of the warm standby as it comes down, and back online 
after backup is taken
* my logs and timestamps of when the backup took place
* postgres logs from the recovery database as it comes online
* pg_controldata output for (A) the master database, (B) the warm standby 
database, and (C) the recovery database, before and after I actually 
bring it online.
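
For the pg_controldata comparison, I am thinking of something along these lines to show only the fields that differ between two captures (for example the recovery database before and after it comes online). This is only a sketch: it assumes the output is plain "Label: value" lines, and the function names are made up.

```python
def parse_controldata(text):
    """Turn 'Label:   value' lines into a dict keyed by label."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so values that contain colons
        # (timestamps, for instance) stay intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def changed_fields(before_text, after_text):
    """Return {label: (before, after)} for every field that differs."""
    before = parse_controldata(before_text)
    after = parse_controldata(after_text)
    labels = sorted(set(before) | set(after))
    return {
        label: (before.get(label), after.get(label))
        for label in labels
        if before.get(label) != after.get(label)
    }
```

Feeding it the saved output from two pg_controldata runs gives a short list of what moved (cluster state, checkpoint locations and so on) instead of two full dumps to compare by eye.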

Thanks in advance,
- Brian F

--
Sent via pgsql-admin mailing list (pgsql-admin@xxxxxxxxxxxxxx)