The Problem of Applying Point-in-time Recovery

Shih Théo <galaxyshih@xxxxxxxxx> · Sat, 20 Jun 2015 18:11:56 +0800

Dear Sir/Madam, 
I am not sure whether this is the proper place to post my problem. If it is not, please forgive my ignorance and tell me where I should post it.

Recently I have been setting up point-in-time recovery on Debian with postgres 8.3 (for various reasons I have no chance to upgrade), but I have run into a problem. I followed the instructions in the official documentation step by step.

1. First, I modified postgresql.conf to enable WAL archiving and restarted postgres.
2. Then I made the base backup by simply creating a tar archive of the whole cluster data directory, ${PG_DATA}. I called pg_start_backup('label') before the tar step and pg_stop_backup() after it.
3. After that, I inserted some data into the database. 
4. Next, I simulated a corrupted database that needs to be recovered:
    4.0. stopped postgres
    4.1. moved the WALs from ${PG_DATA}/pg_xlog to another directory
    4.2. untarred the base backup and moved the data into ${PG_DATA} (overwriting it)
    4.3. created recovery.conf with the following configuration (note that I store the WALs on a remote host):
               restore_command = 'rsync -a host_user@host_ip:/path/to/remote/host/wal/%f %p'
               recovery_target_time = 'YYYY-mm-dd HH:MM:SS'
               recovery_target_timeline = 'value'          
    4.4. restarted postgres
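
For reference, the recovery.conf from step 4.3 can be generated rather than typed by hand each time the test is repeated. This is only a minimal sketch: the function names are hypothetical helpers, not part of postgres, and the sketch simply refuses values containing single quotes instead of guessing at the escaping rules. The host, path, and time values are the placeholders from the configuration above.

```python
def quote_conf_value(value):
    """Wrap a value in single quotes, as in the recovery.conf lines above."""
    if "'" in value or "\n" in value:
        # Quote escaping is deliberately not handled in this sketch.
        raise ValueError("unsupported character in value: %r" % value)
    return "'" + value + "'"


def render_recovery_conf(restore_command, target_time=None, target_timeline=None):
    """Return the text of a recovery.conf with the three settings from step 4.3."""
    # %f is the name of the WAL file to fetch, %p the path to copy it to.
    if "%f" not in restore_command or "%p" not in restore_command:
        raise ValueError("restore_command should reference both %f and %p")
    lines = ["restore_command = " + quote_conf_value(restore_command)]
    if target_time is not None:
        lines.append("recovery_target_time = " + quote_conf_value(target_time))
    if target_timeline is not None:
        lines.append("recovery_target_timeline = " + quote_conf_value(target_timeline))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values copied from the post; replace them with real ones.
    print(render_recovery_conf(
        "rsync -a host_user@host_ip:/path/to/remote/host/wal/%f %p",
        target_time="YYYY-mm-dd HH:MM:SS",
        target_timeline="value",
    ), end="")
```

Writing the returned text to ${PG_DATA}/recovery.conf before step 4.4 keeps every repetition of the test identical.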

At first, everything was fine. The recovery succeeded: I could see from the log that postgres restored the WALs, and I could see the data that I inserted in step 3 in the database, too. But when I performed the recovery repeatedly (that is, I repeated steps 4.0 to 4.4), postgres very often failed to recover. Here is the error message:

2015-06-18 20:22:02 GMT+8 LOG:  restored log file "00000001000000000000002E.00000020.backup" from archive
2015-06-18 20:22:03 GMT+8 LOG:  restored log file "00000001000000000000002E" from archive
2015-06-18 20:22:03 GMT+8 LOG:  unexpected pageaddr 0/2A000000 in log file 0, segment 46, offset 0
2015-06-18 20:22:03 GMT+8 LOG:  invalid checkpoint record
2015-06-18 20:22:03 GMT+8 FATAL:  could not locate required checkpoint record
2015-06-18 20:22:03 GMT+8 HINT:  If you are not restoring from a backup, try removing the file "/home/genie/db_mount_point/backup_label".
2015-06-18 20:22:03 GMT+8 LOG:  startup process (PID 658) exited with exit code 1
2015-06-18 20:22:03 GMT+8 LOG:  aborting startup due to startup process failure
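
The file name and page address in the log above can be decoded with a little arithmetic. The following is a minimal sketch, assuming the default 16 MB WAL segment size and a 24-hex-digit segment name (8 digits each for timeline, log file, and segment). The function names are hypothetical and not part of postgres.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default 16 MB segments


def parse_wal_segment_name(name):
    """Split a 24-hex-digit WAL file name into (timeline, log file, segment)."""
    if len(name) != 24:
        raise ValueError("expected 24 hex digits, got %r" % name)
    timeline = int(name[0:8], 16)
    log_file = int(name[8:16], 16)
    segment = int(name[16:24], 16)
    return timeline, log_file, segment


def parse_page_address(address):
    """Split an address like '0/2A000000' into (log file, segment, offset in segment)."""
    log_hex, offset_hex = address.split("/")
    log_file = int(log_hex, 16)
    offset = int(offset_hex, 16)
    return log_file, offset // WAL_SEGMENT_SIZE, offset % WAL_SEGMENT_SIZE


if __name__ == "__main__":
    print(parse_wal_segment_name("00000001000000000000002E"))  # (1, 0, 46)
    print(parse_page_address("0/2A000000"))                    # (0, 42, 0)
```

Under these assumptions the file name decodes to timeline 1, log file 0, segment 46, which matches the "log file 0, segment 46" in the message, while the page address 0/2A000000 is the start of segment 42. In other words, the first page of the restored file appears to carry the address of a different segment than the file name implies.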

I do not know exactly what caused the problem. Does it happen because I performed the recovery repeatedly? Please give me some suggestions.

Yours faithfully