Re: PITR problem

Erik Jones <erik@xxxxxxxxxx> · Mon, 28 Apr 2008 12:00:52 -0500

On Apr 26, 2008, at 5:11 PM, wstrzalka wrote:

I have a problem setting up PITR on the database.

I have archive_command set properly and logs are shipping OK. Archive
timeout is also set (5 min).

When I run pg_start_backup the WAL is at, let's say, position
0000000100000001000000D9. Then I start copying the database to the second
machine, which takes about 30 minutes. During that time the archive timeout
fires a few times and those files are shipped properly to the second
host. After the DB is successfully copied I call pg_stop_backup. At that
moment the WAL is at position 0000000100000001000000DE.

At that point I see on the second machine the WAL files from
0000000100000001000000D9 to 0000000100000001000000DE, as well as
0000000100000001000000D9.00000020.backup
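
To make the segment names above easier to read, here is a minimal sketch that splits a 24-hex-digit WAL file name into its three 8-digit fields (timeline, log file, segment). The helper name parse_wal_segment is made up for illustration and is not part of PostgreSQL. The decoded values agree with the "log file 1, segment 217" wording in the server log further down.

```python
import string

def parse_wal_segment(name):
    """Split a 24-hex-digit WAL segment name into (timeline, log, segment).

    Hypothetical helper for illustration only.
    """
    if len(name) != 24 or any(c not in string.hexdigits for c in name):
        raise ValueError("not a WAL segment file name: %r" % name)
    timeline = int(name[0:8], 16)   # first 8 hex digits: timeline ID
    log = int(name[8:16], 16)       # middle 8 hex digits: log file number
    segment = int(name[16:24], 16)  # last 8 hex digits: segment number
    return timeline, log, segment

print(parse_wal_segment("0000000100000001000000D9"))  # (1, 1, 217)
```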

The problem occurs when I try to start my standby server in
recovery mode (with pg_standby).

The output from pg_standby:
------------------------------------
Trigger file             : /tmp/pgsql.promote_trigger.5432
Waiting for WAL file     : 00000001.history
WAL file path            : /var/lib/pgsql/incoming_wal/00000001.history
Restoring to...          : pg_xlog/RECOVERYHISTORY
Sleep interval           : 5 seconds
Max wait interval        : 0 forever
Command for restore      : ln -s -f "/var/lib/pgsql/incoming_wal/00000001.history" "pg_xlog/RECOVERYHISTORY"
Keep archive history     : 0000000100000001000000DB and later
running restore          : OK

Trigger file             : /tmp/pgsql.promote_trigger.5432
Waiting for WAL file     : 0000000100000001000000D9.00000020.backup
WAL file path            : /var/lib/pgsql/incoming_wal/0000000100000001000000D9.00000020.backup
Restoring to...          : pg_xlog/RECOVERYHISTORY
Sleep interval           : 5 seconds
Max wait interval        : 0 forever
Command for restore      : ln -s -f "/var/lib/pgsql/incoming_wal/0000000100000001000000D9.00000020.backup" "pg_xlog/RECOVERYHISTORY"
Keep archive history     : 0000000100000001000000DB and later
running restore          : OK

Trigger file             : /tmp/pgsql.promote_trigger.5432
Waiting for WAL file     : 0000000100000001000000D9
WAL file path            : /var/lib/pgsql/incoming_wal/0000000100000001000000D9
Restoring to...          : pg_xlog/RECOVERYXLOG
Sleep interval           : 5 seconds
Max wait interval        : 0 forever
Command for restore      : ln -s -f "/var/lib/pgsql/incoming_wal/0000000100000001000000D9" "pg_xlog/RECOVERYXLOG"
Keep archive history     : 0000000100000001000000DB and later
running restore          : OK
removing "/var/lib/pgsql/incoming_wal/0000000100000001000000D9"
removing "/var/lib/pgsql/incoming_wal/0000000100000001000000DA"

--------------------------------------------------------------------------------------------------------

The first time I start the standby, the postgres process goes down
and the log says:
--------------------------------------------------------------------------------------------------------
restored log file "0000000100000001000000D9.00000020.backup" from archive
could not open file "pg_xlog/0000000100000001000000D9" (log file 1, segment 217): No such file or directory
invalid checkpoint record
could not locate required checkpoint record
If you are not restoring from a backup, try removing the file "/var/lib/pgsql/data/backup_label".
startup process (PID 19201) was terminated by signal 6: Aborted
aborting startup due to startup process failure
--------------------------------------------------------------------------------------------------------

When I try to start PG a second time, it just gets stuck waiting
for ...000D9

In my opinion the problem is that, when starting, the standby PostgreSQL
wants to recover WAL 0000000100000001000000D9 but deletes it first,
because the keep archive history (%r) parameter is set to
0000000100000001000000DB.
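
The suspicion above can be checked with a small model. This is only a sketch under an assumption: that the cleanup removes every plain 24-character segment file whose name sorts before the keep point. That assumption is consistent with the two "removing" lines in the pg_standby output, but it is not taken from the pg_standby source. The function name files_removed is made up for illustration.

```python
def files_removed(archive_files, keep_from):
    """Return the segment files a cleanup at keep point keep_from would delete.

    Assumed rule (simplified model): only plain 24-character segment names
    are candidates, and a candidate is deleted if it sorts before keep_from.
    """
    return sorted(f for f in archive_files if len(f) == 24 and f < keep_from)

# Files seen on the second machine, as listed in the message.
archive = [
    "0000000100000001000000D9",
    "0000000100000001000000DA",
    "0000000100000001000000DB",
    "0000000100000001000000DC",
    "0000000100000001000000DD",
    "0000000100000001000000DE",
    "0000000100000001000000D9.00000020.backup",
]

removed = files_removed(archive, "0000000100000001000000DB")
print(removed)

# The segment recovery must start from is among the deleted files.
backup_start = "0000000100000001000000D9"
print(backup_start in removed)
```

Under this model the keep point lies past the segment named in the .backup file, so the first segment that recovery needs is gone before it can be replayed.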

Is it a bug, or am I missing something? I can repeat the scenario with
this big DB. However, it does not happen in exactly the same
environment when playing with a smaller cluster (where copying the
cluster takes less time than archive_timeout).

What is the full pg_standby command string (restore_command=....) in
your recovery.conf? It sounds like you have pg_standby set to delete
archived WALs, possibly a little too aggressively. Do you have the -k
flag set in the pg_standby call in your restore_command?
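
As a quick way to answer the -k question, here is a minimal sketch that pulls the -k value out of a restore_command string. The command string below is hypothetical: it is assembled from the paths and settings visible in the pg_standby output and is not the poster's actual recovery.conf line. The helper name keep_files_setting is also made up.

```python
import shlex

def keep_files_setting(restore_command):
    """Return the integer passed to -k in a pg_standby call, or None if absent."""
    args = shlex.split(restore_command)
    for i, arg in enumerate(args):
        if arg == "-k" and i + 1 < len(args):
            return int(args[i + 1])   # form: -k 255
        if arg.startswith("-k") and len(arg) > 2:
            return int(arg[2:])       # form: -k255
    return None

# Hypothetical restore_command, for illustration only.
cmd = ("pg_standby -l -s 5 -t /tmp/pgsql.promote_trigger.5432 "
       "/var/lib/pgsql/incoming_wal %f %p %r")

print(keep_files_setting(cmd))
print(keep_files_setting(cmd.replace("-l", "-l -k 255")))
```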

Erik Jones

DBA | Emma®