pitr replica dies on startup

Jeff Frost <jeff@xxxxxxxxxxxxxxxxxxxxxx> · Fri, 31 Aug 2007 15:44:57 -0700 (PDT)

Hi guys, I've inherited a PITR continuous recovery master/standby server pair. 
The continuous recovery and loading of the xlogs seems to work fine, but when 
I opted to test the replica bring up, it falls over with signal 6.

Here's an excerpt from the log with log levels set up to debug5:

DEBUG:  max_safe_fds = 984, usable_fds = 1000, already_open = 6
LOG:  database system was interrupted at 2007-08-31 15:09:40 PDT
LOG:  starting archive recovery
LOG:  restore_command = "/data/pg/data/recovery.py /data/pg/wal/%f "%p""
DEBUG:  executing restore command "/data/pg/data/recovery.py 
/data/pg/wal/00000001.history "pg_xlog/RECOVERYHISTORY""
DEBUG:  could not restore file "00000001.history" from archive: return code 0
DEBUG:  executing restore command "/data/pg/data/recovery.py 
/data/pg/wal/00000001000000000000004C.00000020.backup 
"pg_xlog/RECOVERYHISTORY""
LOG:  restored log file "00000001000000000000004C.00000020.backup" from 
archive
DEBUG:  executing restore command "/data/pg/data/recovery.py 
/data/pg/wal/00000001000000000000004C "pg_xlog/RECOVERYXLOG""
LOG:  restored log file "00000001000000000000004C" from archive
LOG:  checkpoint record is at 0/4C000020
LOG:  redo record is at 0/4C000020; undo record is at 0/0; shutdown FALSE
LOG:  next transaction ID: 0/1222; next OID: 24576
LOG:  next MultiXactId: 1; next MultiXactOffset: 0
LOG:  automatic recovery in progress
LOG:  redo starts at 0/4C000070
DEBUG:  executing restore command "/data/pg/data/recovery.py 
/data/pg/wal/00000001000000000000004D "pg_xlog/RECOVERYXLOG""
LOG:  restored log file "00000001000000000000004D" from archive
DEBUG:  recovery restart point at 0/4D000020
CONTEXT:  xlog redo checkpoint: redo 0/4D000020; undo 0/0; tli 1; xid 0/1232; 
oid 24576; multi 1; offset 0; online
DEBUG:  executing restore command "/data/pg/data/recovery.py 
/data/pg/wal/00000001000000000000004E "pg_xlog/RECOVERYXLOG""
LOG:  restored log file "00000001000000000000004E" from archive
DEBUG:  recovery restart point at 0/4E000020
CONTEXT:  xlog redo checkpoint: redo 0/4E000020; undo 0/0; tli 1; xid 0/1242; 
oid 24576; multi 1; offset 0; online
DEBUG:  executing restore command "/data/pg/data/recovery.py 
/data/pg/wal/00000001000000000000004F "pg_xlog/RECOVERYXLOG""
LOG:  restored log file "00000001000000000000004F" from archive
DEBUG:  recovery restart point at 0/4F000020
CONTEXT:  xlog redo checkpoint: redo 0/4F000020; undo 0/0; tli 1; xid 0/1252; 
oid 24576; multi 1; offset 0; online
DEBUG:  executing restore command "/data/pg/data/recovery.py 
/data/pg/wal/000000010000000000000050 "pg_xlog/RECOVERYXLOG""
DEBUG:  could not restore file "000000010000000000000050" from archive: return 
code 256
LOG:  could not open file "pg_xlog/000000010000000000000050" (log file 0, 
segment 80): No such file or directory
LOG:  redo done at 0/4F000070
DEBUG:  executing restore command "/data/pg/data/recovery.py 
/data/pg/wal/00000001000000000000004F "pg_xlog/RECOVERYXLOG""
DEBUG:  could not restore file "00000001000000000000004F" from archive: return 
code 256
PANIC:  could not open file "pg_xlog/00000001000000000000004F" (log file 0, 
segment 79): No such file or directory
DEBUG:  reaping dead processes
LOG:  startup process (PID 31211) was terminated by signal 6
LOG:  aborting startup due to startup process failure
DEBUG:  proc_exit(1)
DEBUG:  shmem_exit(1)
DEBUG:  exit(1)

You can see the rest of the log file here:

http://pgsql.privatepaste.com/841mOR5gzm

It seems like everything is happy, except it seems to ask for the 4F log file 
more than once.

Directory permissions look good:

total 45
-rw------- 1 postgres postgres     4 Aug 31 11:13 PG_VERSION
-rw------- 1 postgres postgres   167 Aug 31 15:09 backup_label.old
drwx------ 5 postgres postgres   120 Aug 31 14:59 base
drwx------ 2 postgres postgres   768 Aug 31 15:08 global
drwx------ 2 postgres postgres    72 Aug 31 11:13 pg_clog
-rw------- 1 postgres postgres  3279 Aug 31 11:16 pg_hba.conf
-rw------- 1 postgres postgres  1460 Aug 31 11:14 pg_ident.conf
drwx------ 4 postgres postgres    96 Aug 31 11:13 pg_multixact
drwx------ 2 postgres postgres    72 Aug 31 11:13 pg_subtrans
drwx------ 2 postgres postgres    48 Aug 31 11:13 pg_tblspc
drwx------ 2 postgres postgres    48 Aug 31 11:13 pg_twophase
drwx------ 3 postgres postgres   112 Aug 31 15:24 pg_xlog
-rw-r--r-- 1 postgres postgres 15370 Aug 31 14:57 postgresql.conf
-rw------- 1 postgres postgres    35 Aug 31 15:09 postmaster.opts
-rw-r--r-- 1 postgres postgres    67 Aug 31 13:40 recovery.conf
-rwxr-xr-x 1 postgres postgres  1384 Aug 31 13:40 recovery.py

I verified that the files are really being copied into 
pg_xlog/RECOVERYXLOG.

This is 8.2.4 on Gentoo, but I observed the same behavior with 8.2.3 on 
Gentoo.

We've md5sum'd the postgres binary, diffed the output of pg_config and 
everything is the same.

I tried starting up postgres in a newly initdb'd directory and it starts up 
fine.  I also tried using the conf files from the 'broken' directory in this 
new directory and it starts up fine, but if I try to replicate into this 
directory, postgres exits with signal 6 just the same.

Conversely, using a default postgresql.conf file in the 'broken' directory 
leaves it just as broken.

The master can start/stop just fine with no problems.

What else to look at?  I do have a core file, back trace can be found here:

http://pgsql.privatepaste.com/8f0WDAAzz9

Any suggestions much appreciated!

--
Jeff Frost, Owner 	<jeff@xxxxxxxxxxxxxxxxxxxxxx>
Frost Consulting, LLC 	http://www.frostconsultingllc.com/
Phone: 650-780-7908	FAX: 650-649-1954

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
      choose an index scan if your joining column's datatypes do not
      match