Shutting down a warm standby database in 8.2beta3

Stephen Harris <lists@xxxxxxxxxx> · Fri, 17 Nov 2006 14:57:19 -0500

I'm using 8.2beta3 but I'm asking here before posting to the devel
lists as suggested by that lists guidelines..

First the question, because it might be simple and I'm stupid.  However
I'll then go into detail in case I'm not so silly.

In a database which is in recovery mode waiting on an external script
to provide a log file, how do we do a clean shutdown so that we can then
continue recovery at a later point?

Detail:

I've set up a simple log shipping test between two databases.  This
works well; I can perform transactions on the live server, see the
archive log get shipped to the second machine, see the recovery thread
pick it up and apply the logs and wait for the next.  If I signal the
recovery program to stop then then standby database finally comes live
and I can see my work.

This is all good stuff and I like it.  I used to do this with oracle
many years ago and am now please that we can force log switches (even
better, the DBMS itself can do it!), which was a big thing missing in
earlier versions.

However, there will be times when the standby databse needs shutting
down (eg hardware maintenance, OS patches, whatever) and then bringing
back up in recovery mode.  

I tried the following idea:
  pg_ctl stop -W -m smart
  signal_recovery_program to exit(1)
  wait-for-pid-file
  remove any pg_xlog files 
  recreate recover.conf

This has a problem; because the database will temporarily be "live".

LOG:  restored log file "000000010000000100000038" from archive
LOG:  received smart shutdown request
LOG:  could not open file "pg_xlog/000000010000000100000039" (log file 1, segment 57): No such file or directory
LOG:  redo done at 1/38000070
LOG:  restored log file "000000010000000100000038" from archive
LOG:  archive recovery complete
LOG:  database system is ready
LOG:  shutting down
LOG:  database system is shut down

This means it may have started it's own transaction history and so the
archives from the primary database no longer match.  Mostly this has
minimal impact and the system recovers

eg
LOG:  restored log file "00000001000000010000003B" from archive
LOG:  invalid xl_info in primary checkpoint record
LOG:  using previous checkpoint record at 1/3B000020
LOG:  redo record is at 1/3B000020; undo record is at 0/0; shutdown FALSE
LOG:  next transaction ID: 0/2004; next OID: 26532
LOG:  next MultiXactId: 1; next MultiXactOffset: 0
LOG:  automatic recovery in progress
LOG:  redo starts at 1/3B000070
LOG:  restored log file "00000001000000010000003C" from archive
LOG:  restored log file "00000001000000010000003D" from archive
LOG:  restored log file "00000001000000010000003E" from archive

The 2nd line is a little worrying.

However, occaisionally the system can't recover:

LOG:  restored log file "000000010000000100000039" from archive
LOG:  invalid record length at 1/39000070
LOG:  invalid primary checkpoint record
LOG:  restored log file "000000010000000100000039" from archive
LOG:  invalid xl_info in secondary checkpoint record
PANIC:  could not locate a valid checkpoint record
LOG:  startup process (PID 6893) was terminated by signal 6
LOG:  aborting startup due to startup process failure
LOG:  logger shutting down

I know log 1/39.... is good because if I bring up an older backup and
replay the logs then it goes through cleanly.

Doing a shutdown "immediate" isn't to clever because it actually leaves
the recovery threads running

LOG:  restored log file "00000001000000010000003E" from archive
LOG:  received immediate shutdown request
LOG:  restored log file "00000001000000010000003F" from archive

Oops!

So the question is...  how to cleanly shutdown a recovery instance?

-- 

rgds
Stephen