"Stand-in" server recovery techniques

"Kevin Grittner" <Kevin.Grittner@xxxxxxxxxxxx> · Wed, 22 Aug 2007 15:17:08 -0500

I'm looking for advice on how to best switch from warm standby processing to
"stand-in" production use in our (rather unusual) environment.

THE ENVIRONMENT

We have 72 counties spread around the state, each with their own database,
which contains the official data for that county's court system.  We do PITR
backups to the file server in each county, and replicate the files (both
base backups and WAL files) back to a single central machine which hosts a
"farm" of warm standby PostgreSQL instances -- one for each county, all on
the same machine.  This serves to validate the integrity of the files
brought back from the counties.

Base backups are done weekly, and the WAN speed to a county depends
primarily on the size of its base backup -- we try to get a speed sufficient
to bring back the base backup within 24 hours.  We force WAL files to be
written at least once per hour, although many counties generate them much
more frequently during the work day.

It turns out that 72 warm standby instances can keep a machine pretty busy,
so we won't want to use that machine for production.  We have two machines
ready to use for temporary "stand-ins" over the WAN in the event of a
catastrophic failure of a county machine.  We will get a database going on
one of these, in production mode.  This would only be done after recovery
efforts on the county machine have failed, and that machine has been powered
down.

We will bring the warm standby machine as current as we can using WAL files,
copy it to a stand-in machine, start our application middle tier and "top
off" the stand-in database with transactions from the transaction repository
our framework maintains (normally current to within a second).

Once this stand-in is running, we will want to become the source of WAL
files which can be applied to the warm standby on "the farm".

THE QUESTION

How do we create the PostgreSQL instance on the stand-in box?

I see four possibilities:

(1)  Restore the latest base backup and apply all WAL files available.
This is likely to be the slowest option.

(2)  Kick the warm standby on "the farm" into production mode, shut down
the instance, and then copy the instance directory.  This should be
relatively quick and safe, but has the down side of needing to restart
the warm standby from the latest base backup afterwards, if that is even
possible.  It seems like we might need to make a fresh base backup from the
(stopped) instance on the warm standby farm.

(3)  Stop the warm standby while it is in recovery mode, copy the instance
directory, and restart it.  On the stand-in box, have the script specified
by recovery.conf just "exit 1" to kick it immediately into production mode.
As long as the WAL files generated in this situation work with the old warm
standby, I don't see a down side, but I'm not sure if that is a safe
assumption.

(4)  Capture information about where the warm standby is at and attempt a
PITR-style copy of the instance while the warm stanby is running.  This one
seems riskier than (3) without significant benefit.

So, I'm leaning toward option (3).  Does that sound safe and workable?  Will
the WAL files from the stand-in work with the old warm standby?  Did I miss
any good alternatives?

Thanks,

-Kevin

---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

                http://www.postgresql.org/about/donate