On Wed, 20 Aug 2008, Tom Lane wrote:

> Greg Smith <gsmith@xxxxxxxxxxxxx> writes:
>> You also don't want to be the guy who has to explain why the database is
>> taking hours to come back up again after it crashed and has 4000 WAL
>> segments to replay, because archiving failed for a long time and prevented
>> proper checkpoints (ask Robert Treat if you don't believe me, he also once
>> was that guy).
>
> Say what?  Archiver failure can't/shouldn't prevent checkpointing.

Shouldn't, sure.  The wacky case I was alluding to, the one Robert ran
into, involved the system not checkpointing anymore and just piling the
archive files up.  While I think it's safe to say that was all a hardware
problem, stuff like that makes me nervous.

It is true that archiver failure prevents *normal* checkpointing, where
WAL files get recycled rather than piling up.  I know that shouldn't make
any difference, but I've also been through two similarly awful situations
resulting from odd archiver problems that seemed mysterious at the time
(staring at the source later cleared up what really happened), and those
left me even more paranoid than usual when working in this area.

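As an aside, one cheap way to notice that kind of pile-up before it turns
into a crisis is to watch how many WAL segments are still waiting on the
archiver.  A minimal sketch, assuming a stock layout under $PGDATA (adjust
the path for your install):

    # Count segments the archiver hasn't processed yet; a steadily
    # growing number means archive_command is failing.
    ls $PGDATA/pg_xlog/archive_status/*.ready 2>/dev/null | wc -l

Graph that number and alarm on it, and neither of the situations above
gets to sit unnoticed for long.
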
The stance I've adopted is that anything involving uncertain network
resources should get moved outside of the code the database itself runs.
Any time you're following a different path than the usual one through the
server code (in this case exercising the archive failure and resubmission
section), I see that as an opportunity to run into more obscure bugs;
that's just not code that gets run/tested as often.  It also minimizes the
amount of software the admin wrote that has to be right in order for the
database to keep running (bugs in the archive_command script are really
bad).

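To make that concrete, here is a minimal sketch of the setup I mean; the
directory and host names are placeholders.  archive_command does nothing
but a dumb local copy that can hardly fail, while a completely separate
job run from cron pushes files over the network, retrying on its own
schedule without ever blocking the archiver:

    # postgresql.conf: archive only to a local spool directory
    archive_command = 'test ! -f /var/lib/pgsql/wal_spool/%f && cp %p /var/lib/pgsql/wal_spool/%f'

    # Cron job (every minute or so): ship the spool to the backup host,
    # removing only the files that transferred successfully.
    rsync -a --remove-source-files /var/lib/pgsql/wal_spool/ \
          backup-host:/archive/wal/

If the network or the backup host goes away, the database just keeps
filling a local directory it can write to quickly, and all of the retry
logic lives somewhere you can test and restart independently of the
server.
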
--
* Greg Smith gsmith@xxxxxxxxxxxxx http://www.gregsmith.com Baltimore, MD