On May 7, 2007, at 3:10 PM, Thomas F. O'Connell wrote:
I'm attempting to design a postgres system whereby an authoritative
primary server simultaneously feeds continuous archives to a number
of warm standby servers that live both on the local network and on
remote networks.
The sticking point in my current thinking about such a system is
what to do in the event that any of an array of possible nodes
becomes unreachable. I would expect a custom archive_command to
have the intelligence about network reachability and to report a
nonzero status if it was unable to submit an archive to any
particular node.
The way I understand it, postgres would then resubmit the file that
caused the nonzero status, which, if connectivity has been
restored, is no problem for the node that caused the nonzero status
in the first place. But then the issue becomes what to do with the
nodes that were fine when the nonzero status.
From the docs <http://www.postgresql.org/docs/8.2/static/continuous-
archiving.html#BACKUP-ARCHIVING-WAL>:
"It is advisable to test your proposed archive command to ensure
that it indeed does not overwrite an existing file, and that it
returns nonzero status in this case. We have found that cp -i does
this correctly on some platforms but not others. If the chosen
command does not itself handle this case correctly, you should add
a command to test for pre-existence of the archive file."
What is the advised remedy for this scenario in general? And then
what is it if nonzero status is returned by archive_command because
the file already exists on nodes that stayed up after a scenario
where nonzero status is returned because one or more nodes became
unreachable?
AFAIK the bit about -i / not overwriting files is just a safety
measure to ensure you don't accidentally set up the archive_command
to over-write WAL files for a working backend, or some other PITR
backup. As long as you're certain you've got all your paths setup
correctly it should be safe to drop the -i bit.
A follow-on question is: Does it become the responsibility of
archive_command in a scenario like this to track which files have
been archived on which nodes? Is there any introspective way for a
standby server to know that a file has been archived by primary? If
not, is it safe to reply on using sequential numbering of WAL files
for implicit introspection? I don't see any functions that provide
introspection of this nature. I ask because it seems like network-
to-network failures are a common enough occurrence that some
mechanism for archive verification is a must-have. I'm just trying
to determine how much of that functionality I'll have to build
myself...
I don't think you'd have to have the archive_command track each copy
of each WAL file... but you might want to do that anyway, just for
status information. If only we had a database to store that info
in... ;)
As for WAL file numbering, even if there were some things you could
deduce from the file numbers, I wouldn't want to bet on that. Future
changes to the WAL code could change the naming scheme, which could
result in subtle breakage to your system (granted, odds of that
happening are low, but still...)
--
Jim Nasby jim@xxxxxxxxx
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)