Re: High Availability with Postgres

John R Pierce <pierce@xxxxxxxxxxxx> · Mon, 21 Jun 2010 12:39:19 -0700

On 06/21/10 12:23 PM, Dimitri Fontaine wrote:
> John R Pierce <pierce@xxxxxxxxxxxx> writes:
>
>>> Two DB servers will be using a common external storage (with raid).
>>
>> This is also one of the only postgres HA configurations that won't lose
>> /any/ committed transactions on a failure.  Most all PITR/WAL
>> replication/Slony/etc configs, the standby storage runs several seconds
>> behind realtime.
>
> I'm not clear on what error case it protects against, though. Either the
> data is ok and a single PostgreSQL system will restart fine, or the data
> isn't and you're hosed the same with or without the second system.
>
> What's left is hardware failure that didn't compromise the data. I
> didn't see much hardware failure yet, granted, but I'm yet to see a
> motherboard, some RAM or a RAID controller failing in a way that leaves
> behind data you can trust.

In most of the HA clusters I've seen, the RAID controllers are in the
SAN, not in the hosts, and they have their own failover with shared
write cache. They also make extensive use of ECC, so things like
double-bit memory errors are detected and treated as a failure. The
sorts of high-end SANs used in these kinds of systems have five-nines
(99.999%) reliability through extensive use of redundancy: dual-ported
disks, mirrored caches, fully redundant everything.
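
To put a rough number on that reliability figure (99.999% availability, "five nines") and on what redundancy buys you, here is a back-of-the-envelope sketch in Python. The 99.9% per-component figure is an assumed example value, not a measurement from any real SAN, and the formula assumes the redundant parts fail independently, which real hardware only approximates.

```python
# Back-of-the-envelope availability arithmetic (illustrative numbers only).

MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability: float) -> float:
    """Expected unavailable minutes per year for a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def redundant_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant components, assuming independent
    failures and that the group is up while at least one copy is up."""
    return 1.0 - (1.0 - single) ** copies


if __name__ == "__main__":
    five_nines = 0.99999
    print(f"five nines: {downtime_minutes_per_year(five_nines):.2f} min/year")

    # Assumed example: one component that is 99.9% available on its own.
    pair = redundant_availability(0.999, 2)
    print(f"two redundant 99.9% parts: {pair:.6f}")
```

Five nines works out to a little over five minutes of downtime a year, and under the independence assumption, pairing two 99.9% parts gets you to roughly six nines for that component.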

Ditto, the servers used in these sorts of clusters have ECC memory, so
memory failure should be detected rather than passed on blindly in the
form of corrupted data. Server-grade CPUs, especially the RISC ones,
have extensive ECC internally on their caches, data buses, etc., so any
failure there is detected rather than allowed to corrupt data. Other
failure modes include things like failing fans (which will be detected,
resulting in a server shutdown if too many fail) and power supply
failure (the PSUs are redundant, but I've seen the power-combining
circuitry fail). Any of these sorts of failures will result in a
failover without corrupting the data.

And of course, there are intentional planned failovers to do OS
maintenance: you patch the standby system, fail over to it and verify
it's good, then patch the other system.
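
The order of operations in that maintenance routine can be sketched as a toy simulation in Python. Everything here (`Node`, `Cluster`, `rolling_patch`) is a hypothetical in-memory model made up for illustration, not the API of any real cluster manager; in practice the patching, failover and verification steps are whatever your cluster software and runbook provide.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    """Hypothetical cluster member; not a real cluster-manager API."""
    name: str
    patched: bool = False
    healthy: bool = True


class Cluster:
    """Toy active/standby pair sharing one storage volume."""

    def __init__(self, active: Node, standby: Node) -> None:
        self.active = active
        self.standby = standby

    def patch(self, node: Node) -> None:
        # Only ever patch the node that is not serving the database.
        if node is self.active:
            raise RuntimeError(f"refusing to patch active node {node.name}")
        node.patched = True

    def fail_over(self) -> None:
        # Swap roles; the shared storage follows the active role.
        if not self.standby.healthy:
            raise RuntimeError("standby is not healthy; aborting failover")
        self.active, self.standby = self.standby, self.active


def rolling_patch(cluster: Cluster) -> List[str]:
    """Patch the standby, fail over, verify, then patch the old active."""
    log = []
    cluster.patch(cluster.standby)
    log.append(f"patched {cluster.standby.name}")

    cluster.fail_over()
    log.append(f"failed over to {cluster.active.name}")

    # Stand-in for a real verification step (queries, monitoring, etc.).
    if not cluster.active.healthy:
        raise RuntimeError("new active node failed verification")
    log.append(f"verified {cluster.active.name}")

    cluster.patch(cluster.standby)
    log.append(f"patched {cluster.standby.name}")
    return log


if __name__ == "__main__":
    for step in rolling_patch(Cluster(Node("db1"), Node("db2"))):
        print(step)
```

The point of the ordering is that the node serving the database is never the one being patched, and the old primary is only touched after the newly patched node has been verified.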

We had a large HA system at an overseas site fail over once due to
flooding in the primary computer room caused by a sprinkler system
failure upstairs. The SAN was mirrored to a SAN in the second DC (fiber
interconnected), and the backup server was also in the second DC across
campus, so it all failed over gracefully. This particular system was
large Sun hardware and big EMC storage, and it was running Oracle rather
than Postgres. We've also had several big UPS failures at various sites,
and HVAC failures too, over a 15-year period.

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)