Re: High Availability with Postgres

On 06/21/10 12:23 PM, Dimitri Fontaine wrote:
> John R Pierce<pierce@xxxxxxxxxxxx>  writes:
>> Two DB servers will be using a common external storage (with raid).
>> This is also one of the only postgres HA configurations that won't lose
>> /any/ committed transactions on a failure.  Most all PITR/WAL
>> replication/Slony/etc configs, the standby storage runs several seconds
>> behind realtime.
>
> I'm not clear on what error case it protects against, though. Either the
> data is ok and a single PostgreSQL system will restart fine, or the data
> isn't and you're hosed the same with or without the second system.
>
> What's left is hardware failure that didn't compromise the data. I
> didn't see much hardware failure yet, granted, but I'm yet to see a
> motherboard, some RAM or a RAID controller failing in a way that leaves
> behind data you can trust.

In most of the HA clusters I've seen, the RAID controllers are in the SAN, not in the hosts, and they have their own failover with a shared write cache. They also make extensive use of ECC, so things like double-bit memory errors are detected and treated as a failure. The sorts of high-end SANs used in these kinds of systems have five-nines (99.999%) reliability, achieved through extensive redundancy: dual-ported disks, mirrored caches, fully redundant everything.
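
For a sense of scale, here is a rough sketch of what a given number of nines allows in downtime per year. It assumes "five nines" means 99.999% availability averaged over a 365.25-day year; the function name is made up for illustration, and vendors may define and measure the figure differently.

```python
# Rough downtime budget for "N nines" of availability.
# Assumption: availability is averaged over a 365.25-day year.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes


def downtime_minutes_per_year(nines: int) -> float:
    """Return the allowed downtime per year, in minutes, for N nines."""
    unavailability = 10 ** (-nines)  # e.g. 5 nines -> 0.00001
    return unavailability * MINUTES_PER_YEAR


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"{n} nines: {downtime_minutes_per_year(n):.2f} minutes/year")
```

Under those assumptions, five nines works out to a little over five minutes of downtime per year.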

Likewise, the servers used in these sorts of clusters have ECC memory, so a memory failure should be detected rather than passed on blindly as corrupted data. Server-grade CPUs, especially the RISC ones, have extensive ECC internally on their caches, data buses, etc., so any failure there is detected rather than allowed to corrupt data. Other failure modes include failing fans (which will be detected, resulting in a server shutdown if too many fail) and power supply failure (the PSUs are redundant, but I've seen the power-combining circuitry fail). Any of these sorts of failures will result in a failover without corrupting the data.

And of course, there are intentional, planned failovers for OS maintenance: you patch the standby system, fail over to it and verify it's good, then patch the other system.
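
The ordering of that maintenance procedure can be sketched as a toy model. Everything here is hypothetical (the `Node` class, `rolling_patch`, and the health-check callback are invented for illustration); a real cluster manager does the fencing, storage handoff, and service checks that this glosses over.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Node:
    """Hypothetical cluster member; not a real cluster-manager object."""
    name: str
    active: bool
    patched: bool = False


def rolling_patch(active: Node, standby: Node,
                  health_check: Callable[[Node], bool]) -> bool:
    """Patch the standby, fail over to it, verify, then patch the old active.

    Returns True if both nodes end up patched, False if we failed back.
    """
    standby.patched = True                        # 1. patch the idle node
    active.active, standby.active = False, True   # 2. planned failover
    if not health_check(standby):                 # 3. verify the new active
        active.active, standby.active = True, False  # fail back, stop here
        return False
    active.patched = True                         # 4. patch the now-idle node
    return True


if __name__ == "__main__":
    a, b = Node("db1", active=True), Node("db2", active=False)
    ok = rolling_patch(a, b, health_check=lambda node: True)
    print(ok, a, b)
```

The point of the ordering is that the node serving traffic is never the one being patched, and the unpatched node stays available to fail back to until the patched one has been verified.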

We had a large HA system at an overseas site fail over once due to flooding in the primary computer room, caused by a sprinkler system failure upstairs. The SAN was mirrored to a SAN in the second DC (fiber interconnected), and the backup server was also in the second DC across campus, so it all failed over gracefully. This particular system was large Sun hardware and big EMC storage, and it was running Oracle rather than Postgres. We've had several big UPS failures at various sites too, and likewise HVAC failures, over a 15-year period.



--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)