Re: Tutorials on high availability Postgresql setup?

Scott Marlowe <scott.marlowe@xxxxxxxxx> · Thu, 7 Oct 2010 01:24:10 -0600

On Thu, Oct 7, 2010 at 12:27 AM, Andy <angelflow@xxxxxxxxx> wrote:
> Is there any tutorials or detailed instructions on how to set up HA postgresql & failover? The documentation (http://www.postgresql.org/docs/9.0/interactive/warm-standby-failover.html) on this topics is pretty scarce.
>
> The scenario I'm most interested in is this:
>
> 2 servers - a master and a hot standby. All writes are sent to master, reads are split between master and hot standby.

To have true redundancy, you need 3 servers.  Just saying.  Otherwise
when one goes down, no more redundancy.

> 1) If the hot standby goes down, how do I redirect reads to the master?

Have a config file for your app that tells it where to go for reads
and writes.  Change the config file to point reads at a different db
if a read slave fails.  What constitutes a failed read slave is kind
of a business decision, so you'll likely have to write your own code
to decide what being down means.

> 2) If the master fails
>    -how do I automatically promote the standby to master and send all reads/writes to the new master?

First you need to decide if you actually want automated failovers.
I've seen automated failovers cause as many problems as they were
supposed to fix, but it can be done.  Keep in mind that on a two db
system, failing over means you lose redundancy.  If your cluster fails
over on a lot of false positives, that's a lot of time with no
redundancy.  If your script isn't written with having only one node in
mind, it might try to failover a second time with no read slave to
promote to master.

Also, you're going to have to come up with what constitutes a failed
master.  30 seconds non-responsive?  5 minutes?  An hour?    If the
problem is that the write master is simply overloaded, then failing
over isn't gonna solve anything, as the now newly promoted master is
going to collapse as well under even heavier load.  It might have been
better to adjust the load factors used to determine where read queries
go to take load off of the master, or to change a setting in your app
that reduces load on the master.  With an overloaded write master,
then failover, then overloaded even worse new write master you've got
a site down, no redundancy, and you need to rebuild your old master as
a read slave to handle the load.

To start with I do not recommend doing automatic failovers.  Have a
system in place where your DBA / SA can promote a slave to master in
one or two easy steps, and if / when the master truly fails, then run
that script.  A human can make that decision with far more care than a
piece of code.

>    -what happens when the old master comes back up? Do I need to so anything to make it catches up to the new master?

You can't let the old master come back up as thinking it's the master
as well.  You have to re-establish replication to it as a slave.
Again, this is usually not automated, at least not at first.  The old
master needs to be "shot in the head" so to speak before it comes back
up, or your app may start writing to it instead of or as well as the
new master, and now you've got split-brain problems.

In short automated failover is complicated to get right, and if you
get it wrong the cost of the consequences can far worse than the 5 or
10 minutes of downtime required for a manual switch-over.  First write
scripts that automate most of the task for your application and db
farm.  Test those scripts as much as you can on a test farm.  Then run
them when needed by hand when things go wrong.  If or when you're
certain you've got all the bugs worked out and all the possible
failure scenarios worked out, you can start testing automated
failover.

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general