Re: Tutorials on high availability Postgresql setup?

Andy <angelflow@xxxxxxxxx> · Thu, 7 Oct 2010 00:59:46 -0700 (PDT)

Ah thanks for the explanation. I was hoping for an automated setup without the need to get paged 24/7.

So HA is still as hard as I thought it would be. I was hoping that with 9.0 things would be easier.  

--- On Thu, 10/7/10, Scott Marlowe <scott.marlowe@xxxxxxxxx> wrote:

> From: Scott Marlowe <scott.marlowe@xxxxxxxxx>
> Subject: Re:  Tutorials on high availability Postgresql setup?
> To: "Andy" <angelflow@xxxxxxxxx>
> Cc: pgsql-general@xxxxxxxxxxxxxx
> Date: Thursday, October 7, 2010, 3:24 AM
> On Thu, Oct 7, 2010 at 12:27 AM, Andy <angelflow@xxxxxxxxx> wrote:
> > Are there any tutorials or detailed instructions on how to set up HA
> > PostgreSQL & failover? The documentation
> > (http://www.postgresql.org/docs/9.0/interactive/warm-standby-failover.html)
> > on this topic is pretty scarce.
> >
> > The scenario I'm most interested in is this:
> >
> > 2 servers - a master and a hot standby. All writes are sent to master,
> > reads are split between master and hot standby.
> 
> To have true redundancy, you need 3 servers.  Just saying.  Otherwise
> when one goes down, no more redundancy.
> 
> > 1) If the hot standby goes down, how do I redirect reads to the master?
> 
> Have a config file for your app that tells it where to go for reads and
> writes.  Change the config file to point reads at a different db if a
> read slave fails.  What constitutes a failed read slave is kind of a
> business decision, so you'll likely have to write your own code to
> decide what being down means.
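
[A minimal sketch of the config-driven routing described above, in Python.
The Routing class and the host names "db-master" and "db-standby" are
hypothetical placeholders, not anything from this thread. Deciding when a
read slave counts as down is left to the caller.]

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Routing:
    """Where the app sends writes and reads (hypothetical config object)."""
    write_host: str
    read_hosts: List[str] = field(default_factory=list)

    def mark_read_host_down(self, host: str) -> None:
        """Stop sending reads to `host`; fall back to the master if none remain."""
        self.read_hosts = [h for h in self.read_hosts if h != host]
        if not self.read_hosts:
            self.read_hosts = [self.write_host]


# Placeholder host names; reads are split between master and standby.
routing = Routing(write_host="db-master", read_hosts=["db-master", "db-standby"])
routing.mark_read_host_down("db-standby")
print(routing.read_hosts)  # ['db-master']
```

[In practice the same object would probably be loaded from and written back
to the app's config file, so that a human or a script can change it in one
place.]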
> 
> > 2) If the master fails
> >    -how do I automatically promote the standby to master and send all
> >     reads/writes to the new master?
> 
> First you need to decide if you actually want automated failovers.
> I've seen automated failovers cause as many problems as they were
> supposed to fix, but it can be done.  Keep in mind that on a two db
> system, failing over means you lose redundancy.  If your cluster fails
> over on a lot of false positives, that's a lot of time with no
> redundancy.  If your script isn't written with having only one node in
> mind, it might try to failover a second time with no read slave to
> promote to master.
> 
> Also, you're going to have to come up with what constitutes a failed
> master.  30 seconds non-responsive?  5 minutes?  An hour?  If the
> problem is that the write master is simply overloaded, then failing
> over isn't gonna solve anything, as the now newly promoted master is
> going to collapse as well under even heavier load.  It might have been
> better to adjust the load factors used to determine where read queries
> go to take load off of the master, or to change a setting in your app
> that reduces load on the master.  With an overloaded write master,
> then failover, then overloaded even worse new write master you've got
> a site down, no redundancy, and you need to rebuild your old master as
> a read slave to handle the load.
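
[One way to turn "what constitutes a failed master" into code, as a rough
sketch. The function names, the default of 3 attempts and the 10 second
interval are all assumptions chosen for illustration. The probe is
whatever check you decide on (for example a trivial query with a timeout)
and is passed in by the caller. Note that a probe like this cannot tell a
dead master from an overloaded one, which is exactly the case described
above where failing over makes things worse.]

```python
import time
from typing import Callable


def master_looks_down(
    probe: Callable[[], bool],
    attempts: int = 3,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True only if `attempts` consecutive probes all fail."""
    for i in range(attempts):
        if probe():
            return False  # a single good answer means the master is alive
        if i < attempts - 1:
            sleep(interval_s)  # wait before the next attempt
    return True


def may_fail_over(master_down: bool, standbys_left: int) -> bool:
    """Refuse to fail over when there is no standby left to promote."""
    return master_down and standbys_left > 0


# Simulated probe: fails twice, then answers. No real waiting in the demo.
answers = iter([False, False, True])
print(master_looks_down(lambda: next(answers), sleep=lambda s: None))  # False
print(may_fail_over(master_down=True, standbys_left=0))  # False
```

[The second function covers the two-node case mentioned earlier: after one
failover there is nothing left to promote, so the script should stop and
page a human instead.]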
> 
> To start with I do not recommend doing automatic failovers.  Have a
> system in place where your DBA / SA can promote a slave to master in
> one or two easy steps, and if / when the master truly fails, then run
> that script.  A human can make that decision with far more care than a
> piece of code.
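
[A sketch of what such a "one easy step" script might look like on 9.0,
where a standby ends recovery once the file named by trigger_file in its
recovery.conf appears. The function name and the trigger path are
hypothetical; the path has to match whatever your recovery.conf actually
says, and the script is assumed to run on the standby host. The
`confirmed` flag stands in for the human decision.]

```python
import tempfile
from pathlib import Path


def promote_standby(trigger_file: Path, confirmed: bool) -> bool:
    """Create the trigger file, but only after a human has confirmed."""
    if not confirmed:
        return False  # nothing touched; the standby keeps replaying WAL
    trigger_file.touch()  # the standby notices the file and ends recovery
    return True


# Demo against a temporary directory instead of a real data directory.
with tempfile.TemporaryDirectory() as d:
    trigger = Path(d) / "failover.trigger"  # hypothetical file name
    print(promote_standby(trigger, confirmed=False), trigger.exists())  # False False
    print(promote_standby(trigger, confirmed=True), trigger.exists())   # True True
```

[Repointing the application at the new master is a separate step and is
not covered by this sketch.]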
> 
> >    -what happens when the old master comes back up? Do I need to do
> >     anything to make it catch up to the new master?
> 
> You can't let the old master come back up thinking it's the master
> as well.  You have to re-establish replication to it as a slave.
> Again, this is usually not automated, at least not at first.  The old
> master needs to be "shot in the head" so to speak before it comes back
> up, or your app may start writing to it instead of or as well as the
> new master, and now you've got split-brain problems.
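
[A very rough guard along these lines, assuming a 9.0-style setup where a
server starts as a standby only if its data directory contains a
recovery.conf with standby_mode turned on. The function name is
hypothetical. This only checks the config file before someone starts the
old master; it does not fence anything and does not verify that the data
directory has been resynced from the new master.]

```python
import re
import tempfile
from pathlib import Path

# Matches an uncommented "standby_mode = 'on'" style line.
STANDBY_ON = re.compile(r"^\s*standby_mode\s*=\s*'?(on|true|yes|1)'?", re.I | re.M)


def safe_to_start_old_master(data_dir: Path) -> bool:
    """True only if data_dir has a recovery.conf that enables standby_mode."""
    conf = data_dir / "recovery.conf"
    return conf.is_file() and bool(STANDBY_ON.search(conf.read_text()))


# Demo against a temporary directory standing in for the data directory.
with tempfile.TemporaryDirectory() as d:
    data_dir = Path(d)
    print(safe_to_start_old_master(data_dir))  # False: no recovery.conf yet
    (data_dir / "recovery.conf").write_text("standby_mode = 'on'\n")
    print(safe_to_start_old_master(data_dir))  # True
```

[A startup wrapper could refuse to start the server when this returns
False, so a rebooted old master stays down until someone has rebuilt it as
a slave.]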
> 
> In short automated failover is complicated to get right, and if you
> get it wrong the cost of the consequences can be far worse than the 5
> or 10 minutes of downtime required for a manual switch-over.  First
> write scripts that automate most of the task for your application and
> db farm.  Test those scripts as much as you can on a test farm.  Then
> run them when needed by hand when things go wrong.  If or when you're
> certain you've got all the bugs worked out and all the possible
> failure scenarios worked out, you can start testing automated
> failover.
> 

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)