Ah thanks for the explanation. I was hoping for an automated setup without the need to get paged 24/7. So HA is still as hard as I thought it would be. I was hoping that with 9.0 things would be easier. --- On Thu, 10/7/10, Scott Marlowe <scott.marlowe@xxxxxxxxx> wrote: > From: Scott Marlowe <scott.marlowe@xxxxxxxxx> > Subject: Re: Tutorials on high availability Postgresql setup? > To: "Andy" <angelflow@xxxxxxxxx> > Cc: pgsql-general@xxxxxxxxxxxxxx > Date: Thursday, October 7, 2010, 3:24 AM > On Thu, Oct 7, 2010 at 12:27 AM, Andy > <angelflow@xxxxxxxxx> > wrote: > > Is there any tutorials or detailed instructions on how > to set up HA postgresql & failover? The documentation > (http://www.postgresql.org/docs/9.0/interactive/warm-standby-failover.html) > on this topics is pretty scarce. > > > > The scenario I'm most interested in is this: > > > > 2 servers - a master and a hot standby. All writes are > sent to master, reads are split between master and hot > standby. > > To have true redundancy, you need 3 servers. Just > saying. Otherwise > when one goes down, no more redundancy. > > > 1) If the hot standby goes down, how do I redirect > reads to the master? > > Have a config file for your app that tells it where to go > for reads > and writes. Change the config file to point reads at > a different db > if a read slave fails. What constitutes a failed read > slave is kind > of a business decision, so you'll likely have to write your > own code > to decide what being down means. > > > 2) If the master fails > > -how do I automatically promote the standby to > master and send all reads/writes to the new master? > > First you need to decide if you actually want automated > failovers. > I've seen automated failovers cause as many problems as > they were > supposed to fix, but it can be done. Keep in mind > that on a two db > system, failing over means you lose redundancy. If > your cluster fails > over on a lot of false positives, that's a lot of time with > no > redundancy. If your script isn't written with having > only one node in > mind, it might try to failover a second time with no read > slave to > promote to master. > > Also, you're going to have to come up with what constitutes > a failed > master. 30 seconds non-responsive? 5 > minutes? An hour? If the > problem is that the write master is simply overloaded, then > failing > over isn't gonna solve anything, as the now newly promoted > master is > going to collapse as well under even heavier load. It > might have been > better to adjust the load factors used to determine where > read queries > go to take load off of the master, or to change a setting > in your app > that reduces load on the master. With an overloaded > write master, > then failover, then overloaded even worse new write master > you've got > a site down, no redundancy, and you need to rebuild your > old master as > a read slave to handle the load. > > To start with I do not recommend doing automatic > failovers. Have a > system in place where your DBA / SA can promote a slave to > master in > one or two easy steps, and if / when the master truly > fails, then run > that script. A human can make that decision with far > more care than a > piece of code. > > > -what happens when the old master comes back up? > Do I need to so anything to make it catches up to the new > master? > > You can't let the old master come back up as thinking it's > the master > as well. You have to re-establish replication to it > as a slave. > Again, this is usually not automated, at least not at > first. The old > master needs to be "shot in the head" so to speak before it > comes back > up, or your app may start writing to it instead of or as > well as the > new master, and now you've got split-brain problems. > > In short automated failover is complicated to get right, > and if you > get it wrong the cost of the consequences can far worse > than the 5 or > 10 minutes of downtime required for a manual > switch-over. First write > scripts that automate most of the task for your application > and db > farm. Test those scripts as much as you can on a test > farm. Then run > them when needed by hand when things go wrong. If or > when you're > certain you've got all the bugs worked out and all the > possible > failure scenarios worked out, you can start testing > automated > failover. > -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general