In response to Markus Schiltknecht <markus@xxxxxxxxxx>:

> Hi,
>
> Bill Moran wrote:
> > I'm curious as to how Postgres-R would handle a situation where the
> > constant throughput exceeded the processing speed of one of the nodes.
>
> Well, what do you expect to happen? This case is easily detectable, but
> I can only see two possible solutions: either stop the node which is too
> slow or stop accepting new transactions for a while.

It appears I miscommunicated my point. I'm not expecting Postgres-R to
break the laws of physics or anything; I'm just curious how it reacts.
This is the difference between software that will be really great one
day and software that is great now.

Great now would mean the system would notice that it's too far behind
and Do The Right Thing automatically. I'm not exactly sure what The
Right Thing is, but my first guess would be to force the hopelessly slow
node out of the cluster. I expect this would be non-trivial, as you'd
have to have a way to ensure the problem was isolated to a single node
(or a few), and not just the whole cluster getting hit with unexpected
traffic.

> This technique is not meant to allow nodes to lag behind several
> thousands of transactions - that should better be avoided. Rather it's
> meant to decrease the commit delay necessary for synchronous
> replication.

Of course not; that's exactly why the behaviour in that non-ideal
situation is so interesting. How does Postgres-R fail? PostgreSQL fails
wonderfully: a hardware crash will usually result in a system that can
recover without operator intervention. In a system like Postgres-R, the
failure scenarios are more numerous, and probably more complicated.

> > I can see your system working if it's just spike loads and the slow
> > nodes can catch up during slow periods, but I'm wondering about the
> > scenarios where an admin has underestimated the hardware requirements
> > and one or more nodes is unable to keep up.
>
> Please keep in mind that replication per se does not speed your
> database up; rather, it adds a layer of reliability, which *costs* some
> performance. To increase the transactional throughput you would need to
> add partitioning to the mix. Or you could try to make use of the gained
> reliability and abandon WAL - you won't need that as long as at least
> one replica is running - that should increase the single node's
> throughput and therefore the cluster's throughput, too.

I understand. I'm not asking it to do something it's not designed to do.
At least, I don't _think_ I am.

-- 
Bill Moran
http://www.potentialtech.com
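
To make the evict-vs-throttle heuristic discussed above concrete, here is a
minimal sketch in C. Everything in it is hypothetical -- the struct, the
thresholds, and the function names are invented for illustration and do not
correspond to any actual Postgres-R code or API. It assumes lag is measured
as transactions that have been committed remotely but not yet applied
locally.

/*
 * Hypothetical illustration only: a replication manager could run a
 * check like this periodically for each node, using a snapshot of
 * every node's apply lag.
 */
#include <stdlib.h>

typedef struct
{
    int     node_id;
    long    apply_lag;      /* txns received but not yet applied (assumed metric) */
} node_lag_t;

#define ABSOLUTE_LAG_LIMIT  10000   /* invented threshold */
#define EVICT_FACTOR        10      /* node must be 10x worse than the median */

typedef enum { NODE_OK, NODE_EVICT, CLUSTER_THROTTLE } lag_action;

static int
cmp_lag(const void *a, const void *b)
{
    long    la = ((const node_lag_t *) a)->apply_lag;
    long    lb = ((const node_lag_t *) b)->apply_lag;

    return (la > lb) - (la < lb);
}

/*
 * Decide what to do about one node.  If it is far behind while the rest
 * of the cluster keeps up, force it out; if most nodes are behind, the
 * cluster as a whole is overloaded and the response is to throttle the
 * admission of new transactions instead.
 */
lag_action
check_node(node_lag_t *nodes, int n_nodes, int node_id)
{
    long    median,
            mine = 0;
    int     i;

    qsort(nodes, n_nodes, sizeof(node_lag_t), cmp_lag);
    median = nodes[n_nodes / 2].apply_lag;

    for (i = 0; i < n_nodes; i++)
        if (nodes[i].node_id == node_id)
            mine = nodes[i].apply_lag;

    if (mine < ABSOLUTE_LAG_LIMIT)
        return NODE_OK;

    if (median < ABSOLUTE_LAG_LIMIT && mine > median * EVICT_FACTOR)
        return NODE_EVICT;      /* this node alone is hopelessly behind */

    return CLUSTER_THROTTLE;    /* everyone is behind: stop admitting work */
}

Comparing a node's lag against the cluster median, rather than against an
absolute limit alone, is what lets such a check distinguish one hopelessly
slow node (which could be forced out) from a cluster-wide overload (where
the only honest option is to stop accepting new transactions for a while).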