Re: Geographic High-Availability/Replication

Markus Schiltknecht <markus@xxxxxxxxxx> · Tue, 28 Aug 2007 01:13:05 +0200

Hello Bill,

Bill Moran wrote:
It appears as if I miscommunicated my point.  I'm not expecting
PostgreSQL-R to break the laws of physics or anything, I'm just
curious how it reacts.  This is the difference between software
that will be really great one day, and software that is great now.

Agreed. As Postgres-R is still a prototype, it *currently* does not
handle the situation at all. But I'm thankful for this discussion, as
it helps me figure out how Postgres-R *should* react. So, thank you
for pointing this out.

Great now would mean the system would notice that it's too far behind
and Do The Right Thing automatically.  I'm not exactly sure what The
Right Thing is, but my first guess would be force the hopelessly
slow node out of the cluster.  I expect this would be non-trivial,
as you'd have to have a way to ensure it was a problem isolated to
a single (or few) nodes, and not just the whole cluster getting hit
with unexpected traffic.

Hm.. yeah, that's a tricky decision to make. For a start, I'd be in
favor of just informing the administrator about the delay and letting
him take care of the problem (as currently done with 'disk full'
conditions), instead of trying to do something clever automatically.
(This seems to be much more PostgreSQL-like, too).
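
To make the "inform the administrator" idea a bit more concrete, here
is a minimal Python sketch. It is not Postgres-R code: the function
names, the per-node lag dictionary, the 30 second threshold and the
"three times the cluster median" isolation rule are all assumptions
chosen for illustration. It only warns about nodes whose lag is both
large and isolated, and never removes a node from the cluster.

```python
import logging
from statistics import median

log = logging.getLogger("replication.lag")


def find_lagging_nodes(lag_seconds, threshold=30.0, isolation_factor=3.0):
    """Return the nodes whose lag is large AND isolated.

    lag_seconds: dict mapping a (hypothetical) node name to its lag in seconds.
    A node counts as lagging only if it exceeds the threshold and is also
    well above the cluster median, so that a cluster-wide slowdown is not
    blamed on individual nodes.
    """
    if not lag_seconds:
        return []
    cluster_median = median(lag_seconds.values())
    lagging = []
    for node, lag in sorted(lag_seconds.items()):
        if lag > threshold and lag > isolation_factor * cluster_median:
            lagging.append(node)
    return lagging


def report_lag(lag_seconds, threshold=30.0):
    """Warn the administrator; take no automatic action."""
    lagging = find_lagging_nodes(lag_seconds, threshold)
    for node in lagging:
        log.warning("node %s is %.0f s behind the cluster",
                    node, lag_seconds[node])
    # Everyone is slow: report it as a cluster-wide condition instead.
    if not lagging and lag_seconds and min(lag_seconds.values()) > threshold:
        log.warning("whole cluster is more than %.0f s behind", threshold)
    return lagging
```

Comparing against the median is just one simple way to tell a single
slow node apart from the whole cluster being hit by unexpected traffic;
a real implementation would likely need something more robust.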

Of course not, that's why the behaviour when that non-ideal situation
occurs is so interesting.  How does PostgreSQL-R fail?  PostgreSQL
fails wonderfully: A hardware crash will usually result in a system
that can recover without operator intervention.  In a system like
PostgreSQL-R, the failure scenarios are more numerous, and probably
more complicated.

I agree that there are more failure scenarios, although fewer of them
are critical to the complete system.

IMO, a node which is too slow should not be considered a failure, but 
rather a system limitation (possibly due to unfortunate configuration), 
much like out of memory or disk space conditions. Forcing such a node to 
go down could have unwanted side effects on the other nodes (e.g.
increased read-only traffic) *and* does not solve the real problem.

Again, thanks for pointing this out. I'll think more about some issues, 
especially similar corner cases like this one. Single-node disk full 
would be another example. Possibly also out of memory conditions?

Regards

Markus
