Re: I give up

Kevin Anderson wrote:
Not sure what you mean by 3 to 1 using an IP tie-breaker.  How are you maintaining quorum without qdisk as a voting entity?

I have three nodes. If one fails, the other two are expected to maintain quorum and continue. I would really like a second failure to keep going on its own (last man standing). For this to work I would need to set expected votes to 1 and make sure the correct node wins the ensuing fencing race.
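For concreteness, a sketch of what I mean by setting expected votes, per cman_tool(8) (adjust for your version):

    # on the surviving node, drop the quorum requirement by hand
    cman_tool expected -e 1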

Case two. I remove one node from the cluster to maintain it. Now I have a two-node cluster. Same issues as above. Luci wants to set two_node = 1 in this case instead of just dealing with expected votes = 1. I haven't tested this because I'm testing all this with node 2 and node 3 while the future node 1 is currently our production server.
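For reference, what luci writes is the special two-node mode in /etc/cluster/cluster.conf, roughly this (per the cman docs, two_node requires expected_votes to be 1):

    <cman two_node="1" expected_votes="1"/>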

The ping gateway test/IP tie-breaker was my way of reliably running down to last man standing.
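(If anyone wants to reproduce it: one common way to express an IP tie-breaker is a qdiskd heuristic, something like the sketch below. The gateway address and the score/interval numbers are placeholders for my setup, not a recommendation.)

    <quorumd interval="2" tko="10" votes="1" label="quorum">
      <heuristic program="ping -c1 -w1 192.168.1.254" score="1" interval="2"/>
    </quorumd>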


During a network partition test, expecting a fencing race whose outcome I control, one node would not fence the other and did not take over the service until the other node attempted to rejoin the cluster (way too late).
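(The usual way to bias such a race is to give the preferred node's fence device a head start. A sketch, assuming a fence agent that accepts a delay parameter, which is agent-specific and may not exist in this release; check your agent's man page:

    <clusternode name="node2" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <!-- delay here is hypothetical; not all agents support it -->
          <device name="fencedev" delay="15"/>
        </method>
      </fence>
    </clusternode>

With that, node2 waits 15 seconds before shooting, so the other node wins if both are alive.)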

Is this resolved with the 5.1 release we did a few weeks ago?

I'm using the latest release.

Another poster stated that he could not get the cluster to function properly since the switch to OpenAIS. Hence I'm speculating that they are related.
Doubtful.  There have been issues with Cisco switch configurations not passing multicast traffic properly.  All of those have been resolved with a switch configuration change.
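(For the archives: the usual culprit is IGMP snooping dropping the cluster's multicast when no querier is present on the VLAN. On IOS the fix is typically along these lines, though the exact syntax varies by platform and IOS version:

    ! enable an IGMP snooping querier on the cluster VLAN
    Switch(config)# ip igmp snooping querier
)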

I don't know why it "stared at me" instead of recovering the service, because debugging is lacking. I really think that if "verbose debugging" were even just a compile-time option and users had to install "testing" rpms, all the problems would have been flushed out long ago.
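(The one knob I know of is rgmanager's log level in cluster.conf, a sketch per its docs, with 7 being syslog debug:

    <rm log_level="7" log_facility="local4">
      <!-- existing failover domains and services go here -->
    </rm>
)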


...
Both of these are part of the bigger-picture resource monitoring work that Lon and some of the Linux-HA guys are jointly working on, converging to a single base.  See this page:
http://people.redhat.com/lhh/cluster-integration.html

Which, again, is not very visible :-(.
From a distance, it seems that 5.0 and 5.1 are less stable than 4.4 and 4.5 (I've only tried the current ones). If big changes were made and released prematurely, they are being shaken out by production clusters instead of test clusters.

How much of this "not very visible" work is being tested by a larger group?


3. Time for Cluster Summit again - location preferences, timeframe, funding, etc?

Summits are better than closed development, but users like me are never going to attend. A community-based site is a good foundation.

By the way, I am a C programmer. (From Windows land, though we use RH on all of our servers.) I've spent a month trying to get this to work. It's open source, and given enough time I can make it go. I don't have any more time. It's supposed to be production quality.

I have a failure case staring at me, but debugging is lacking, so I have to look elsewhere for a solution. I can't sit here dangling my feet waiting, and I can't spend weeks fixing it myself.

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
