Hello Kevin, thanks for your reply.

Kevin Anderson wrote:
> On Tue, 2008-09-09 at 17:51 +0200, Gerhard Spiegl wrote:
>> Hello all!
>>
>> We are trying to set up a 20 node cluster and want to use a
>> "ping-heuristic" and a heuristic that checks the state of
>> the fiberchannel ports.
>
> What actions do you want to take place based on these heuristics?

The node should get fenced (or fence/reboot itself) if the public interface (bond0) loses its connection, or if both paths (dm-multipath) to the storage are lost.

Without a quorum device: we faced the problem that a complete loss of storage connectivity causes GFS to withdraw (only when I/O is issued on it; we only use GFS for Xen VM definition files), leaving GFS and CLVM locked up and never released. Only a manual reboot/halt resolves the situation (in addition, the node in question gets fenced after poweroff - a trifle too late ;)).

With a quorum device: the node losing the storage gets fenced because it loses the qdisk. Obviously; but with more than 16 nodes qdisk is not an option, so I wrote a small shell script to check the Fibre Channel paths (simplified sketch appended below). The idea is that when FC connectivity is lost, the node fences/reboots itself. But the heuristics only work when a device (e.g. device="/dev/dm-8") is specified in the cluster.conf <quorumd ...> tag; without it, qdiskd refuses to start.

>> Is it possible to use qdisk heuristics without a dedicated
>> quorum partition, as this setup would only support 16 nodes?
>>
> There is a 16 node limitation to qdisk primarily because we think
> performance hitting the same small number of blocks on the disk by that
> many nodes will be abysmal. Lon would know, but probably a value you
> could change and play with in the code.

I read about this in the cluster wiki/FAQ and it sounds plausible. We also don't want to play around in the source, as our goal is a configuration fully supported by Red Hat.

> Am more interested in what problem you are trying to solve with the
> heuristics? It doesn't seem to be quorum related as the normal
> cman/openais capabilities will work fine with that number of nodes.

It seems it is not quorum related: as stated, the loss of storage connectivity causes the whole cluster to malfunction, which is not expected. If it helps I will send our cluster.conf tomorrow, as I'm out of the office today (CET). Maybe there is another way of detecting the storage failure, but I couldn't find any docs. I would also be glad if you could point me to more comprehensive documentation anywhere on the net.

> you are worried about split sites, just add an additional node to the
> cluster that is some other location. The node would only be used for
> quorum votes.

I am not sure what you mean by split sites (split brain?), but that's not the issue. Do you mean an additional node without any service or failover domain configured?

regards
Gerhard
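
P.S.: For reference, a simplified sketch of the FC path check script (the real one differs in detail; this version relies on the HBAs reporting their state through /sys/class/fc_host and on qdiskd treating a non-zero exit as a failed heuristic):

#!/bin/sh
# qdiskd heuristic: exit 0 while at least one FC port is online,
# non-zero once all ports are down.
# Assumes the HBAs expose their state via /sys/class/fc_host.

for host in /sys/class/fc_host/host*; do
    [ -r "$host/port_state" ] || continue
    if [ "$(cat "$host/port_state")" = "Online" ]; then
        exit 0  # at least one path left, node counts as healthy
    fi
done

exit 1  # no online FC port: heuristic fails, node should get fenced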
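
And this is roughly the <quorumd> section from our cluster.conf - the device attribute is the part I would like to get rid of; the interval/tko/votes values, the ping target and the script path are placeholders, not our real ones:

<quorumd interval="2" tko="5" votes="1" device="/dev/dm-8">
    <heuristic program="ping -c1 -w1 192.168.1.1" score="1" interval="2"/>
    <heuristic program="/usr/local/sbin/check_fc_paths.sh" score="1" interval="2"/>
</quorumd>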