Re: qdiskd + cman: trying to fix the use of quorumdev_poll.

Lon Hohberger <lhh@xxxxxxxxxx> · Mon, 08 Jan 2007 10:43:26 -0500

On Sun, 2007-01-07 at 20:29 +0100, Simone Gotti wrote:
> Problem 2)
> 
> After fixing Problem 1, if I set in the quorumd tag of cluster.conf an
> interval > quorumdev_poll/1000*2 the quorum is lost then regained over
> and over as the polling frequency of qdiskd is less than the polling one
> of cman.
> Probably the right thing to do is to calculate the value of
> quorumdev_poll from the ccs return value of "/cluster/quorumd/@interval"
> and quorumdev_poll=interval*1000*2 should be ok.

I think the poll rate should be closer to (interval * tko * 1000) ms [10
seconds by default], rather than a function of the quorum disk interval
alone.

This is because after (interval*tko*1000), the master node of the
cluster will write an eviction message to a hung node - and that's when
qdiskd will either reboot the node or tell CMAN that its votes are no
longer valid.

I do not think it will cause any problems per se, but dropping qdiskd's
votes after ~2 seconds when the qdisk master won't write an eviction
notice for another ~8 seconds seems a bit odd.
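
To make the ~2 second vs. ~8 second gap concrete, here is a rough sketch
of the arithmetic. The function names are hypothetical (this is not
qdiskd or CMAN code), and interval=1 / tko=10 are assumed values chosen
to match the 10 second default mentioned above.

```python
# Illustrative arithmetic only; names are hypothetical, not qdiskd/CMAN code.

def poll_from_interval(interval_s):
    """Originally proposed value: quorumdev_poll = interval * 1000 * 2 (ms)."""
    return interval_s * 1000 * 2


def poll_from_eviction(interval_s, tko):
    """Suggested value: interval * tko * 1000 (ms), i.e. when the master evicts."""
    return interval_s * tko * 1000


# Assumed values that yield the 10 second figure cited above.
interval, tko = 1, 10

early = poll_from_interval(interval)      # when the votes would be dropped
evict = poll_from_eviction(interval, tko)  # when the eviction notice is written
gap = evict - early                        # time the votes are gone before eviction

print(early, evict, gap)
```

With these values the votes would be dropped at 2000 ms, while the
eviction notice is not written until 10000 ms, leaving the 8000 ms gap
described above.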

Normal node failure delay should be >= 2 * (interval * tko * 1000) ms.
There's a parameter in the <totem> tag (which defaults to 5,000 ms) that
should be set to 2 * interval * tko * 1000, but I don't recall its name
right now.

qdiskd needs to time out before CMAN does.  While it doesn't have to be
"half or less", it's a good paranoia factor that's easy to remember, and
it gives the node plenty of time.
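
As a quick sanity check of the rule of thumb, something like the sketch
below could be used. It is only an illustration: check_timeouts() and
its return labels are made up for this example, and cman_timeout_ms
stands in for the <totem> parameter mentioned above.

```python
# Hypothetical helper, not part of qdiskd or CMAN.

def qdisk_timeout_ms(interval_s, tko):
    """Time after which the qdisk master evicts a hung node (ms)."""
    return interval_s * tko * 1000


def check_timeouts(interval_s, tko, cman_timeout_ms):
    """Classify a CMAN failure timeout relative to the qdiskd timeout."""
    qd = qdisk_timeout_ms(interval_s, tko)
    if cman_timeout_ms <= qd:
        return "unsafe"  # CMAN would time out before (or with) qdiskd
    if cman_timeout_ms < 2 * qd:
        return "tight"   # qdiskd times out first, but with less than 2x margin
    return "ok"          # at least the 2x "paranoia factor"


# interval=1 and tko=10 are assumed values giving a 10 second qdiskd timeout.
print(check_timeouts(1, 10, 5000))   # the 5,000 ms default mentioned above
print(check_timeouts(1, 10, 15000))
print(check_timeouts(1, 10, 20000))
```

Under these assumptions the 5,000 ms default is below the qdiskd
timeout, and 20,000 ms is the smallest value that satisfies the 2x rule.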

-- Lon
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster