Hi all,

I'm using the openais-based cman-2.0.35.el5 and I'm trying to understand how the quorum disk concept is implemented in RHCS. After various experiments I think I have found at least 2 problems:

Problem 1) Small bug in the quorum disk polling mechanism:

Looking at the code in cman/daemon/commands.c, the variable quorumdev_poll = 10000 is expressed in milliseconds and is used to call "quorum_device_timer_fn" every quorumdev_poll interval, to check whether qdiskd is still informing cman that the node can use the quorum votes.

The same variable is then used inside quorum_device_timer_fn, but there it is treated as seconds:

    if (quorum_device->last_hello.tv_sec + quorumdev_poll < now.tv_sec) {

so, when qdiskd dies or access to the quorum disk is lost, it takes 10000 seconds (more than 2 hours) for cman to notice and recalculate the quorum.

After changing the line:

========================================================================
--- cman-2.0.35.orig/cman/daemon/commands.c 2007-01-07 21:01:30.000000000 +0100
+++ cman-2.0.35.patched/cman/daemon/commands.c 2007-01-05 18:12:33.000000000 +0100
@@ -1038,15 +1037,12 @@ static void ccsd_timer_fn(void *arg)
 static void quorum_device_timer_fn(void *arg)
 {
 	struct timeval now;
 	if (!quorum_device || quorum_device->state == NODESTATE_DEAD)
 		return;
 	gettimeofday(&now, NULL);
-	if (quorum_device->last_hello.tv_sec + quorumdev_poll < now.tv_sec) {
+	if (quorum_device->last_hello.tv_sec + quorumdev_poll/1000 < now.tv_sec) {
 		quorum_device->state = NODESTATE_DEAD;
 		log_msg(LOG_INFO, "lost contact with quorum device\n");
 		recalculate_quorum(0);
========================================================================

it worked. A more precise fix would be to do the comparison in milliseconds, using tv_usec/1000 together with tv_sec, instead of comparing whole seconds only.

Problem 2) After fixing Problem 1, if I set in the quorumd tag of cluster.conf an interval > quorumdev_poll/1000*2, the quorum is lost and then regained over and over, because qdiskd polls less frequently than cman does.
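
To make the unit mismatch from Problem 1 concrete, here is a minimal standalone sketch (not cman code). The helper names expired_mixed_units, expired_seconds, expired_ms and timeval_to_ms are hypothetical and exist only for this illustration; only the two comparison expressions mirror the lines from commands.c quoted above, and I am assuming quorumdev_poll is a plain int.

```c
#include <sys/time.h>

/* Convert a struct timeval to whole milliseconds. */
static long long timeval_to_ms(const struct timeval *tv)
{
    return (long long)tv->tv_sec * 1000LL + tv->tv_usec / 1000;
}

/* Original comparison: poll_ms (milliseconds) is added to tv_sec (seconds). */
static int expired_mixed_units(const struct timeval *last_hello,
                               const struct timeval *now, int poll_ms)
{
    return last_hello->tv_sec + poll_ms < now->tv_sec;
}

/* Patched comparison: whole seconds only, sub-second part is ignored. */
static int expired_seconds(const struct timeval *last_hello,
                           const struct timeval *now, int poll_ms)
{
    return last_hello->tv_sec + poll_ms / 1000 < now->tv_sec;
}

/* Possible refinement: compare both timestamps in milliseconds. */
static int expired_ms(const struct timeval *last_hello,
                      const struct timeval *now, int poll_ms)
{
    return timeval_to_ms(last_hello) + poll_ms < timeval_to_ms(now);
}
```

With poll_ms = 10000 and 12 seconds of silence, the mixed-unit check still reports the device as alive, while the other two report it as lost. At 10.5 seconds of silence only the millisecond version reports a loss, which is the precision difference mentioned above. Note that with any of these checks a gap between hellos longer than the timeout looks like a lost device, which would be consistent with the flapping described in Problem 2.
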
Probably the right thing to do is to calculate quorumdev_poll from the value that ccs returns for "/cluster/quorumd/@interval"; quorumdev_poll = interval*1000*2 should be ok.

What do you think about these problems? I'd be happy to fix them and provide a full patch.

Thanks.
Bye!

--
Simone Gotti

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster