.. small correction of the qdiskd->heuristic script timing:

dummy: Fri May 13 08:59:16 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1 <-- qdiskd restart, rval=1
dummy: Fri May 13 08:59:21 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1
dummy: Fri May 13 08:59:26 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1
dummy: Fri May 13 08:59:31 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1
dummy: Fri May 13 08:59:36 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1
dummy: Fri May 13 08:59:41 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1
dummy: Fri May 13 08:59:46 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1
dummy: Fri May 13 08:59:51 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=1
dummy: Fri May 13 08:59:56 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 <-- changed script, rval=0
dummy: Fri May 13 09:00:01 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:00:06 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:00:11 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 <-- until this point ok (dt=5s)
dummy: Fri May 13 09:01:53 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 <-- below: ?? every 103s ?
dummy: Fri May 13 09:03:35 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:05:17 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:06:58 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:08:40 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:10:22 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:12:04 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:23:46 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 <-- ?? no regular checks ?
dummy: Fri May 13 09:31:48 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 10:20:19 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 10:40:29 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0

-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Gerbatsch, Andre
Sent: Friday, 13 May 2011 12:10
To: 'linux-cluster@xxxxxxxxxx'
Subject: qdiskd does not call heuristics regularly?

Hello,

I'm at a point where I have received different answers from different experts and have read the qdiskd source code myself, and I would be happy if someone could help me.

I expected that, with my configuration (see below), the heuristic script would be called on a regular basis (every "interval" seconds), so that it has a chance to influence the quorumd score if something happens to the cluster node. What I see instead is that it is called for a few cycles during quorum device initialization; after that, the heuristic is only called "from time to time".

Question 1: Is this the expected behavior? If yes, is there a way to have the heuristic called regularly?

Question 2: How can I determine which cman/qdisk version I am using (cman_1_0_???; see rpm -qi cman below)?

The final effect is: if I disconnect one node of a 2-node cluster from the network, the "wrong" node wins, and the heuristic has no influence on the fencing decision.

Thank you in advance for any response.

Andre

=================================================
== rpm -qi cman
Name        : cman                         Relocations: (not relocatable)
Version     : 2.0.115                           Vendor: Red Hat, Inc.
Release     : 68.el5_6.1                    Build Date: Mon Dec 20 19:28:36 2010
Install Date: Thu Apr 28 11:11:43 2011      Build Host: ls20-bc2-14.build.redhat.com
Group       : System Environment/Base       Source RPM: cman-2.0.115-68.el5_6.1.src.rpm
Size        : 2619414                          License: GPL
Signature   : DSA/SHA1, Fri Dec 31 06:29:03 2010, Key ID 5326810137017186
Packager    : Red Hat, Inc.
<http://bugzilla.redhat.com/bugzilla>
URL         : http://sources.redhat.com/cluster/
Summary     : cman - The Cluster Manager
Description :
cman - The Cluster Manager

== cluster.conf:
..
<totem consensus="4800" join="60" token="60000" token_retransmits_before_loss_const="20"/>
<quorumd status_file="/tmp/qdiskd_status" log_level="7" interval="5" device="/dev/mapper/xp1_00p1" tko="5" votes="1">
        <heuristic interval="5" program="/root/root/cluster/checkpvtlink.sh eth0" score="1" tko="3"/>
</quorumd>
..

== > ps -eLf | grep qdiskd
root      3976     1  3976  0    3 08:59 ?        00:00:00 qdiskd -Q
root      3976     1  3978  0    3 08:59 ?        00:00:00 qdiskd -Q
root      3976     1  4226  0    3 08:59 ?        00:00:00 qdiskd -Q
root     21613 12673 21613  0    1 10:45 pts/0    00:00:00 grep qdiskd

== strace "score thread" (hopefully :-) = it seems simply waiting for some timer...
clock_gettime(CLOCK_MONOTONIC, {60774, 182881847}) = 0
clock_gettime(CLOCK_MONOTONIC, {60774, 182920847}) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], NULL, 8) = 0
nanosleep({1, 0}, {1, 0}) = 0
clock_gettime(CLOCK_MONOTONIC, {60775, 202918847}) = 0
clock_gettime(CLOCK_MONOTONIC, {60775, 202961847}) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], NULL, 8) = 0
nanosleep({1, 0}, {1, 0}) = 0
clock_gettime(CLOCK_MONOTONIC, {60776, 222868847}) = 0
clock_gettime(CLOCK_MONOTONIC, {60776, 222912847}) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[QUIT ILL TRAP ABRT BUS FPE KILL SEGV TERM CHLD STOP RTMIN RT_1], NULL,
8) = 0
nanosleep({1, 0}, <unfinished ...>
Process 3978 detached

.. It seems to me that this is the score thread with a "wrong" h->nextrun, but I think I simply do not understand something.

cman/qdiskd/score.c, from http://git.fedorahosted.org/git/?p=cluster.git;a=summary

fork_heuristic(struct h_data *h)
{
        ...
        now = time(NULL);
        if (now < h->nextrun)
                return 0;

        h->nextrun = now + h->interval;

        pid = fork();

== output from heuristic test script
> cat checkpvtlink.sh
#!/bin/sh
rval=0
echo "dummy: $(date) $0 rval=$rval" >> /root/root/cluster/checkpvtlink.log
exit $rval

> tail checkpvtlink.log
dummy: Fri May 13 09:03:35 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 <== service qdiskd restart
dummy: Fri May 13 09:05:17 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:06:58 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:08:40 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:10:22 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:12:04 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:23:46 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 09:31:48 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0
dummy: Fri May 13 10:20:19 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0 <== why so late ??
dummy: Fri May 13 10:40:29 CEST 2011 /root/root/cluster/checkpvtlink.sh rval=0

Andre Gerbatsch
MTS IT Systems Engineer
Tel +49 (0) 351 277-1762
Fax +49 (0) 351 277-91762
andre.gerbatsch@xxxxxxxxxxxxxxxxxxx

GLOBALFOUNDRIES Dresden Module Two GmbH & Co. KG
Wilschdorfer Landstr. 101, 01109 Dresden, Germany, registered office Dresden | Register court Dresden HRA 4896

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
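
For anyone trying to reproduce the irregular heuristic intervals described in this thread, it may help to let the test script record the gap between invocations itself, rather than reading it off the timestamps by hand. The following is only a sketch along the lines of the checkpvtlink.sh dummy above: the LOG and STAMP paths and the log_gap function are hypothetical names chosen for illustration, not anything qdiskd provides, and "date +%s" assumes GNU or BSD date.

```shell
#!/bin/sh
# Sketch only: a variant of the dummy heuristic that also logs the gap (in
# seconds) since its previous invocation. LOG, STAMP and log_gap are
# hypothetical names, not part of qdiskd; "date +%s" assumes GNU or BSD date.
LOG="${LOG:-/tmp/checkpvtlink.log}"
STAMP="${STAMP:-/tmp/checkpvtlink.last}"

# log_gap RVAL: append one log line with the gap since the last call,
# remember the current time, and return RVAL as the exit status.
log_gap() {
    now=$(date +%s)
    last=$now                       # first call: report a gap of 0
    [ -r "$STAMP" ] && last=$(cat "$STAMP")
    echo "$now" > "$STAMP"
    echo "dummy: $(date) $0 gap=$((now - last))s rval=$1" >> "$LOG"
    return "$1"
}

# The status of the last command becomes the script's exit status, which is
# the value the heuristic reports back to its caller.
log_gap 0
```

With interval="5" in the heuristic element, one would expect the logged gap to stay near 5s; much larger gaps would show up directly in the log without any timestamp arithmetic.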