OK, I've done some more testing now, and manually running "killall fenced; fenced; fence_tool join -c" after each reboot on every node seems to be the only way to get fenced into a working state. But that's obviously not a good solution. I tried adding clean_start="" to cluster.conf, but that didn't help.

Is this perhaps a fenced bug? And if so, how can I help debug it or work around it?

Regards,
Jonas

-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Borgström Jonas
Sent: 22 August 2007 17:31
To: linux clustering
Subject: RE: Node fencing problem

Yes, that did the trick. I did the following on both nodes:

  $ killall fenced
  $ fenced
  $ fence_tool join -c

After that I tested the same thing as earlier, but this time the failed node was fenced!

/var/log/messages output:

  Aug 22 16:08:38 test-db2 openais[3404]: [CLM ] CLM CONFIGURATION CHANGE
  Aug 22 16:08:38 test-db2 openais[3404]: [CLM ] New Configuration:
  Aug 22 16:08:38 test-db2 kernel: dlm: closing connection to node 1
  Aug 22 16:08:38 test-db2 fenced[28702]: test-db1.example.com not a cluster member after 0 sec post_fail_delay
  Aug 22 16:08:38 test-db2 openais[3404]: [CLM ] r(0) ip(10.100.2.6)
  Aug 22 16:08:38 test-db2 fenced[28702]: fencing node "test-db1.example.com"

So I guess the only thing left to do now is to understand why it doesn't work without the commands above. The only reason I can think of is that I'm using fence_drac to power the nodes on and off, so one node is usually started 10-20 seconds after the other, and this usually results in one node being fenced by the other before the cluster starts up. After that everything appears to work, but I guess that somehow confuses fenced.
Here's the "group_tool dump fence" output after a restart of all nodes:

test-db1: http://pastebin.com/m73104841
test-db2: http://pastebin.com/ma427c89

/ Jonas

-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Jeremy Carroll
Sent: 22 August 2007 16:57
To: linux clustering
Subject: RE: Node fencing problem

Try restarting fenced cleanly. Stop the fenced service and run this command:

  fence_tool join -c

-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Borgström Jonas
Sent: Wednesday, August 22, 2007 9:49 AM
To: linux clustering
Subject: RE: Node fencing problem

Yes, both "fence_drac ..." and "fence_node test-db1.example.com" work.

The strange thing is that during the test I described earlier, it looks like the cluster didn't even try to fence the failed node. /var/log/messages didn't mention anything about trying to fence any node, and neither did "group_tool dump fence".

And even if something were wrong with the fence_drac configuration, wouldn't fence_manual kick in instead?

/ Jonas

-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Jeremy Carroll
Sent: 22 August 2007 16:32
To: linux clustering
Subject: RE: Node fencing problem

See if the fence agent is configured correctly. Run this command to see if it shuts down the node in question:

  fence_drac -a 10.100.2.40 -l testdb -p <passwd> -o off

If it does not work, I would check whether access is enabled. By default, the telnet interface is not enabled. To enable the interface, you will need to use the racadm command in the racser-devel rpm available from Dell.
To enable telnet on the DRAC:

  [root]# racadm config -g cfgSerial -o cfgSerialTelnetEnable 1
  [root]# racadm racreset

-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Borgström Jonas
Sent: Wednesday, August 22, 2007 9:22 AM
To: linux clustering
Subject: RE: Node fencing problem

As you can see here http://pastebin.com/m7ac9376d I've configured both fence_drac and fence_manual. And fenced appears to be running:

  [root@test-db1 ~]# ps ax | grep fence
   3412 ?        Ss     0:00 /sbin/fenced
   5109 pts/0    S+     0:00 grep fence
  [root@test-db1 ~]# cman_tool services
  type             level name       id       state
  fence            0     default    00010001 JOIN_START_WAIT
  [1 2]
  dlm              1     clvmd      00020002 JOIN_START_WAIT
  [1 2]
  dlm              1     rgmanager  00030002 JOIN_START_WAIT
  [1 2]
  dlm              1     pg_fs      00050002 JOIN_START_WAIT
  [1 2]
  gfs              2     pg_fs      00040002 JOIN_START_WAIT
  [1 2]

And on test-db2:

  [root@test-db2 ~]# ps ax | grep fence
   3428 ?        Ss     0:00 /sbin/fenced
   8848 pts/0    S+     0:00 grep fence
  [root@test-db2 ~]# cman_tool services
  type             level name       id       state
  fence            0     default    00010002 JOIN_START_WAIT
  [1 2]
  dlm              1     clvmd      00020002 JOIN_START_WAIT
  [1 2]
  dlm              1     rgmanager  00030002 JOIN_START_WAIT
  [1 2]
  dlm              1     pg_fs      00050002 JOIN_START_WAIT
  [1 2]
  gfs              2     pg_fs      00040002 JOIN_START_WAIT
  [1 2]

/ Jonas

-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Jeremy Carroll
Sent: 22 August 2007 15:47
To: linux clustering
Subject: RE: Node fencing problem

What type of fencing method are you using on your cluster? Also, can you run "cman_tool services" on both nodes to make sure fenced is running?

-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Borgström Jonas
Sent: Wednesday, August 22, 2007 4:07 AM
To: linux-cluster@xxxxxxxxxx
Subject: Node fencing problem

Hi,

We're having some problems getting fencing to work as expected on our two-node cluster.
Our cluster.conf file: http://pastebin.com/m7ac9376d
kernel version: 2.6.18-8.1.8.el5
cman version: 2.0.64-1.0.1.el5

When I simulate a network failure on a node, I expect it to be fenced by the other node, but that doesn't happen for some reason.

Steps to reproduce:
1. Start the cluster
2. Mount a GFS filesystem on both nodes (test-db1 and test-db2)
3. Simulate a net failure on test-db1 http://pastebin.com/m19fda088

Expected result:
1. Node test-db2 detects that test-db1 failed
2. test-db1 gets fenced by test-db2
3. test-db2 replays the GFS journal (filesystem writable again)
4. Services fail over from test-db1 to test-db2

Actual result:
1. Node test-db2 detects that something happened to test-db1
2. test-db2 replays the GFS journal (filesystem writable again)
3. The service on test-db1 is still listed as started and is not failed over to test-db2, even though test-db2 thinks test-db1 is "offline".

Log files and debug output from test-db2:
/var/log/messages after the failure: http://pastebin.com/m2fe4ce36
"group_tool dump fence" output: http://pastebin.com/m79d21ed9
clustat output: http://pastebin.com/m4d1007c2

And if I restore network connectivity on test-db1, the filesystem becomes writable on that node as well, which probably results in filesystem corruption.

I think the fencedevice part of cluster.conf is correct, since nodes are sometimes fenced when the cluster is started and one node isn't joining fast enough.

What am I doing wrong?

Regards,
Jonas

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
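
For anyone checking for the same symptom: the "cman_tool services" listings in this thread show every group sitting in JOIN_START_WAIT. Below is a minimal sketch of a filter that lists such groups. It assumes the five-column layout shown in the thread (type, level, name, id, state, with the member list on a line of its own) and assumes that a settled group reports the state "none". The function name stuck_groups is made up for illustration and is not part of cman.

```shell
#!/bin/sh
# stuck_groups: hypothetical helper, not part of cman.
# Reads "cman_tool services"-style output on stdin and prints
# "type name state" for every group whose state is not "none".
# Assumption: columns are type, level, name, id, state, and the
# member list (e.g. "[1 2]") is on a separate line.
stuck_groups() {
    awk '
        $1 == "type"  { next }   # skip the header row
        $1 ~ /^\[/    { next }   # skip member-list lines such as "[1 2]"
        NF >= 5 && $5 != "none" { print $1, $3, $5 }
    '
}
```

On a node it could be run as "cman_tool services | stuck_groups". Empty output would mean that no group is waiting. It only reads the listing and does not restart or rejoin anything.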