Yes, both "fence_drac ..." and "fence_node test-db1.example.com" work. The strange thing is that during the test I described earlier the cluster didn't even seem to try to fence the failed node. /var/log/messages didn't mention anything about trying to fence any node, and neither did "group_tool dump fence". And even if something were wrong with the fence_drac configuration, wouldn't fence_manual kick in instead?

/ Jonas
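For reference, here is a sketch of what can be checked on the surviving node to see whether the cluster ever registered the failure and whether the fence domain tried to act on it. This assumes the standard cman 2.0 tools shipped with RHEL 5; exact output will vary:

[root@test-db2 ~]# cman_tool status        # quorum and membership summary
[root@test-db2 ~]# cman_tool nodes         # is test-db1 actually marked as down?
[root@test-db2 ~]# group_tool ls           # state of the fence, dlm and gfs groups
[root@test-db2 ~]# group_tool dump fence   # fence daemon activity, as mentioned above

Note also that when the fence_manual method does fire, the fence has to be acknowledged by hand before GFS recovery continues, along these lines (option syntax from memory, check the man page):

[root@test-db2 ~]# fence_ack_manual -n test-db1.example.com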
-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Jeremy Carroll
Sent: 22 August 2007 16:32
To: linux clustering
Subject: RE: Node fencing problem

See if the fence agent is configured correctly. Run this command to see if it shuts down the node in question:

fence_drac -a 10.100.2.40 -l testdb -p <passwd> -o off

If it does not work, I would check whether access is enabled. By default, the telnet interface is not enabled. To enable it, you will need the racadm command from the racser-devel rpm available from Dell. To enable telnet on the DRAC:

[root]# racadm config -g cfgSerial -o cfgSerialTelnetEnable 1
[root]# racadm racreset

-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Borgström Jonas
Sent: Wednesday, August 22, 2007 9:22 AM
To: linux clustering
Subject: RE: Node fencing problem

As you can see here, http://pastebin.com/m7ac9376d, I've configured both fence_drac and fence_manual. And fenced appears to be running:

[root@test-db1 ~]# ps ax | grep fence
 3412 ?        Ss     0:00 /sbin/fenced
 5109 pts/0    S+     0:00 grep fence
[root@test-db1 ~]# cman_tool services
type             level name       id       state
fence            0     default    00010001 JOIN_START_WAIT
[1 2]
dlm              1     clvmd      00020002 JOIN_START_WAIT
[1 2]
dlm              1     rgmanager  00030002 JOIN_START_WAIT
[1 2]
dlm              1     pg_fs      00050002 JOIN_START_WAIT
[1 2]
gfs              2     pg_fs      00040002 JOIN_START_WAIT
[1 2]

And on test-db2:

[root@test-db2 ~]# ps ax | grep fence
 3428 ?        Ss     0:00 /sbin/fenced
 8848 pts/0    S+     0:00 grep fence
[root@test-db2 ~]# cman_tool services
type             level name       id       state
fence            0     default    00010002 JOIN_START_WAIT
[1 2]
dlm              1     clvmd      00020002 JOIN_START_WAIT
[1 2]
dlm              1     rgmanager  00030002 JOIN_START_WAIT
[1 2]
dlm              1     pg_fs      00050002 JOIN_START_WAIT
[1 2]
gfs              2     pg_fs      00040002 JOIN_START_WAIT
[1 2]

/ Jonas

-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Jeremy Carroll
Sent: 22 August 2007 15:47
To: linux clustering
Subject: RE: Node fencing problem

What type of fencing method are you using on your cluster? Also, can you run "cman_tool services" on both nodes to make sure fenced is running?

-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Borgström Jonas
Sent: Wednesday, August 22, 2007 4:07 AM
To: linux-cluster@xxxxxxxxxx
Subject: Node fencing problem

Hi,

We're having some problems getting fencing to work as expected on our two-node cluster.

Our cluster.conf file: http://pastebin.com/m7ac9376d
Kernel version: 2.6.18-8.1.8.el5
cman version: 2.0.64-1.0.1.el5

When I simulate a network failure on a node I expect it to be fenced by the other node, but for some reason that doesn't happen.

Steps to reproduce:
1. Start the cluster
2. Mount a GFS filesystem on both nodes (test-db1 and test-db2)
3. Simulate a network failure on test-db1: http://pastebin.com/m19fda088

Expected result:
1. Node test-db2 detects that test-db1 failed
2. test-db1 gets fenced by test-db2
3. test-db2 replays the GFS journal (filesystem writable again)
4. Services fail over from test-db1 to test-db2

Actual result:
1. Node test-db2 detects that something happened to test-db1
2. test-db2 replays the GFS journal (filesystem writable again)
3. The service on test-db1 is still listed as started and has not failed over to test-db2, even though test-db2 thinks test-db1 is "offline"

Log files and debug output from test-db2:
/var/log/messages after the failure: http://pastebin.com/m2fe4ce36
"group_tool dump fence" output: http://pastebin.com/m79d21ed9
clustat output: http://pastebin.com/m4d1007c2

And if I restore network connectivity on test-db1, the filesystem becomes writable on that node as well, which will probably result in filesystem corruption.

I think the fencedevice part of cluster.conf is correct, since nodes are sometimes fenced when the cluster is started and one node doesn't join fast enough.

What am I doing wrong?

Regards,
Jonas
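The exact commands used to simulate the failure in step 3 are in the paste linked above. As a generic illustration (an assumption, not necessarily what was actually done here), one common way to simulate this kind of network failure is to drop all traffic on the node's cluster interface with iptables:

[root@test-db1 ~]# iptables -A INPUT -i eth0 -j DROP     # stop receiving cluster traffic
[root@test-db1 ~]# iptables -A OUTPUT -o eth0 -j DROP    # stop sending cluster traffic

Restoring connectivity afterwards is then just a matter of flushing the rules:

[root@test-db1 ~]# iptables -F

Here eth0 is a placeholder for whichever interface carries the cluster heartbeat.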