As you can see here (http://pastebin.com/m7ac9376d), I've configured both fence_drac and fence_manual, and fenced appears to be running:

[root@test-db1 ~]# ps ax | grep fence
 3412 ?        Ss     0:00 /sbin/fenced
 5109 pts/0    S+     0:00 grep fence

[root@test-db1 ~]# cman_tool services
type             level name       id       state
fence            0     default    00010001 JOIN_START_WAIT
[1 2]
dlm              1     clvmd      00020002 JOIN_START_WAIT
[1 2]
dlm              1     rgmanager  00030002 JOIN_START_WAIT
[1 2]
dlm              1     pg_fs      00050002 JOIN_START_WAIT
[1 2]
gfs              2     pg_fs      00040002 JOIN_START_WAIT
[1 2]

And on test-db2:

[root@test-db2 ~]# ps ax | grep fence
 3428 ?        Ss     0:00 /sbin/fenced
 8848 pts/0    S+     0:00 grep fence

[root@test-db2 ~]# cman_tool services
type             level name       id       state
fence            0     default    00010002 JOIN_START_WAIT
[1 2]
dlm              1     clvmd      00020002 JOIN_START_WAIT
[1 2]
dlm              1     rgmanager  00030002 JOIN_START_WAIT
[1 2]
dlm              1     pg_fs      00050002 JOIN_START_WAIT
[1 2]
gfs              2     pg_fs      00040002 JOIN_START_WAIT
[1 2]

/ Jonas

-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Jeremy Carroll
Sent: 22 August 2007 15:47
To: linux clustering
Subject: RE: Node fencing problem

What type of fencing method are you using on your cluster? Also, can you run "cman_tool services" on both nodes to make sure fenced is running?

-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Borgström Jonas
Sent: Wednesday, August 22, 2007 4:07 AM
To: linux-cluster@xxxxxxxxxx
Subject: Node fencing problem

Hi,

We're having some problems getting fencing to work as expected on our two-node cluster.

Our cluster.conf file: http://pastebin.com/m7ac9376d
kernel version: 2.6.18-8.1.8.el5
cman version: 2.0.64-1.0.1.el5

When I simulate a network failure on a node, I expect it to be fenced by the other node, but for some reason that doesn't happen.

Steps to reproduce:
1. Start the cluster
2. Mount a GFS filesystem on both nodes (test-db1 and test-db2)
3. Simulate a network failure on test-db1: http://pastebin.com/m19fda088

Expected result:
1. Node test-db2 detects that test-db1 has failed
2. test-db1 gets fenced by test-db2
3. test-db2 replays the GFS journal (filesystem writable again)
4. Services fail over from test-db1 to test-db2

Actual result:
1. Node test-db2 detects that something happened to test-db1
2. test-db2 replays the GFS journal (filesystem writable again)
3. The service on test-db1 is still listed as started and is not failed over to test-db2, even though test-db2 considers test-db1 "offline"

Log files and debug output from test-db2:
/var/log/messages after the failure: http://pastebin.com/m2fe4ce36
"group_tool dump fence" output: http://pastebin.com/m79d21ed9
clustat output: http://pastebin.com/m4d1007c2

And if I restore network connectivity on test-db1, the filesystem becomes writable on that node as well, which will probably result in filesystem corruption.

I think the fencedevice part of cluster.conf is correct, since nodes are sometimes fenced when the cluster is started and one node doesn't join fast enough.

What am I doing wrong?

Regards,
Jonas

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
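
The cluster.conf discussed in the thread is only available through the pastebin link and is not reproduced here. Purely as an illustration of how fence_drac and fence_manual are usually combined for a node in a two-node cluster (the node names come from the thread, but the method names, DRAC address, login, and password below are placeholders, not values from the actual configuration), the relevant cluster.conf section typically looks something like this:

    <clusternode name="test-db1" nodeid="1" votes="1">
      <fence>
        <!-- primary method: power-cycle the node via its DRAC card -->
        <method name="1">
          <device name="drac-db1"/>
        </method>
        <!-- fallback method: blocks until an operator runs fence_ack_manual -->
        <method name="2">
          <device name="manual" nodename="test-db1"/>
        </method>
      </fence>
    </clusternode>
    <!-- test-db2 is declared the same way, pointing at its own DRAC device -->
    <fencedevices>
      <!-- ipaddr/login/passwd are placeholders, not taken from the thread -->
      <fencedevice agent="fence_drac" name="drac-db1" ipaddr="192.168.0.10" login="root" passwd="calvin"/>
      <fencedevice agent="fence_manual" name="manual"/>
    </fencedevices>

With a layout like this, fenced tries the methods in order: the DRAC method first, and only if that fails does it fall back to manual fencing, which waits until an operator acknowledges the fence with fence_ack_manual.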
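
Likewise, the exact commands used in step 3 to simulate the network failure are behind the pastebin link (http://pastebin.com/m19fda088) and are not shown in the thread. One hypothetical way to stage that kind of failure (assuming the cluster traffic runs over eth0; this is not necessarily what was done here) is to silently drop all traffic on the interconnect with iptables:

    # on test-db1: drop all traffic on the (assumed) cluster interface
    iptables -A INPUT  -i eth0 -j DROP
    iptables -A OUTPUT -o eth0 -j DROP

    # later, restore connectivity by flushing the rules again
    iptables -F

Dropping packets rather than taking the interface down keeps the link state up, which from the other node's point of view looks closer to a genuine network partition.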