RE: Node fencing problem

Yes, that did the trick. I did the following on both nodes:

$ killall fenced
$ fenced
$ fence_tool join -c
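
For anyone repeating this on several nodes, the three steps above could be wrapped in a small helper. This is only a sketch: the commands are copied verbatim from this message, while the `restart_fence` function and its `dry_run` switch are illustrative names that are not part of any cluster tool. It assumes it is run as root on each node.

```python
import subprocess

# Commands copied from the message above, in the order they were run.
RESTART_STEPS = [
    ["killall", "fenced"],
    ["fenced"],
    ["fence_tool", "join", "-c"],
]


def restart_fence(dry_run=True):
    """Run the restart steps in order, or only list them when dry_run is True.

    With dry_run=False, subprocess.run(check=True) raises on the first
    command that exits non-zero, so later steps are not attempted.
    """
    executed = []
    for cmd in RESTART_STEPS:
        if not dry_run:
            subprocess.run(cmd, check=True)
        executed.append(" ".join(cmd))
    return executed


if __name__ == "__main__":
    # Default is a dry run: print what would be executed.
    for step in restart_fence(dry_run=True):
        print(step)
```

Calling `restart_fence(dry_run=False)` would actually execute the commands; the dry run is the default so the script can be checked before it touches a live node.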

After that I tested the same thing as earlier, and this time the failed node was fenced!

/var/log/messages output:

Aug 22 16:08:38 test-db2 openais[3404]: [CLM  ] CLM CONFIGURATION CHANGE 
Aug 22 16:08:38 test-db2 openais[3404]: [CLM  ] New Configuration: 
Aug 22 16:08:38 test-db2 kernel: dlm: closing connection to node 1
Aug 22 16:08:38 test-db2 fenced[28702]: test-db1.example.com not a cluster member after 0 sec post_fail_delay
Aug 22 16:08:38 test-db2 openais[3404]: [CLM  ]         r(0) ip(10.100.2.6)  
Aug 22 16:08:38 test-db2 fenced[28702]: fencing node "test-db1.example.com"
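
To check for this automatically after a test, the log can be scanned for the `fencing node` line. A minimal sketch, assuming the fenced log format is exactly what the excerpt above shows; the function name `fenced_nodes` is made up for illustration. Note that it only detects that fencing was started, since the excerpt does not show what a completion message looks like.

```python
import re

# Matches the 'fencing node "<name>"' line written by fenced, as seen above.
FENCING_RE = re.compile(r'fenced\[\d+\]: fencing node "([^"]+)"')


def fenced_nodes(log_lines):
    """Return the node names from 'fencing node' log lines, in log order."""
    names = []
    for line in log_lines:
        match = FENCING_RE.search(line)
        if match:
            names.append(match.group(1))
    return names


# Two lines taken from the /var/log/messages excerpt above.
sample = [
    'Aug 22 16:08:38 test-db2 fenced[28702]: test-db1.example.com '
    'not a cluster member after 0 sec post_fail_delay',
    'Aug 22 16:08:38 test-db2 fenced[28702]: fencing node "test-db1.example.com"',
]

if __name__ == "__main__":
    print(fenced_nodes(sample))
```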


So I guess the only thing left to do now is to understand why it doesn't work without the commands above. The only reason I can think of is that I'm using fence_drac to power the nodes on and off, so one node usually starts 10-20 seconds after the other, and this usually results in one node being fenced by the other before the cluster starts up.

After that everything appears to work, but I guess that this somehow confuses fenced.
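
The startup race suspected above can be written down as a very simplified model. It assumes that the first node waits a fixed number of seconds (fenced's post_join_delay) after joining the fence domain before fencing nodes that have not joined yet, and it ignores boot and cman join times. The function name and the delay values used below are hypothetical, not taken from this cluster's configuration.

```python
def fenced_at_startup(start_gap_s, post_join_delay_s):
    """Return True if the late node misses the window and would be fenced.

    start_gap_s:       seconds between the first and the second node starting
    post_join_delay_s: seconds the first node waits before fencing non-members
    """
    return start_gap_s > post_join_delay_s


if __name__ == "__main__":
    # Hypothetical delay of 6 seconds against the 10-20 second gap described above.
    for gap in (2, 10, 20):
        print(gap, fenced_at_startup(gap, post_join_delay_s=6))
```

Under this model, any delay shorter than the 10-20 second gap would lead to the late node being fenced at startup, which matches what I'm seeing.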

Here's the "group_tool dump fence" output after a restart of all nodes:

test-db1: http://pastebin.com/m73104841

test-db2: http://pastebin.com/ma427c89

/ Jonas
-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Jeremy Carroll
Sent: 22 August 2007 16:57
To: linux clustering
Subject: RE:  Node fencing problem

Try restarting fence cleanly. Stop the fenced service and run this command.

fence_tool join -c

-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Borgström Jonas
Sent: Wednesday, August 22, 2007 9:49 AM
To: linux clustering
Subject: RE:  Node fencing problem

Yes, both "fence_drac ..." and "fence_node test-db1.example.com" work.

The strange thing is that during the test I described earlier it looks like the cluster didn't even try to fence the failed node. /var/log/messages didn't mention anything about trying to fence any node, and neither did "group_tool dump fence".

And even if something were wrong with the fence_drac configuration, wouldn't fence_manual kick in instead?

/ Jonas
-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Jeremy Carroll
Sent: 22 August 2007 16:32
To: linux clustering
Subject: RE:  Node fencing problem

See if the fence agent is configured correctly. Run this command to see if it shuts down the node in question.

fence_drac -a 10.100.2.40 -l testdb -p <passwd> -o off

If it does not work I would check to see if access is enabled. By default, the telnet interface is not enabled. To enable the interface, you will need to use the racadm command in the racser-devel rpm available from Dell. To enable telnet on the DRAC:

[root]# racadm config -g cfgSerial -o cfgSerialTelnetEnable 1

[root]# racadm racreset


-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Borgström Jonas
Sent: Wednesday, August 22, 2007 9:22 AM
To: linux clustering
Subject: RE:  Node fencing problem

As you can see here http://pastebin.com/m7ac9376d I've configured both fence_drac and fence_manual.

And fenced appears to be running:
[root@test-db1 ~]# ps ax | grep fence
 3412 ?        Ss     0:00 /sbin/fenced
 5109 pts/0    S+     0:00 grep fence
[root@test-db1 ~]# cman_tool services
type             level name       id       state       
fence            0     default    00010001 JOIN_START_WAIT
[1 2]
dlm              1     clvmd      00020002 JOIN_START_WAIT
[1 2]
dlm              1     rgmanager  00030002 JOIN_START_WAIT
[1 2]
dlm              1     pg_fs      00050002 JOIN_START_WAIT
[1 2]
gfs              2     pg_fs      00040002 JOIN_START_WAIT
[1 2]

And on test-db2:
[root@test-db2 ~]# ps ax | grep fence
 3428 ?        Ss     0:00 /sbin/fenced
 8848 pts/0    S+     0:00 grep fence
[root@test-db2 ~]# cman_tool services
type             level name       id       state       
fence            0     default    00010002 JOIN_START_WAIT
[1 2]
dlm              1     clvmd      00020002 JOIN_START_WAIT
[1 2]
dlm              1     rgmanager  00030002 JOIN_START_WAIT
[1 2]
dlm              1     pg_fs      00050002 JOIN_START_WAIT
[1 2]
gfs              2     pg_fs      00040002 JOIN_START_WAIT
[1 2]
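
The same check can be scripted by parsing the `cman_tool services` output and listing every group whose state is not settled. This is a sketch built only on the column layout shown above; it assumes that a settled group reports the state `none`, and the names `parse_services` and `unsettled` are made up for illustration.

```python
def parse_services(text):
    """Parse 'cman_tool services' output into a list of group dicts."""
    groups = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        fields = line.split()
        if fields[:2] == ["type", "level"]:
            continue  # header row
        if line.startswith("["):
            # Member list such as "[1 2]" belongs to the preceding group.
            if groups:
                groups[-1]["members"] = line.strip("[]").split()
            continue
        if len(fields) == 5:
            gtype, level, name, gid, state = fields
            groups.append({"type": gtype, "level": int(level), "name": name,
                           "id": gid, "state": state, "members": []})
    return groups


def unsettled(groups, settled_state="none"):
    """Return (type, name) for every group not in the assumed settled state."""
    return [(g["type"], g["name"]) for g in groups if g["state"] != settled_state]


# Shortened copy of the test-db2 output above.
sample = """\
type             level name       id       state
fence            0     default    00010002 JOIN_START_WAIT
[1 2]
gfs              2     pg_fs      00040002 JOIN_START_WAIT
[1 2]
"""

if __name__ == "__main__":
    print(unsettled(parse_services(sample)))
```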

/ Jonas
-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Jeremy Carroll
Sent: 22 August 2007 15:47
To: linux clustering
Subject: RE:  Node fencing problem

What type of fencing method are you using on your cluster?

Also, can you run "cman_tool services" on both nodes to make sure fenced is running?

-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Borgström Jonas
Sent: Wednesday, August 22, 2007 4:07 AM
To: linux-cluster@xxxxxxxxxx
Subject:  Node fencing problem

Hi,

We're having some problems getting fencing to work as expected on our two-node cluster. 

Our cluster.conf file: http://pastebin.com/m7ac9376d
kernel version: 2.6.18-8.1.8.el5
cman version: 2.0.64-1.0.1.el5

When I simulate a network failure on a node, I expect it to be fenced by the other node, but for some reason that doesn't happen:

Steps to reproduce:
1. Start the cluster
2. Mount a GFS filesystem on both nodes (test-db1 and test-db2)
3. Simulate a net failure on test-db1
    http://pastebin.com/m19fda088

Expected result:
1. Node test-db2 detects that test-db1 failed
2. test-db1 gets fenced by test-db2
3. test-db2 replays the GFS journal (filesystem writable again)
4. Fail over services from test-db1 to test-db2

Actual result:
1. Node test-db2 detects that something happened to test-db1
2. test-db2 replays the GFS journal (filesystem writable again)
3. The service on test-db1 is still listed as started and not failed
   over to test-db2 even though test-db2 thinks test-db1 is "offline".

Log files and debug output from test-db2:
   /var/log/messages after the failure: http://pastebin.com/m2fe4ce36
   "group_tool dump fence" output: http://pastebin.com/m79d21ed9
   clustat output: http://pastebin.com/m4d1007c2

And if I restore network connectivity on test-db1, the filesystem becomes writable on that node as well, which probably results in filesystem corruption.

I think the fencedevice part of cluster.conf is correct, since nodes are sometimes fenced when the cluster is started and one node doesn't join fast enough.

What am I doing wrong?

Regards,
Jonas

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
