On 16/06/14 07:43 AM, Le Trung Kien wrote:
Hello everyone,
I'm new to Linux clustering. I have built a two-node cluster (without qdisk) consisting of:
Redhat 6.4
cman
pacemaker
gfs2
My cluster can fail over (back and forth) between the two nodes for these 3 resources: ClusterIP, WebFS (a Filesystem resource mounting the GFS2 volume /dev/sdc on /mnt/gfs2_storage), and WebSite (the apache service).
My problem occurs when I stop and start the nodes in the following order (starting with both nodes up):
1. Stop node1 (shutdown) -> all resources fail over to node2 -> all resources keep working on node2
2. Stop node2 (stop services: pacemaker, then cman) -> all resources stop (of course)
3. Start node1 (start services: cman, then pacemaker) -> only ClusterIP starts; WebFS fails; WebSite does not start
Status:
Last updated: Mon Jun 16 18:34:56 2014
Last change: Mon Jun 16 14:24:54 2014 via cibadmin on server1
Stack: cman
Current DC: server1 - partition WITHOUT quorum
Version: 1.1.8-7.el6-394e906
2 Nodes configured, 1 expected votes
4 Resources configured.
Online: [ server1 ]
OFFLINE: [ server2 ]
ClusterIP (ocf::heartbeat:IPaddr2): Started server1
WebFS (ocf::heartbeat:Filesystem): Started server1 (unmanaged) FAILED
Failed actions:
WebFS_stop_0 (node=server1, call=32, rc=1, status=Timed Out): unknown error
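As an aside, once the underlying cause of a failed stop is fixed, the recorded failure usually has to be cleared before pacemaker will manage the resource again. A sketch of what that looks like with the crm shell (assuming the resource name WebFS from the status above):

```shell
# Clear the recorded failure so pacemaker re-probes the resource
crm resource cleanup WebFS

# Put it back under cluster management if it is still unmanaged
crm resource manage WebFS
```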
Here is my /etc/cluster/cluster.conf:
<?xml version="1.0"?>
<cluster config_version="1" name="mycluster">
  <logging debug="on"/>
  <clusternodes>
    <clusternode name="server1" nodeid="1">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="server1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="server2" nodeid="2">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="server2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="pcmk" agent="fence_pcmk"/>
  </fencedevices>
</cluster>
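(If you change this file, it is worth sanity-checking the cman side of the configuration; on RHEL 6 that can be done with:

```shell
# Validate /etc/cluster/cluster.conf against the cluster schema
ccs_config_validate
```
)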
Here is my 'crm configure show' output:
<snip>
stonith-enabled=false \
Well, this is a problem.
When cman detects a failure (well, corosync does, but cman is told), it
initiates a fence request. The fence daemon, fenced, informs DLM, which
blocks. Then fenced calls the configured 'fence_pcmk' agent, which just
passes the request up to pacemaker.
Without stonith configured in pacemaker, the fence will fail, of
course. Thus DLM sits blocked, so GFS2 (and clustered LVM) hang, by
design.
If you configure proper fencing in pacemaker (and test it to make sure
it works), then pacemaker *would* succeed in fencing and return success
to fence_pcmk. fenced is then told that the fence succeeded, DLM cleans
up the lost locks, and the cluster returns to normal operation.
So please configure and test real stonith in pacemaker and see if your
problem is resolved.
--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster