[cluster-linux] rejoining cluster after being fenced

"MARY, Mathieu" <Mathieu.MARY@xxxxxxxxxxxxxx> · Mon, 17 Mar 2008 15:12:30 +0100

Hello,

I am currently running a 2-node RHEL 5.1 cluster with openais 0.80.3-13
and cman 2.0.80-1.

Both nodes are hosted on VMware ESX 3.02 servers. Fencing works fine,
but here is my issue:

Whenever I simulate the failure of a node (by shutting down eth0 or
doing a hard reboot), the node is fenced, but it can never rejoin the
cluster. The log on VMClutest01 shows:

Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] entering COMMIT state.
Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] entering RECOVERY state.
Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] position [0] member 10.148.46.50:
Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] previous ring seq 7692 rep 10.148.46.50
Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] aru c high delivered c received flag 1
Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] position [1] member 10.148.46.51:
Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] previous ring seq 7688 rep 10.148.46.51
Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] aru b high delivered b received flag 1
Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] Did not need to originate any messages in recovery.
Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] Sending initial ORF token
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM  ] CLM CONFIGURATION CHANGE
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM  ] New Configuration:
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM  ]      r(0) ip(10.148.46.50)
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM  ] Members Left:
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM  ] Members Joined:
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM  ] CLM CONFIGURATION CHANGE
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM  ] New Configuration:
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM  ]      r(0) ip(10.148.46.50)
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM  ]      r(0) ip(10.148.46.51)
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM  ] Members Left:
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM  ] Members Joined:
Mar 17 14:24:32 VMClutest01 openais[1941]: [CLM  ]      r(0) ip(10.148.46.51)
Mar 17 14:24:32 VMClutest01 openais[1941]: [SYNC ] This node is within the primary component and will provide service.
Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] entering OPERATIONAL state.
Mar 17 14:24:32 VMClutest01 openais[1941]: [MAIN ] Killing node VMClutest02 because it has rejoined the cluster with existing state
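
To spot this condition across a longer log, something like the following Python sketch could be used. It only searches for the "[MAIN ] Killing node ... rejoined the cluster with existing state" message quoted above. It assumes each syslog entry sits on a single line, as it would in the syslog file itself, and the function name `find_killed_nodes` is made up for this illustration.

```python
import re

# Pattern for the openais [MAIN ] message quoted in the log above.
KILL_RE = re.compile(
    r"\[MAIN\s*\]\s+Killing node (?P<node>\S+) because it has rejoined "
    r"the cluster with existing state"
)


def find_killed_nodes(lines):
    """Return the names of nodes killed for rejoining with existing state."""
    killed = []
    for line in lines:
        match = KILL_RE.search(line)
        if match:
            killed.append(match.group("node"))
    return killed


# Two entries taken from the log above, one per line.
sample = [
    "Mar 17 14:24:32 VMClutest01 openais[1941]: [TOTEM] entering OPERATIONAL state.",
    "Mar 17 14:24:32 VMClutest01 openais[1941]: [MAIN ] Killing node VMClutest02 "
    "because it has rejoined the cluster with existing state",
]
print(find_killed_nodes(sample))
```

On a real system the `lines` argument could be an open file handle on the syslog file (assumed here to be wherever openais messages are logged).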

Is there anything to do after a node failure to make the node rejoin
the cluster in a "clean" state?

If I try to cleanly restart node 2 with "shutdown -r now", it hangs
while stopping cluster services.

If I hard reboot node 2, it can never rejoin the cluster, and the log is
the same as above.

My cluster.conf:

<?xml version="1.0"?>
<cluster alias="TestClu01" config_version="9" name="TestClu01">
  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="60"/>
  <clusternodes>
    <clusternode name="VMClutest01" nodeid="1" votes="1">
      <fence>
        <method name="FENCESX">
          <device name="ESX01"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="VMClutest02" nodeid="2" votes="1">
      <fence>
        <method name="FENCESX">
          <device name="ESX02"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman expected_votes="1" two_node="1"/>
  <fencedevices>
    <fencedevice name="ESX01" agent="fence_vi3" ipaddr="10.148.45.206" port="VMClutest01" login="" passwd=" "/>
    <fencedevice name="ESX02" agent="fence_vi3" ipaddr="10.148.45.206" port="VMClutest02" login="" passwd=" "/>
  </fencedevices>
  <rm>
    <failoverdomains>
      <failoverdomain name="AppCluster" ordered="0" restricted="0">
        <failoverdomainnode name="VMClutest01" priority="1"/>
        <failoverdomainnode name="VMClutest02" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      <ip address="10.148.46.55" monitor_link="1"/>
    </resources>
    <service autostart="1" domain="AppCluster" exclusive="0" name="AppServer" recovery="restart">
      <ip ref="10.148.46.55"/>
    </service>
  </rm>
  <totem consensus="4800" join="1000" token="5000" token_retransmits_before_loss_const="20"/>
</cluster>
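
As a basic sanity check of a configuration like the one above, a small Python sketch along these lines could verify that the two-node settings are consistent and that every node's fence device is actually defined. It is not a diagnosis of the rejoin problem. The function name `check_cluster_conf` and the trimmed sample are made up for this illustration, and the checks reflect my understanding that two_node="1" is meant to be paired with expected_votes="1" and exactly two nodes.

```python
import xml.etree.ElementTree as ET


def check_cluster_conf(xml_text):
    """Return a list of problems found in a cluster.conf string (illustrative checks only)."""
    root = ET.fromstring(xml_text)
    problems = []

    nodes = root.findall("./clusternodes/clusternode")
    cman = root.find("cman")
    # Assumption: two_node="1" goes with expected_votes="1" and exactly two nodes.
    if cman is not None and cman.get("two_node") == "1":
        if len(nodes) != 2:
            problems.append("two_node=1 but %d nodes defined" % len(nodes))
        if cman.get("expected_votes") != "1":
            problems.append("two_node=1 requires expected_votes=1")

    # Every <device> used by a node should match a defined <fencedevice>.
    defined = {d.get("name") for d in root.findall("./fencedevices/fencedevice")}
    for node in nodes:
        used = [d.get("name") for d in node.findall("./fence/method/device")]
        if not used:
            problems.append("%s has no fence device" % node.get("name"))
        for name in used:
            if name not in defined:
                problems.append(
                    "%s references undefined device %s" % (node.get("name"), name)
                )
    return problems


# Trimmed copy of the configuration above (credentials and <rm> section omitted).
sample = """<cluster name="TestClu01" config_version="9">
  <clusternodes>
    <clusternode name="VMClutest01" nodeid="1" votes="1">
      <fence><method name="FENCESX"><device name="ESX01"/></method></fence>
    </clusternode>
    <clusternode name="VMClutest02" nodeid="2" votes="1">
      <fence><method name="FENCESX"><device name="ESX02"/></method></fence>
    </clusternode>
  </clusternodes>
  <cman expected_votes="1" two_node="1"/>
  <fencedevices>
    <fencedevice name="ESX01" agent="fence_vi3"/>
    <fencedevice name="ESX02" agent="fence_vi3"/>
  </fencedevices>
</cluster>"""
print(check_cluster_conf(sample))
```

An empty list means none of these particular checks failed; it says nothing about whether fencing or rejoining works in practice.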

Any ideas?

Mathieu

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster