Re: node fails to join cluster after it was fenced

Hi Patrick,

thanks for your reply.

I've just discovered that I seem to have the same problem on one more
cluster, so maybe I've changed something that causes this but didn't
affect the clusters while they were still running. I'll append the
cluster.conf for the original cluster as well.
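
For completeness, the way I normally change the config is to bump
config_version in cluster.conf and push it out roughly like this (typed
from memory, so the exact syntax may be slightly off):

<snip>
[root@i04-storage1 ~]# ccs_tool update /etc/cluster/cluster.conf
[root@i04-storage1 ~]# cman_tool version -r 20
</snip>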

On Wed, 2007-02-14 at 14:06 +0000, Patrick Caulfield wrote:
> Frederik Ferner wrote:
> > I've recently run into the problem that in one of my clusters the second
> > node doesn't join the cluster anymore.
> > 
> > First some background on my setup here. I have a couple of two node
> > clusters, each connected to shared storage. They're basically
> > identical setups running RHEL4U4 and the corresponding cluster suite.
> > Everything was running fine until yesterday, when in one of the
> > clusters one node (i04-storage2) was fenced and now can't seem to join
> > the cluster anymore. All I could find were messages in the log files of
> > i04-storage2 telling me "kernel: CMAN: sending membership request" over
> > and over again. On the node still in the cluster (i04-storage1) I could
> > see nothing in any log files.

> The main reason a node would repeatedly try to rejoin a cluster is that it gets
> told to "wait" by the remaining nodes. This happens when the remaining cluster
> nodes are still in transition state (i.e. they haven't sorted out the cluster
> after the node has left). Normally this state only lasts a fraction of a second
> or maybe a handful of seconds for a very large cluster.
> 
> As you only have one node in the cluster, it sounds like the remaining node may
> be in some strange state that it can't get out of. I'm not sure what that would
> be off-hand...
> 
> - it must be able to see the fenced node's 'joinreq' messages, because if you
> increment the config version it will reject it.

That's what I assumed.
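
For reference, this is roughly how I've been checking that the config
versions on the two nodes match (nothing fancy, just a grep on each box):

<snip>
[root@i04-storage1 ~]# grep config_version /etc/cluster/cluster.conf
[root@i04-storage2 ~]# grep config_version /etc/cluster/cluster.conf
</snip>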

> - it can't even be in transition here for the same reason ... the transition
> state is checked before the validity of the joinreq message so the former case
> would also fail!
> 
> Can you check the output of 'cman_tool status' and see what state the remaining
> node is in. It might also be worth sending me the 'tcpdump -s0 -x port 6809'
> output in case that shows anything useful.

See attached file for tcpdump output.
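
In case it matters how the capture was taken: I ran it on i04-storage1
more or less like this (interface name from memory, so it may well have
been a different one):

<snip>
[root@i04-storage1 ~]# tcpdump -i eth0 -s0 -x port 6809 > i04_tcpdump_s0_port_6809
</snip>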

<snip>
[bnh65367@i04-storage1 log]$ cman_tool status
Protocol version: 5.0.1
Config version: 20
Cluster name: i04-cluster
Cluster ID: 33460
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 1
Expected_votes: 2
Total_votes: 4
Quorum: 3
Active subsystems: 8
Node name: i04-storage1.diamond.ac.uk
Node ID: 1
Node addresses: 172.23.104.33

[bnh65367@i04-storage1 log]$
</snip>
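
If I'm reading the vote arithmetic right (please correct me if not), the
numbers above seem to add up once the quorum disk's 3 votes are included:

  Total_votes = 1 (i04-storage1) + 3 (qdisk)              = 4
  Quorum      = (2 node votes + 3 qdisk votes)/2 + 1      = 3  (integer division)

so the remaining node should be quorate on its own.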

Thanks,
Frederik

-- 
Frederik Ferner 
Systems Administrator                  Phone: +44 (0)1235-778624
Diamond Light Source                   Fax:   +44 (0)1235-778468

Attachment: i04_tcpdump_s0_port_6809
Description: Binary data

<?xml version="1.0"?>
<cluster alias="i04-cluster" config_version="20" name="i04-cluster">
	<fence_daemon post_fail_delay="0" post_join_delay="3"/>
	<clusternodes>
		<clusternode name="i04-storage1.diamond.ac.uk" votes="1">
			<fence>
				<method name="1">
					<device name="i04-storage1-mon"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="i04-storage2.diamond.ac.uk" votes="1">
			<fence>
				<method name="1">
					<device name="i04-storage2-mon"/>
				</method>
			</fence>
		</clusternode>
	</clusternodes>
	<cman expected_votes="2"/>
	<fencedevices>
		<fencedevice agent="fence_ipmilan" auth="none" ipaddr="172.23.104.43" login="root" name="i04-storage1-mon" passwd="REMEMBERTHIS"/>
		<fencedevice agent="fence_ipmilan" auth="none" ipaddr="172.23.104.44" login="root" name="i04-storage2-mon" passwd="REMEMBERTHIS"/>
	</fencedevices>
	<rm>
		<failoverdomains>
			<failoverdomain name="nfs" ordered="1" restricted="1">
				<failoverdomainnode name="i04-storage1.diamond.ac.uk" priority="1"/>
				<failoverdomainnode name="i04-storage2.diamond.ac.uk" priority="1"/>
			</failoverdomain>
		</failoverdomains>
		<resources>
			<ip address="172.23.104.40" monitor_link="1"/>
			<nfsclient name="i04-net" options="rw" target="172.23.104.0/24"/>
			<nfsclient name="diamond-net" options="rw" target="172.23.0.0/16"/>
			<nfsexport name="nfs_export"/>
			<smb name="i04_data" workgroup="DIAMOND"/>
			<script file="/etc/init.d/start_stop_quest_smb.sh" name="quest-samba"/>
			<clusterfs device="/dev/mapper/I04Data-data" force_unmount="0" fsid="10401" fstype="gfs" mountpoint="/exports/data" name="gfs_data" options="acl"/>
		</resources>
		<service autostart="1" domain="nfs" name="nfs_exports1">
			<clusterfs ref="gfs_data">
				<nfsexport ref="nfs_export">
					<nfsclient ref="diamond-net"/>
				</nfsexport>
				<ip ref="172.23.104.40">
					<script ref="quest-samba"/>
				</ip>
			</clusterfs>
		</service>
	</rm>
	<quorumd interval="1" tko="10" votes="3" log_level="9" log_facility="local4" status_file="/tmp/qdisk_status" device="/dev/emcpowerq1">
		<heuristic program="ping 172.23.4.254 -c1 -t2" score="1" interval="2"/>
		<heuristic program="ping 172.23.104.254 -c1 -t1" score="2" interval="2"/>
		<heuristic program="ping 172.23.5.120 -c1 -t2" score="1" interval="2"/>
		<heuristic program="ping 172.23.104.32 -c1 -t1" score="1" interval="2"/>
		<heuristic program="ping 172.23.104.35 -c1 -t1" score="1" interval="2"/>
		<heuristic program="ping 172.23.104.38 -c1 -t1" score="1" interval="2"/>
		<heuristic program="ping 172.23.104.39 -c1 -t1" score="1" interval="2"/>
	</quorumd>
</cluster>
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
