Hi Patrick,

thanks for your reply. I've just discovered that I seem to have the
same problem on one more cluster, so maybe I've changed something that
causes this but did not affect an already running cluster. I'll append
the cluster.conf for the original cluster as well.

On Wed, 2007-02-14 at 14:06 +0000, Patrick Caulfield wrote:
> Frederik Ferner wrote:
> > I've recently run into the problem that in one of my clusters the
> > second node doesn't join the cluster anymore.
> >
> > First some background on my setup here. I have a couple of two-node
> > clusters, each connected to a common storage. They're basically
> > identical setups running RHEL4U4 and the corresponding cluster
> > suite. Everything was running fine until yesterday, when in one of
> > the clusters one node (i04-storage2) was fenced and can't seem to
> > join the cluster anymore. All I could find were messages in the log
> > files of i04-storage2 telling me "kernel: CMAN: sending membership
> > request" over and over again. On the node still in the cluster
> > (i04-storage1) I could see nothing in any log files.
>
> The main reason a node would repeatedly try to rejoin a cluster is
> that it gets told to "wait" by the remaining nodes. This happens when
> the remaining cluster nodes are still in transition state (ie they
> haven't sorted out the cluster after the node has left). Normally
> this state only lasts a fraction of a second, or maybe a handful of
> seconds for a very large cluster.
>
> As you only have one node in the cluster, it sounds like the
> remaining node may be in some strange state that it can't get out of.
> I'm not sure what that would be off-hand...
>
> - it must be able to see the fenced node's 'joinreq' messages,
>   because if you increment the config version it will reject it.

That's what I assumed.

> - it can't even be in transition here for the same reason ... the
>   transition state is checked before the validity of the joinreq
>   message, so the former case would also fail!
>
> Can you check the output of 'cman_tool status' and see what state the
> remaining node is in. It might also be worth sending me the
> 'tcpdump -s0 -x port 6809' output in case that shows anything useful.

See attached file for the tcpdump output.

<snip>
[bnh65367@i04-storage1 log]$ cman_tool status
Protocol version: 5.0.1
Config version: 20
Cluster name: i04-cluster
Cluster ID: 33460
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 1
Expected_votes: 2
Total_votes: 4
Quorum: 3
Active subsystems: 8
Node name: i04-storage1.diamond.ac.uk
Node ID: 1
Node addresses: 172.23.104.33
[bnh65367@i04-storage1 log]$
</snip>

Thanks,
Frederik

--
Frederik Ferner
Systems Administrator          Phone: +44 (0)1235-778624
Diamond Light Source           Fax:   +44 (0)1235-778468
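In case it helps to watch the state change, something like this could
poll 'cman_tool status' and print the fields you asked about. It's just
a quick sketch, not something from our setup: it assumes cman_tool is
in the PATH and keeps printing "Key: value" lines like the ones pasted
above.

<snip>
#!/usr/bin/env python
# Throwaway sketch: run 'cman_tool status' and print the fields Patrick
# asked about.  Assumes cman_tool is in the PATH and emits "Key: value"
# lines like the ones pasted above.
import subprocess

def cman_status():
    out = subprocess.check_output(["cman_tool", "status"]).decode()
    fields = {}
    for line in out.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return fields

if __name__ == "__main__":
    status = cman_status()
    for key in ("Membership state", "Nodes", "Expected_votes",
                "Total_votes", "Quorum"):
        print("%s: %s" % (key, status.get(key, "?")))
</snip>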
Attachment:
i04_tcpdump_s0_port_6809
Description: Binary data
<?xml version="1.0"?>
<cluster alias="i04-cluster" config_version="20" name="i04-cluster">
  <fence_daemon post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="i04-storage1.diamond.ac.uk" votes="1">
      <fence>
        <method name="1">
          <device name="i04-storage1-mon"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="i04-storage2.diamond.ac.uk" votes="1">
      <fence>
        <method name="1">
          <device name="i04-storage2-mon"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman expected_votes="2"/>
  <fencedevices>
    <fencedevice agent="fence_ipmilan" auth="none" ipaddr="172.23.104.43" login="root" name="i04-storage1-mon" passwd="REMEMBERTHIS"/>
    <fencedevice agent="fence_ipmilan" auth="none" ipaddr="172.23.104.44" login="root" name="i04-storage2-mon" passwd="REMEMBERTHIS"/>
  </fencedevices>
  <rm>
    <failoverdomains>
      <failoverdomain name="nfs" ordered="1" restricted="1">
        <failoverdomainnode name="i04-storage1.diamond.ac.uk" priority="1"/>
        <failoverdomainnode name="i04-storage2.diamond.ac.uk" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      <ip address="172.23.104.40" monitor_link="1"/>
      <nfsclient name="i04-net" options="rw" target="172.23.104.0/24"/>
      <nfsclient name="diamond-net" options="rw" target="172.23.0.0/16"/>
      <nfsexport name="nfs_export"/>
      <smb name="i04_data" workgroup="DIAMOND"/>
      <script file="/etc/init.d/start_stop_quest_smb.sh" name="quest-samba"/>
      <clusterfs device="/dev/mapper/I04Data-data" force_unmount="0" fsid="10401" fstype="gfs" mountpoint="/exports/data" name="gfs_data" options="acl"/>
    </resources>
    <service autostart="1" domain="nfs" name="nfs_exports1">
      <clusterfs ref="gfs_data">
        <nfsexport ref="nfs_export">
          <nfsclient ref="diamond-net"/>
        </nfsexport>
        <ip ref="172.23.104.40">
          <script ref="quest-samba"/>
        </ip>
      </clusterfs>
    </service>
  </rm>
  <quorumd interval="1" tko="10" votes="3" log_level="9" log_facility="local4" status_file="/tmp/qdisk_status" device="/dev/emcpowerq1">
    <heuristic program="ping 172.23.4.254 -c1 -t2" score="1" interval="2"/>
    <heuristic program="ping 172.23.104.254 -c1 -t1" score="2" interval="2"/>
    <heuristic program="ping 172.23.5.120 -c1 -t2" score="1" interval="2"/>
    <heuristic program="ping 172.23.104.32 -c1 -t1" score="1" interval="2"/>
    <heuristic program="ping 172.23.104.35 -c1 -t1" score="1" interval="2"/>
    <heuristic program="ping 172.23.104.38 -c1 -t1" score="1" interval="2"/>
    <heuristic program="ping 172.23.104.39 -c1 -t1" score="1" interval="2"/>
  </quorumd>
</cluster>
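As far as I can tell, the vote numbers in the cman_tool output above
are just the quorum disk: the two nodes contribute one vote each and
the quorumd block adds three more. A quick back-of-the-envelope check
(assuming quorum is a simple majority of the expected votes including
the qdisk; I haven't verified this against the cman source):

<snip>
# Back-of-the-envelope check of the numbers in 'cman_tool status' above.
# Assumption (not verified against the cman source): quorum is a simple
# majority of the expected votes including the quorum disk.
node_votes  = 1 + 1                      # two clusternode entries, votes="1" each
qdisk_votes = 3                          # <quorumd ... votes="3">
expected    = node_votes + qdisk_votes   # 5
quorum      = expected // 2 + 1          # 3 -> matches "Quorum: 3"
total_now   = 1 + qdisk_votes            # 4 -> matches "Total_votes: 4"
                                         #      (only i04-storage1 is a member)
print(quorum, total_now)
</snip>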
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster