hi,

On Mon, Jan 5, 2009 at 8:23 AM, Rajagopal Swaminathan
<raju.rajsand@xxxxxxxxx> wrote:
> Greetings,
>
> On Sat, Jan 3, 2009 at 4:18 AM, Paras pradhan <pradhanparas@xxxxxxxxx> wrote:
>>
>> Here I am using 4 nodes.
>>
>> Node 1) Runs luci
>> Node 2) My iSCSI shared storage, where my virtual machine(s) reside
>> Node 3) First node in my two-node cluster
>> Node 4) Second node in my two-node cluster
>>
>> All of them are connected to a single unmanaged 16-port switch.
>
> Luci does not require a separate node to run; it can run on one of the
> member nodes (node 3 or 4).

OK.

> What does clustat say?

Here is my clustat output:

-----------
[root@ha1lx ~]# clustat
Cluster Status for ipmicluster @ Mon Jan  5 12:00:10 2009
Member Status: Quorate

 Member Name                    ID   Status
 ------ ----                    ---- ------
 10.42.21.29                    1    Online, rgmanager
 10.42.21.27                    2    Online, Local, rgmanager

 Service Name                   Owner (Last)        State
 ------- ----                   ----- ------        -----
 vm:linux64                     10.42.21.27         started
[root@ha1lx ~]#
------------------------

10.42.21.27 is node3 and 10.42.21.29 is node4.

> Can you post your cluster.conf here?
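(Side note for the archives: clustat reports "Quorate" here with both nodes up, but it is worth remembering why two-node clusters are special. Under the standard quorum formula, floor(votes/2)+1, a two-node cluster could never survive the loss of a member, which is why cman's two_node mode waives the majority rule. A minimal sketch of that arithmetic, assuming one vote per node:)

```shell
# Standard cman quorum arithmetic (sketch). With two_node="1" cman
# special-cases this so a single surviving node stays quorate.
total_votes=2                      # one vote per node, as configured
quorum=$(( total_votes / 2 + 1 ))  # standard formula: floor(votes/2)+1
echo "standard quorum: $quorum"    # 2 -> losing either node would lose quorum
```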
Here is my cluster.conf:

--
[root@ha1lx cluster]# more cluster.conf
<?xml version="1.0"?>
<cluster alias="ipmicluster" config_version="8" name="ipmicluster">
    <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
    <clusternodes>
        <clusternode name="10.42.21.29" nodeid="1" votes="1">
            <fence>
                <method name="1">
                    <device name="fence2"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="10.42.21.27" nodeid="2" votes="1">
            <fence>
                <method name="1">
                    <device name="fence1"/>
                </method>
            </fence>
        </clusternode>
    </clusternodes>
    <cman expected_votes="1" two_node="1"/>
    <fencedevices>
        <fencedevice agent="fence_ipmilan" ipaddr="10.42.21.28" login="admin" name="fence1" passwd="admin"/>
        <fencedevice agent="fence_ipmilan" ipaddr="10.42.21.30" login="admin" name="fence2" passwd="admin"/>
    </fencedevices>
    <rm>
        <failoverdomains>
            <failoverdomain name="myfd" nofailback="0" ordered="1" restricted="0">
                <failoverdomainnode name="10.42.21.29" priority="2"/>
                <failoverdomainnode name="10.42.21.27" priority="1"/>
            </failoverdomain>
        </failoverdomains>
        <resources/>
        <vm autostart="1" domain="myfd" exclusive="0" migrate="live" name="linux64" path="/guest_roots" recovery="restart"/>
    </rm>
</cluster>
------

Here:
10.42.21.28 is the IPMI interface on node3
10.42.21.30 is the IPMI interface on node4

> When you pull out the network cable *and* plug it back in on, say, node 3,
> what messages appear in /var/log/messages on Node 4 (if any)?
> (Sorry for the repetition, but the messages are necessary here to make any
> sense of the situation.)

OK, here is the log on node 4 after I disconnect the network cable on node3:

-----------
Jan  5 12:05:24 ha2lx openais[4988]: [TOTEM] The token was lost in the OPERATIONAL state.
Jan  5 12:05:24 ha2lx openais[4988]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Jan  5 12:05:24 ha2lx openais[4988]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Jan  5 12:05:24 ha2lx openais[4988]: [TOTEM] entering GATHER state from 2.
Jan  5 12:05:28 ha2lx openais[4988]: [TOTEM] entering GATHER state from 0.
Jan  5 12:05:28 ha2lx openais[4988]: [TOTEM] Creating commit token because I am the rep.
Jan  5 12:05:28 ha2lx openais[4988]: [TOTEM] Saving state aru 76 high seq received 76
Jan  5 12:05:28 ha2lx openais[4988]: [TOTEM] Storing new sequence id for ring ac
Jan  5 12:05:28 ha2lx openais[4988]: [TOTEM] entering COMMIT state.
Jan  5 12:05:28 ha2lx openais[4988]: [TOTEM] entering RECOVERY state.
Jan  5 12:05:28 ha2lx openais[4988]: [TOTEM] position [0] member 10.42.21.29:
Jan  5 12:05:28 ha2lx openais[4988]: [TOTEM] previous ring seq 168 rep 10.42.21.27
Jan  5 12:05:28 ha2lx openais[4988]: [TOTEM] aru 76 high delivered 76 received flag 1
Jan  5 12:05:28 ha2lx openais[4988]: [TOTEM] Did not need to originate any messages in recovery.
Jan  5 12:05:28 ha2lx openais[4988]: [TOTEM] Sending initial ORF token
Jan  5 12:05:28 ha2lx openais[4988]: [CLM  ] CLM CONFIGURATION CHANGE
Jan  5 12:05:28 ha2lx openais[4988]: [CLM  ] New Configuration:
Jan  5 12:05:28 ha2lx openais[4988]: [CLM  ]    r(0) ip(10.42.21.29)
Jan  5 12:05:28 ha2lx openais[4988]: [CLM  ] Members Left:
Jan  5 12:05:28 ha2lx openais[4988]: [CLM  ]    r(0) ip(10.42.21.27)
Jan  5 12:05:28 ha2lx openais[4988]: [CLM  ] Members Joined:
Jan  5 12:05:28 ha2lx openais[4988]: [CLM  ] CLM CONFIGURATION CHANGE
Jan  5 12:05:28 ha2lx kernel: dlm: closing connection to node 2
Jan  5 12:05:28 ha2lx openais[4988]: [CLM  ] New Configuration:
Jan  5 12:05:28 ha2lx fenced[5004]: 10.42.21.27 not a cluster member after 0 sec post_fail_delay
Jan  5 12:05:28 ha2lx openais[4988]: [CLM  ]    r(0) ip(10.42.21.29)
Jan  5 12:05:28 ha2lx kernel: GFS2: fsid=ipmicluster:guest_roots.0: jid=1: Trying to acquire journal lock...
Jan  5 12:05:28 ha2lx openais[4988]: [CLM  ] Members Left:
Jan  5 12:05:28 ha2lx openais[4988]: [CLM  ] Members Joined:
Jan  5 12:05:28 ha2lx openais[4988]: [SYNC ] This node is within the primary component and will provide service.
Jan  5 12:05:28 ha2lx openais[4988]: [TOTEM] entering OPERATIONAL state.
Jan  5 12:05:28 ha2lx openais[4988]: [CLM  ] got nodejoin message 10.42.21.29
Jan  5 12:05:28 ha2lx openais[4988]: [CPG  ] got joinlist message from node 1
Jan  5 12:05:28 ha2lx kernel: GFS2: fsid=ipmicluster:guest_roots.0: jid=1: Looking at journal...
Jan  5 12:05:29 ha2lx kernel: GFS2: fsid=ipmicluster:guest_roots.0: jid=1: Acquiring the transaction lock...
Jan  5 12:05:29 ha2lx kernel: GFS2: fsid=ipmicluster:guest_roots.0: jid=1: Replaying journal...
Jan  5 12:05:29 ha2lx kernel: GFS2: fsid=ipmicluster:guest_roots.0: jid=1: Replayed 0 of 0 blocks
Jan  5 12:05:29 ha2lx kernel: GFS2: fsid=ipmicluster:guest_roots.0: jid=1: Found 0 revoke tags
Jan  5 12:05:29 ha2lx kernel: GFS2: fsid=ipmicluster:guest_roots.0: jid=1: Journal replayed in 1s
Jan  5 12:05:29 ha2lx kernel: GFS2: fsid=ipmicluster:guest_roots.0: jid=1: Done
------------------

Now, when I plug the cable back into node3, node 4 reboots. Here is the quickly grabbed log from node4:

--
Jan  5 12:07:12 ha2lx openais[4988]: [TOTEM] entering GATHER state from 11.
Jan  5 12:07:12 ha2lx openais[4988]: [TOTEM] Saving state aru 1d high seq received 1d
Jan  5 12:07:12 ha2lx openais[4988]: [TOTEM] Storing new sequence id for ring b0
Jan  5 12:07:12 ha2lx openais[4988]: [TOTEM] entering COMMIT state.
Jan  5 12:07:12 ha2lx openais[4988]: [TOTEM] entering RECOVERY state.
Jan  5 12:07:12 ha2lx openais[4988]: [TOTEM] position [0] member 10.42.21.27:
Jan  5 12:07:12 ha2lx openais[4988]: [TOTEM] previous ring seq 172 rep 10.42.21.27
Jan  5 12:07:12 ha2lx openais[4988]: [TOTEM] aru 16 high delivered 16 received flag 1
Jan  5 12:07:12 ha2lx openais[4988]: [TOTEM] position [1] member 10.42.21.29:
Jan  5 12:07:12 ha2lx openais[4988]: [TOTEM] previous ring seq 172 rep 10.42.21.29
Jan  5 12:07:12 ha2lx openais[4988]: [TOTEM] aru 1d high delivered 1d received flag 1
Jan  5 12:07:12 ha2lx openais[4988]: [TOTEM] Did not need to originate any messages in recovery.
Jan  5 12:07:12 ha2lx openais[4988]: [CLM  ] CLM CONFIGURATION CHANGE
Jan  5 12:07:12 ha2lx openais[4988]: [CLM  ] New Configuration:
Jan  5 12:07:12 ha2lx openais[4988]: [CLM  ]    r(0) ip(10.42.21.29)
Jan  5 12:07:12 ha2lx openais[4988]: [CLM  ] Members Left:
Jan  5 12:07:12 ha2lx openais[4988]: [CLM  ] Members Joined:
Jan  5 12:07:12 ha2lx openais[4988]: [CLM  ] CLM CONFIGURATION CHANGE
Jan  5 12:07:12 ha2lx openais[4988]: [CLM  ] New Configuration:
Jan  5 12:07:12 ha2lx openais[4988]: [CLM  ]    r(0) ip(10.42.21.27)
Jan  5 12:07:12 ha2lx openais[4988]: [CLM  ]    r(0) ip(10.42.21.29)
Jan  5 12:07:12 ha2lx openais[4988]: [CLM  ] Members Left:
Jan  5 12:07:12 ha2lx openais[4988]: [CLM  ] Members Joined:
Jan  5 12:07:12 ha2lx openais[4988]: [CLM  ]    r(0) ip(10.42.21.27)
Jan  5 12:07:12 ha2lx openais[4988]: [SYNC ] This node is within the primary component and will provide service.
Jan  5 12:07:12 ha2lx openais[4988]: [TOTEM] entering OPERATIONAL state.
Jan  5 12:07:12 ha2lx openais[4988]: [MAIN ] Killing node 10.42.21.27 because it has rejoined the cluster with existing state
Jan  5 12:07:12 ha2lx openais[4988]: [CMAN ] cman killed by node 2 because we rejoined the cluster without a full restart
Jan  5 12:07:12 ha2lx gfs_controld[5016]: groupd_dispatch error -1 errno 11
Jan  5 12:07:12 ha2lx gfs_controld[5016]: groupd connection died
Jan  5 12:07:12 ha2lx gfs_controld[5016]: cluster is down, exiting
Jan  5 12:07:12 ha2lx dlm_controld[5010]: cluster is down, exiting
Jan  5 12:07:12 ha2lx kernel: dlm: closing connection to node 1
Jan  5 12:07:12 ha2lx fenced[5004]: cluster is down, exiting
-------

Also, here is the log from node3:

--
[root@ha1lx ~]# tail -f /var/log/messages
Jan  5 12:07:24 ha1lx openais[26029]: [TOTEM] entering OPERATIONAL state.
Jan  5 12:07:24 ha1lx openais[26029]: [CLM  ] got nodejoin message 10.42.21.27
Jan  5 12:07:24 ha1lx openais[26029]: [CLM  ] got nodejoin message 10.42.21.27
Jan  5 12:07:24 ha1lx openais[26029]: [CPG  ] got joinlist message from node 2
Jan  5 12:07:27 ha1lx ccsd[26019]: Attempt to close an unopened CCS descriptor (4520670).
Jan  5 12:07:27 ha1lx ccsd[26019]: Error while processing disconnect: Invalid request descriptor
Jan  5 12:07:27 ha1lx fenced[26045]: fence "10.42.21.29" success
Jan  5 12:07:27 ha1lx kernel: GFS2: fsid=ipmicluster:guest_roots.1: jid=0: Trying to acquire journal lock...
Jan  5 12:07:27 ha1lx kernel: GFS2: fsid=ipmicluster:guest_roots.1: jid=0: Looking at journal...
Jan  5 12:07:28 ha1lx kernel: GFS2: fsid=ipmicluster:guest_roots.1: jid=0: Done
----------------

> HTH
>
> With warm regards
>
> Rajagopal
>
> --
> Linux-cluster mailing list
> Linux-cluster@xxxxxxxxxx
> https://www.redhat.com/mailman/listinfo/linux-cluster

Thanks a lot
Paras.

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
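(For anyone hitting the same reboot-on-replug symptom: the telltale entry is the openais [MAIN] kill message in the node4 log above, which means a node came back from a network split without a full restart and the surviving side refused its stale ring state. A minimal sketch of scanning a saved log for it; the sample file path is illustrative only:)

```shell
# Reproduce one of the node4 log lines in a scratch file (illustrative path).
cat > /tmp/messages.sample <<'EOF'
Jan  5 12:07:12 ha2lx openais[4988]: [MAIN ] Killing node 10.42.21.27 because it has rejoined the cluster with existing state
EOF

# A node killed with this message rejoined after a split without being
# cleanly restarted (e.g. fencing did not actually power-cycle it).
if grep -q 'rejoined the cluster with existing state' /tmp/messages.sample; then
    echo "found: node killed for rejoining with stale state"
fi
```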