On Mon, Jan 5, 2009 at 12:11 PM, Paras pradhan <pradhanparas@xxxxxxxxx> wrote:
> Hi,
>
> On Mon, Jan 5, 2009 at 8:23 AM, Rajagopal Swaminathan
> <raju.rajsand@xxxxxxxxx> wrote:
>> Greetings,
>>
>> On Sat, Jan 3, 2009 at 4:18 AM, Paras pradhan <pradhanparas@xxxxxxxxx> wrote:
>>>
>>> Here I am using 4 nodes.
>>>
>>> Node 1) runs luci
>>> Node 2) iSCSI shared storage, where my virtual machine(s) reside
>>> Node 3) first node in my two-node cluster
>>> Node 4) second node in my two-node cluster
>>>
>>> All of them are connected to a single unmanaged 16-port switch.
>>
>> Luci does not require a separate node; it can run on one of the
>> member nodes (node 3 or node 4).
>
> OK.
>
>> What does clustat say?
>
> Here is my clustat output:
>
> -----------
>
> [root@ha1lx ~]# clustat
> Cluster Status for ipmicluster @ Mon Jan 5 12:00:10 2009
> Member Status: Quorate
>
>  Member Name                        ID   Status
>  ------ ----                        ---- ------
>  10.42.21.29                        1    Online, rgmanager
>  10.42.21.27                        2    Online, Local, rgmanager
>
>  Service Name          Owner (Last)          State
>  ------- ----          ----- ------          -----
>  vm:linux64            10.42.21.27           started
> [root@ha1lx ~]#
> ------------------------
>
> 10.42.21.27 is node3 and 10.42.21.29 is node4.
>
>> Can you post your cluster.conf here?
>
> Here is my cluster.conf:
>
> --
> [root@ha1lx cluster]# more cluster.conf
> <?xml version="1.0"?>
> <cluster alias="ipmicluster" config_version="8" name="ipmicluster">
>     <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
>     <clusternodes>
>         <clusternode name="10.42.21.29" nodeid="1" votes="1">
>             <fence>
>                 <method name="1">
>                     <device name="fence2"/>
>                 </method>
>             </fence>
>         </clusternode>
>         <clusternode name="10.42.21.27" nodeid="2" votes="1">
>             <fence>
>                 <method name="1">
>                     <device name="fence1"/>
>                 </method>
>             </fence>
>         </clusternode>
>     </clusternodes>
>     <cman expected_votes="1" two_node="1"/>
>     <fencedevices>
>         <fencedevice agent="fence_ipmilan" ipaddr="10.42.21.28" login="admin" name="fence1" passwd="admin"/>
>         <fencedevice agent="fence_ipmilan" ipaddr="10.42.21.30" login="admin" name="fence2" passwd="admin"/>
>     </fencedevices>
>     <rm>
>         <failoverdomains>
>             <failoverdomain name="myfd" nofailback="0" ordered="1" restricted="0">
>                 <failoverdomainnode name="10.42.21.29" priority="2"/>
>                 <failoverdomainnode name="10.42.21.27" priority="1"/>
>             </failoverdomain>
>         </failoverdomains>
>         <resources/>
>         <vm autostart="1" domain="myfd" exclusive="0" migrate="live" name="linux64" path="/guest_roots" recovery="restart"/>
>     </rm>
> </cluster>
> ------
>
> Here:
>
> 10.42.21.28 is the IPMI interface on node3
> 10.42.21.30 is the IPMI interface on node4
>
>> When you pull out the network cable *and* plug it back in on, say, node 3,
>> what messages appear in /var/log/messages on node 4 (if any)?
>> (Sorry for the repetition, but the messages are necessary here to make any
>> sense of the situation.)
>
> OK, here is the log on node4 after I disconnect the network cable on node3:
>
> -----------
>
> Jan 5 12:05:24 ha2lx openais[4988]: [TOTEM] The token was lost in the OPERATIONAL state.
> Jan 5 12:05:24 ha2lx openais[4988]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
> Jan 5 12:05:24 ha2lx openais[4988]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
> Jan 5 12:05:24 ha2lx openais[4988]: [TOTEM] entering GATHER state from 2.
> Jan 5 12:05:28 ha2lx openais[4988]: [TOTEM] entering GATHER state from 0.
> Jan 5 12:05:28 ha2lx openais[4988]: [TOTEM] Creating commit token because I am the rep.
> Jan 5 12:05:28 ha2lx openais[4988]: [TOTEM] Saving state aru 76 high seq received 76
> Jan 5 12:05:28 ha2lx openais[4988]: [TOTEM] Storing new sequence id for ring ac
> Jan 5 12:05:28 ha2lx openais[4988]: [TOTEM] entering COMMIT state.
> Jan 5 12:05:28 ha2lx openais[4988]: [TOTEM] entering RECOVERY state.
> Jan 5 12:05:28 ha2lx openais[4988]: [TOTEM] position [0] member 10.42.21.29:
> Jan 5 12:05:28 ha2lx openais[4988]: [TOTEM] previous ring seq 168 rep 10.42.21.27
> Jan 5 12:05:28 ha2lx openais[4988]: [TOTEM] aru 76 high delivered 76 received flag 1
> Jan 5 12:05:28 ha2lx openais[4988]: [TOTEM] Did not need to originate any messages in recovery.
> Jan 5 12:05:28 ha2lx openais[4988]: [TOTEM] Sending initial ORF token
> Jan 5 12:05:28 ha2lx openais[4988]: [CLM ] CLM CONFIGURATION CHANGE
> Jan 5 12:05:28 ha2lx openais[4988]: [CLM ] New Configuration:
> Jan 5 12:05:28 ha2lx openais[4988]: [CLM ] 	r(0) ip(10.42.21.29)
> Jan 5 12:05:28 ha2lx openais[4988]: [CLM ] Members Left:
> Jan 5 12:05:28 ha2lx openais[4988]: [CLM ] 	r(0) ip(10.42.21.27)
> Jan 5 12:05:28 ha2lx openais[4988]: [CLM ] Members Joined:
> Jan 5 12:05:28 ha2lx openais[4988]: [CLM ] CLM CONFIGURATION CHANGE
> Jan 5 12:05:28 ha2lx kernel: dlm: closing connection to node 2
> Jan 5 12:05:28 ha2lx openais[4988]: [CLM ] New Configuration:
> Jan 5 12:05:28 ha2lx fenced[5004]: 10.42.21.27 not a cluster member after 0 sec post_fail_delay
> Jan 5 12:05:28 ha2lx openais[4988]: [CLM ] 	r(0) ip(10.42.21.29)
> Jan 5 12:05:28 ha2lx kernel: GFS2: fsid=ipmicluster:guest_roots.0: jid=1: Trying to acquire journal lock...
> Jan 5 12:05:28 ha2lx openais[4988]: [CLM ] Members Left:
> Jan 5 12:05:28 ha2lx openais[4988]: [CLM ] Members Joined:
> Jan 5 12:05:28 ha2lx openais[4988]: [SYNC ] This node is within the primary component and will provide service.
> Jan 5 12:05:28 ha2lx openais[4988]: [TOTEM] entering OPERATIONAL state.
> Jan 5 12:05:28 ha2lx openais[4988]: [CLM ] got nodejoin message 10.42.21.29
> Jan 5 12:05:28 ha2lx openais[4988]: [CPG ] got joinlist message from node 1
> Jan 5 12:05:28 ha2lx kernel: GFS2: fsid=ipmicluster:guest_roots.0: jid=1: Looking at journal...
> Jan 5 12:05:29 ha2lx kernel: GFS2: fsid=ipmicluster:guest_roots.0: jid=1: Acquiring the transaction lock...
> Jan 5 12:05:29 ha2lx kernel: GFS2: fsid=ipmicluster:guest_roots.0: jid=1: Replaying journal...
> Jan 5 12:05:29 ha2lx kernel: GFS2: fsid=ipmicluster:guest_roots.0: jid=1: Replayed 0 of 0 blocks
> Jan 5 12:05:29 ha2lx kernel: GFS2: fsid=ipmicluster:guest_roots.0: jid=1: Found 0 revoke tags
> Jan 5 12:05:29 ha2lx kernel: GFS2: fsid=ipmicluster:guest_roots.0: jid=1: Journal replayed in 1s
> Jan 5 12:05:29 ha2lx kernel: GFS2: fsid=ipmicluster:guest_roots.0: jid=1: Done
> ------------------
>
> Now, when I plug the cable back into node3, node4 reboots. Here is the
> quickly grabbed log on node4:
>
> --
> Jan 5 12:07:12 ha2lx openais[4988]: [TOTEM] entering GATHER state from 11.
> Jan 5 12:07:12 ha2lx openais[4988]: [TOTEM] Saving state aru 1d high seq received 1d
> Jan 5 12:07:12 ha2lx openais[4988]: [TOTEM] Storing new sequence id for ring b0
> Jan 5 12:07:12 ha2lx openais[4988]: [TOTEM] entering COMMIT state.
> Jan 5 12:07:12 ha2lx openais[4988]: [TOTEM] entering RECOVERY state.
> Jan 5 12:07:12 ha2lx openais[4988]: [TOTEM] position [0] member 10.42.21.27:
> Jan 5 12:07:12 ha2lx openais[4988]: [TOTEM] previous ring seq 172 rep 10.42.21.27
> Jan 5 12:07:12 ha2lx openais[4988]: [TOTEM] aru 16 high delivered 16 received flag 1
> Jan 5 12:07:12 ha2lx openais[4988]: [TOTEM] position [1] member 10.42.21.29:
> Jan 5 12:07:12 ha2lx openais[4988]: [TOTEM] previous ring seq 172 rep 10.42.21.29
> Jan 5 12:07:12 ha2lx openais[4988]: [TOTEM] aru 1d high delivered 1d received flag 1
> Jan 5 12:07:12 ha2lx openais[4988]: [TOTEM] Did not need to originate any messages in recovery.
> Jan 5 12:07:12 ha2lx openais[4988]: [CLM ] CLM CONFIGURATION CHANGE
> Jan 5 12:07:12 ha2lx openais[4988]: [CLM ] New Configuration:
> Jan 5 12:07:12 ha2lx openais[4988]: [CLM ] 	r(0) ip(10.42.21.29)
> Jan 5 12:07:12 ha2lx openais[4988]: [CLM ] Members Left:
> Jan 5 12:07:12 ha2lx openais[4988]: [CLM ] Members Joined:
> Jan 5 12:07:12 ha2lx openais[4988]: [CLM ] CLM CONFIGURATION CHANGE
> Jan 5 12:07:12 ha2lx openais[4988]: [CLM ] New Configuration:
> Jan 5 12:07:12 ha2lx openais[4988]: [CLM ] 	r(0) ip(10.42.21.27)
> Jan 5 12:07:12 ha2lx openais[4988]: [CLM ] 	r(0) ip(10.42.21.29)
> Jan 5 12:07:12 ha2lx openais[4988]: [CLM ] Members Left:
> Jan 5 12:07:12 ha2lx openais[4988]: [CLM ] Members Joined:
> Jan 5 12:07:12 ha2lx openais[4988]: [CLM ] 	r(0) ip(10.42.21.27)
> Jan 5 12:07:12 ha2lx openais[4988]: [SYNC ] This node is within the primary component and will provide service.
> Jan 5 12:07:12 ha2lx openais[4988]: [TOTEM] entering OPERATIONAL state.
> Jan 5 12:07:12 ha2lx openais[4988]: [MAIN ] Killing node 10.42.21.27 because it has rejoined the cluster with existing state
> Jan 5 12:07:12 ha2lx openais[4988]: [CMAN ] cman killed by node 2 because we rejoined the cluster without a full restart
> Jan 5 12:07:12 ha2lx gfs_controld[5016]: groupd_dispatch error -1 errno 11
> Jan 5 12:07:12 ha2lx gfs_controld[5016]: groupd connection died
> Jan 5 12:07:12 ha2lx gfs_controld[5016]: cluster is down, exiting
> Jan 5 12:07:12 ha2lx dlm_controld[5010]: cluster is down, exiting
> Jan 5 12:07:12 ha2lx kernel: dlm: closing connection to node 1
> Jan 5 12:07:12 ha2lx fenced[5004]: cluster is down, exiting
> -------
>
> Also, here is the log on node3:
>
> --
> [root@ha1lx ~]# tail -f /var/log/messages
> Jan 5 12:07:24 ha1lx openais[26029]: [TOTEM] entering OPERATIONAL state.
> Jan 5 12:07:24 ha1lx openais[26029]: [CLM ] got nodejoin message 10.42.21.27
> Jan 5 12:07:24 ha1lx openais[26029]: [CLM ] got nodejoin message 10.42.21.27
> Jan 5 12:07:24 ha1lx openais[26029]: [CPG ] got joinlist message from node 2
> Jan 5 12:07:27 ha1lx ccsd[26019]: Attempt to close an unopened CCS descriptor (4520670).
> Jan 5 12:07:27 ha1lx ccsd[26019]: Error while processing disconnect: Invalid request descriptor
> Jan 5 12:07:27 ha1lx fenced[26045]: fence "10.42.21.29" success
> Jan 5 12:07:27 ha1lx kernel: GFS2: fsid=ipmicluster:guest_roots.1: jid=0: Trying to acquire journal lock...
> Jan 5 12:07:27 ha1lx kernel: GFS2: fsid=ipmicluster:guest_roots.1: jid=0: Looking at journal...
> Jan 5 12:07:28 ha1lx kernel: GFS2: fsid=ipmicluster:guest_roots.1: jid=0: Done
> ----------------
>
>> HTH
>>
>> With warm regards
>>
>> Rajagopal
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster@xxxxxxxxxx
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>
> Thanks a lot
>
> Paras.

In an attempt to solve the fencing issue in my two-node cluster, I tried
running fence_ipmilan to check whether fencing is working or not.
I need to know what my problem is:

[root@ha1lx ~]# fence_ipmilan -a 10.42.21.28 -o off -l admin -p admin
Powering off machine @ IPMI:10.42.21.28...ipmilan: Failed to connect after 30 seconds
Failed
[root@ha1lx ~]#
---------------

Here 10.42.21.28 is the IP address assigned to the IPMI interface, and I am
running this command on the same host.

Thanks
Paras.

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
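A common cause of the timeout above is testing a BMC from its own host: once the OS owns the shared NIC, a node frequently cannot reach its own IPMI interface, so each fence device should be exercised from the node's peer instead. Below is a minimal sketch that prints the non-destructive status checks to run; the node/BMC pairings and admin/admin credentials are taken from the cluster.conf posted earlier, and the `peer_fence_checks` helper name is made up for illustration.

```shell
#!/bin/sh
# Print, for each cluster node, the fence_ipmilan status check that should be
# run from that node's PEER (assumption: BMC IPs and credentials as posted).
peer_fence_checks() {
    # node_ip:bmc_ip pairs from the posted cluster.conf
    for pair in 10.42.21.27:10.42.21.28 10.42.21.29:10.42.21.30; do
        node=${pair%%:*}   # cluster node address
        bmc=${pair##*:}    # that node's IPMI/BMC address
        echo "from the peer of $node: fence_ipmilan -a $bmc -o status -l admin -p admin"
    done
}

peer_fence_checks
```

Using `-o status` rather than `-o off` verifies connectivity and credentials without powering anything down; only once status works from the peer is a destructive `-o off` test worth trying.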