"""
[root@clu5 ~]# group_tool
type             level name       id       state
fence            0     default    00010001 JOIN_STOP_WAIT
[1 2 2]
dlm              1     clvmd      00020001 JOIN_STOP_WAIT
[1 2 2]
dlm              1     rgmanager  00010002 none
[1 2]
""
To my understanding this means that the fence domain and the dlm group for clvmd both see two copies of node 2. You'll have to check how this happened: did cman start twice? Did you manually stop it and start it again?
Try disabling your firewall and getting both nodes up in a stable state; the state should be "none" everywhere. Once that is done, look at fencing again.
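For example (just a sketch, assuming the stock RHEL 5 init scripts and iptables), on each node something like:

# take the firewall out of the equation while testing
service iptables stop

# restart the cluster stack cleanly, top down then bottom up
service rgmanager stop
service clvmd stop
service cman stop
service cman start
service clvmd start
service rgmanager start

# membership should settle, with "none" in the state column
cman_tool nodes
group_tool

If the duplicate entry for node 2 is still there after a clean restart on both nodes, that would point at membership (cman/groupd) rather than at fencing.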
Joel
On Fri, Oct 1, 2010 at 11:42 PM, Rakovec Jost <Jost.Rakovec@xxxxxx> wrote:
Hi Joel,
No. I have to destroy both nodes.
On Fri, 2010-10-01 at 15:09 +1000, Joel Heenan wrote:
> Are you saying that if you manually destroy the guest, then start it
> up it works?
yes
>
> I don't think your problem is with fencing; I think it's that the two
> guests are not joining correctly. It seems like the fencing part is
> working.
>
> Do the logs in /var/log/messages show that one node successfully fenced
> the other? What is the output of group_tool on both nodes after they
> have come up? This should help you debug it.
>
Oct 1 11:04:39 clu5 fenced[1541]: fence "clu6.snt.si" success
node1
[root@clu5 ~]# group_tool
type             level name       id       state
fence            0     default    00010001 JOIN_STOP_WAIT
[1 2 2]
dlm              1     clvmd      00020001 JOIN_STOP_WAIT
[1 2 2]
dlm              1     rgmanager  00010002 none
[1 2]
[root@clu5 ~]#
[root@clu5 ~]#
[root@clu5 ~]# group_tool dump fence
1285924843 our_nodeid 1 our_name clu5.snt.si
1285924843 listen 4 member 5 groupd 7
1285924846 client 3: join default
1285924846 delay post_join 3s post_fail 0s
1285924846 added 2 nodes from ccs
1285924846 setid default 65537
1285924846 start default 1 members 1
1285924846 do_recovery stop 0 start 1 finish 0
1285924846 finish default 1
1285924846 stop default
1285924846 start default 2 members 2 1
1285924846 do_recovery stop 1 start 2 finish 1
1285924846 finish default 2
1285924936 stop default
1285924985 client 3: dump
1285925065 client 3: dump
1285925281 client 3: dump
[root@clu5 ~]#
node2
[root@clu6 ~]# group_tool
type             level name       id       state
fence            0     default    00000000 JOIN_STOP_WAIT
[1 2]
dlm              1     clvmd      00000000 JOIN_STOP_WAIT
[1 2]
[root@clu6 ~]#
[root@clu6 ~]#
[root@clu6 ~]# group_tool dump fence
1285924935 our_nodeid 2 our_name clu6.snt.si
1285924935 listen 4 member 5 groupd 7
1285924936 client 3: join default
1285924936 delay post_join 3s post_fail 0s
1285924936 added 2 nodes from ccs
1285925291 client 3: dump
[root@clu6 ~]#
thx
br jost
________________________________________
From: linux-cluster-bounces@xxxxxxxxxx [linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Joel Heenan [joelh@xxxxxxxxxxxxxx]
Sent: Friday, October 01, 2010 7:09 AM
To: linux clustering
Subject: Re: fence in xen
Are you saying that if you manually destroy the guest, then start it up it works?
I don't think your problem is with fencing; I think it's that the two guests are not joining correctly. It seems like the fencing part is working.
Do the logs in /var/log/messages show that one node successfully fenced the other? What is the output of group_tool on both nodes after they have come up? This should help you debug it.
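For example, something along the lines of:

grep fence /var/log/messages

on each node should show whether fenced attempted the fence and whether it reported success.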
I don't think it's relevant, but this item from the FAQ may help:
http://sources.redhat.com/cluster/wiki/FAQ/Fencing#fence_stuck
Joel
From: linux-cluster-bounces@xxxxxxxxxx [linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Rakovec Jost [Jost.Rakovec@xxxxxx]
On Wed, Sep 22, 2010 at 7:08 PM, Rakovec Jost <Jost.Rakovec@xxxxxx> wrote:
Hi
Anybody have any idea? Please help!!
Now I can fence the node, but after booting it can't connect to the cluster.
on dom0
fence_xvmd -LX -I xenbr0 -U xen:/// -fdddddddddddddd
ipv4_connect: Connecting to client
ipv4_connect: Success; fd = 12
Rebooting domain oelcl21...
[REBOOT] Calling virDomainDestroy(0x99cede0)
libvir: Xen error : Domain not found: xenUnifiedDomainLookupByName
[[ XML Domain Info ]]
<domain type='xen' id='41'>
  <name>oelcl21</name>
  <uuid>07e31b27-1ff1-4754-4f58-221e8d2057d6</uuid>
  <memory>1048576</memory>
  <currentMemory>1048576</currentMemory>
  <vcpu>2</vcpu>
  <bootloader>/usr/bin/pygrub</bootloader>
  <os>
    <type>linux</type>
  </os>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <disk type='block' device='disk'>
      <driver name='phy'/>
      <source dev='/dev/vg_datastore/oelcl21'/>
      <target dev='xvda' bus='xen'/>
    </disk>
    <disk type='block' device='disk'>
      <driver name='phy'/>
      <source dev='/dev/vg_datastore/skupni1'/>
      <target dev='xvdb' bus='xen'/>
      <shareable/>
    </disk>
    <interface type='bridge'>
      <mac address='00:16:3e:7c:60:aa'/>
      <source bridge='xenbr0'/>
      <script path='/etc/xen/scripts/vif-bridge'/>
      <target dev='vif41.0'/>
    </interface>
    <console type='pty' tty='/dev/pts/2'>
      <source path='/dev/pts/2'/>
      <target port='0'/>
    </console>
  </devices>
</domain>
[[ XML END ]]
Calling virDomainCreateLinux()..
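(Side note: the "Domain not found" message above is presumably just the lookup after virDomainDestroy(), but it is worth confirming that the guest name fence_xvmd is asked to act on matches exactly what the hypervisor reports, e.g. on dom0:

virsh -c xen:/// list --all
xm list

and comparing those names against the domain= value used by the fence device in the domU cluster.conf.)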
on domU -node1
fence_xvm -H oelcl21 -ddd
clustat on node1:
[root@oelcl11 ~]# clustat
Cluster Status for cluster2 @ Wed Sep 22 11:04:49 2010
Member Status: Quorate
 Member Name                        ID   Status
 ------ ----                        ---- ------
 oelcl11                               1 Online, Local, rgmanager
 oelcl21                               2 Online, rgmanager

 Service Name              Owner (Last)                   State
 ------- ----              ----- ------                   -----
 service:web               oelcl11                        started
[root@oelcl11 ~]#
But node2 waits for 300s and can't connect:
Starting daemons... done
Starting fencing... Sep 22 10:41:06 oelcl21 kernel: eth0: no IPv6 routers present
done
[ OK ]
[root@oelcl21 ~]# clustat
Cluster Status for cluster2 @ Wed Sep 22 11:04:19 2010
Member Status: Quorate
 Member Name                        ID   Status
 ------ ----                        ---- ------
 oelcl11                               1 Online
 oelcl21                               2 Online, Local
[root@oelcl21 ~]#
br
jost
________________________________________
From: linux-cluster-bounces@xxxxxxxxxx [linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Rakovec Jost [Jost.Rakovec@xxxxxx]
Sent: Monday, September 13, 2010 9:31 AM
To: linux clustering
Subject: Re: fence in xen
Hi
Q: Must fence_xvmd also run in domU?
Because I notice that if I run this on the host while fence_xvmd is running:
[root@oelcl1 ~]# fence_xvm -H oelcl2 -ddd -o null
Debugging threshold is now 3
-- args @ 0x7fffe3f71fb0 --
args->addr = 225.0.0.12
args->domain = oelcl2
args->key_file = /etc/cluster/fence_xvm.key
args->op = 0
args->hash = 2
args->auth = 2
args->port = 1229
args->ifindex = 0
args->family = 2
args->timeout = 30
args->retr_time = 20
args->flags = 0
args->debug = 3
-- end args --
Reading in key file /etc/cluster/fence_xvm.key into 0x7fffe3f70f60 (4096 max size)
Actual key length = 4096 bytes
Sending to 225.0.0.12 via 127.0.0.1
Sending to 225.0.0.12 via 10.9.131.80
Sending to 225.0.0.12 via 10.9.131.83
Sending to 225.0.0.12 via 192.168.122.1
Waiting for connection from XVM host daemon.
Issuing TCP challenge
Responding to TCP challenge
TCP Exchange + Authentication done...
Waiting for return value from XVM host
Remote: Operation was successful
But if I try to fence --> reboot, then I get:
[root@oelcl1 ~]# fence_xvm -H oelc2
Remote: Operation was successful
[root@oelcl1 ~]#
But host2 does not reboot.
If fence_xvmd is not running on the host, then I get a timeout:
[root@oelcl1 sysconfig]# fence_xvm -H oelcl2 -ddd -o null
Debugging threshold is now 3
-- args @ 0x7fff1a6b5580 --
args->addr = 225.0.0.12
args->domain = oelcl2
args->key_file = /etc/cluster/fence_xvm.key
args->op = 0
args->hash = 2
args->auth = 2
args->port = 1229
args->ifindex = 0
args->family = 2
args->timeout = 30
args->retr_time = 20
args->flags = 0
args->debug = 3
-- end args --
Reading in key file /etc/cluster/fence_xvm.key into 0x7fff1a6b4530 (4096 max size)
Actual key length = 4096 bytes
Sending to 225.0.0.12 via 127.0.0.1
Sending to 225.0.0.12 via 10.9.131.80
Waiting for connection from XVM host daemon.
Sending to 225.0.0.12 via 127.0.0.1
Sending to 225.0.0.12 via 10.9.131.80
Waiting for connection from XVM host daemon.
Q: How can I check whether multicast is OK?
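One way to check (assuming tcpdump is available on dom0) is to watch for the fence_xvm multicast traffic on the bridge while running the agent from a domU, e.g.:

# on dom0
tcpdump -n -i xenbr0 host 225.0.0.12 and port 1229

# on a domU, in another window
fence_xvm -o null -H oelcl2

If no packets for 225.0.0.12 show up on the interface fence_xvmd listens on, the requests are going out a different interface or being dropped on the way.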
Q: On which network interface must fence_xvmd run on dom0? I notice that on the hosts (domU) there is:
virbr0 Link encap:Ethernet HWaddr 00:00:00:00:00:00
inet addr:192.168.122.1 Bcast:192.168.122.255 Mask:255.255.255.0
inet6 addr: fe80::200:ff:fe00:0/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:40 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 b) TX bytes:7212 (7.0 KiB)
so virbr0 is also present.
And on dom0 (vm5):
[root@vm5 ~]# fence_xvmd -fdd -I xenbr0
-- args @ 0xbfd26234 --
args->addr = 225.0.0.12
args->domain = (null)
args->key_file = /etc/cluster/fence_xvm.key
args->op = 2
args->hash = 2
args->auth = 2
args->port = 1229
args->ifindex = 7
args->family = 2
args->timeout = 30
args->retr_time = 20
args->flags = 1
args->debug = 2
-- end args --
Opened ckpt vm_states
My Node ID = 1
Domain                   UUID                                 Owner State
------                   ----                                 ----- -----
Domain-0                 00000000-0000-0000-0000-000000000000 00001 00001
oelcl1                   2a53022c-5836-68f0-4514-02a5a0b07e81 00001 00002
oelcl2                   dd268dd4-f012-e0f7-7c77-aa8a58e1e6ab 00001 00002
oelcman                  09c783bd-9107-0916-ebbf-bd27bcc8babe 00001 00002
Storing oelcl1
Storing oelcl2
[root@vm5 ~]# fence_xvmd -fdd -I virbr0
-- args @ 0xbfd26234 --
args->addr = 225.0.0.12
args->domain = (null)
args->key_file = /etc/cluster/fence_xvm.key
args->op = 2
args->hash = 2
args->auth = 2
args->port = 1229
args->ifindex = 7
args->family = 2
args->timeout = 30
args->retr_time = 20
args->flags = 1
args->debug = 2
-- end args --
Opened ckpt vm_states
My Node ID = 1
Domain                   UUID                                 Owner State
------                   ----                                 ----- -----
Domain-0                 00000000-0000-0000-0000-000000000000 00001 00001
oelcl1                   2a53022c-5836-68f0-4514-02a5a0b07e81 00001 00002
oelcl2                   dd268dd4-f012-e0f7-7c77-aa8a58e1e6ab 00001 00002
oelcman                  09c783bd-9107-0916-ebbf-bd27bcc8babe 00001 00002
Storing oelcl1
Storing oelcl2
No matter which interface I use, the fence is not done.
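(One thing that stands out above: args->ifindex is 7 in both runs, whether -I xenbr0 or -I virbr0 was given. Something like

ip link show

on dom0 would show which interface actually has index 7, which may help confirm where fence_xvmd is really listening for the multicast.)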
thx
br jost
_____________________________________
Sent: Saturday, September 11, 2010 6:36 PM
To: linux-cluster@xxxxxxxxxx
Subject: fence in xen
Hi list!
I have a question about fence_xvm.
Situation is:
One physical server with Xen --> dom0 with 2 domUs. The cluster works fine between the domUs -- reboot, relocate.
I'm using Red Hat 5.5.
The problem is with fencing from dom0 with "fence_xvm -H oelcl2": the domU is destroyed, but when it is booted back it can't join the cluster. The domU takes a very long time to boot --> FENCED_START_TIMEOUT=300.
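(If it helps while debugging, I believe FENCED_START_TIMEOUT is read by the cman init script and can be overridden in /etc/sysconfig/cman rather than edited in the script itself, e.g.:

# /etc/sysconfig/cman -- shorten the fence-domain join wait while testing
FENCED_START_TIMEOUT=60

This only shortens the wait; it does not fix whatever keeps the node from joining.)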
On the console I get this after node2 is up:
node2:
INFO: task clurgmgrd:2127 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
clurgmgrd D 0000000000000010 0 2127 2126 (NOTLB)
ffff88006f08dda8 0000000000000286 ffff88007cc0b810 0000000000000000
0000000000000003 ffff880072009860 ffff880072f6b0c0 00000000000455ec
ffff880072009a48 ffffffff802649d7
Call Trace:
[<ffffffff802649d7>] _read_lock_irq+0x9/0x19
[<ffffffff8021420e>] filemap_nopage+0x193/0x360
[<ffffffff80263a7e>] __mutex_lock_slowpath+0x60/0x9b
[<ffffffff80263ac8>] .text.lock.mutex+0xf/0x14
[<ffffffff88424b64>] :dlm:dlm_new_lockspace+0x2c/0x860
[<ffffffff80222b08>] __up_read+0x19/0x7f
[<ffffffff802d0abb>] __kmalloc+0x8f/0x9f
[<ffffffff8842b6fa>] :dlm:device_write+0x438/0x5e5
[<ffffffff80217377>] vfs_write+0xce/0x174
[<ffffffff80217bc4>] sys_write+0x45/0x6e
[<ffffffff802602f9>] tracesys+0xab/0xb6
During boot on node2:
Starting clvmd: dlm: Using TCP for communications
clvmd startup timed out
[FAILED]
node2:
[root@oelcl2 init.d]# clustat
Cluster Status for cluster1 @ Sat Sep 11 18:11:21 2010
Member Status: Quorate
 Member Name                        ID   Status
 ------ ----                        ---- ------
 oelcl1                                1 Online
 oelcl2                                2 Online, Local
[root@oelcl2 init.d]#
on first node:
[root@oelcl1 ~]# clustat
Cluster Status for cluster1 @ Sat Sep 11 18:12:07 2010
Member Status: Quorate
 Member Name                        ID   Status
 ------ ----                        ---- ------
 oelcl1                                1 Online, Local, rgmanager
 oelcl2                                2 Online, rgmanager

 Service Name              Owner (Last)                   State
 ------- ----              ----- ------                   -----
 service:webby             oelcl1                         started
[root@oelcl1 ~]#
And then I have to destroy both domUs and create them again to get node2 working.
I have followed the how-tos at https://access.redhat.com/kb/docs/DOC-5937 and http://sources.redhat.com/cluster/wiki/VMClusterCookbook.
cluster config on dom0
<?xml version="1.0"?>
<cluster alias="vmcluster" config_version="1" name="vmcluster">
    <clusternodes>
        <clusternode name="vm5" nodeid="1" votes="1"/>
    </clusternodes>
    <cman/>
    <fencedevices/>
    <rm/>
    <fence_xvmd/>
</cluster>
cluster config on domU
<?xml version="1.0"?>
<cluster alias="cluster1" config_version="49" name="cluster1">
    <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="4"/>
    <clusternodes>
        <clusternode name="oelcl1.name.comi" nodeid="1" votes="1">
            <fence>
                <method name="1">
                    <device domain="oelcl1" name="xenfence1"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="oelcl2.name.com" nodeid="2" votes="1">
            <fence>
                <method name="1">
                    <device domain="oelcl2" name="xenfence1"/>
                </method>
            </fence>
        </clusternode>
    </clusternodes>
    <cman expected_votes="1" two_node="1"/>
    <fencedevices>
        <fencedevice agent="fence_xvm" name="xenfence1"/>
    </fencedevices>
    <rm>
        <failoverdomains>
            <failoverdomain name="prefer_node1" nofailback="0" ordered="1" restricted="1">
                <failoverdomainnode name="oelcl1.name.com" priority="1"/>
                <failoverdomainnode name="oelcl2.name.com" priority="2"/>
            </failoverdomain>
        </failoverdomains>
        <resources>
            <ip address="xx.xx.xx.xx" monitor_link="1"/>
            <fs device="/dev/xvdb1" force_fsck="0" force_unmount="0" fsid="8669" fstype="ext3" mountpoint="/var/www/html" name="docroot" self_fence="0"/>
            <script file="/etc/init.d/httpd" name="apache_s"/>
        </resources>
        <service autostart="1" domain="prefer_node1" exclusive="0" name="webby" recovery="relocate">
            <ip ref="xx.xx.xx.xx"/>
            <fs ref="docroot"/>
            <script ref="apache_s"/>
        </service>
    </rm>
</cluster>
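With this config in place, one way to test fencing through the cluster stack itself (rather than calling fence_xvm by hand) is to ask fenced to do it from the surviving node, using the node name as it appears in cluster.conf, e.g.:

fence_node oelcl2.name.com

If that works but the automatic fence does not, the problem is more likely in membership/startup than in the fence agent itself.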
Fence processes on dom0:
[root@vm5 cluster]# ps -ef |grep fenc
root 18690 1 0 17:40 ? 00:00:00 /sbin/fenced
root 18720 1 0 17:40 ? 00:00:00 /sbin/fence_xvmd -I xenbr0
root 22633 14524 0 18:21 pts/3 00:00:00 grep fenc
[root@vm5 cluster]#
and on domU
[root@oelcl1 ~]# ps -ef|grep fen
root 1523 1 0 17:41 ? 00:00:00 /sbin/fenced
root 13695 2902 0 18:22 pts/0 00:00:00 grep fen
[root@oelcl1 ~]#
Does somebody have any idea why fencing doesn't work?
thx
br
jost
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster