Hi Joel,

On Fri, 2010-10-01 at 15:09 +1000, Joel Heenan wrote:
> Are you saying that if you manually destroy the guest, then start it
> up it works?

No. I have to destroy both nodes.

> I don't think your problem is with fencing, I think it's that the two
> guests are not joining correctly. It seems like the fencing part is
> working.
>
> Do the logs in /var/log/messages show that one node successfully fenced
> the other? What is the output of group_tool on both nodes after they
> have come up, this should help you debug it.

Yes:

Oct  1 11:04:39 clu5 fenced[1541]: fence "clu6.snt.si" success

node1:

[root@clu5 ~]# group_tool
type             level name       id       state
fence            0     default    00010001 JOIN_STOP_WAIT
[1 2 2]
dlm              1     clvmd      00020001 JOIN_STOP_WAIT
[1 2 2]
dlm              1     rgmanager  00010002 none
[1 2]
[root@clu5 ~]#
[root@clu5 ~]# group_tool dump fence
1285924843 our_nodeid 1 our_name clu5.snt.si
1285924843 listen 4 member 5 groupd 7
1285924846 client 3: join default
1285924846 delay post_join 3s post_fail 0s
1285924846 added 2 nodes from ccs
1285924846 setid default 65537
1285924846 start default 1 members 1
1285924846 do_recovery stop 0 start 1 finish 0
1285924846 finish default 1
1285924846 stop default
1285924846 start default 2 members 2 1
1285924846 do_recovery stop 1 start 2 finish 1
1285924846 finish default 2
1285924936 stop default
1285924985 client 3: dump
1285925065 client 3: dump
1285925281 client 3: dump
[root@clu5 ~]#

node2:

[root@clu6 ~]# group_tool
type             level name       id       state
fence            0     default    00000000 JOIN_STOP_WAIT
[1 2]
dlm              1     clvmd      00000000 JOIN_STOP_WAIT
[1 2]
[root@clu6 ~]#
[root@clu6 ~]# group_tool dump fence
1285924935 our_nodeid 2 our_name clu6.snt.si
1285924935 listen 4 member 5 groupd 7
1285924936 client 3: join default
1285924936 delay post_join 3s post_fail 0s
1285924936 added 2 nodes from ccs
1285925291 client 3: dump
[root@clu6 ~]#

thx
br
jost

________________________________________
From: linux-cluster-bounces@xxxxxxxxxx [linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Joel Heenan [joelh@xxxxxxxxxxxxxx]
Sent: Friday, October 01, 2010 7:09 AM
To: linux clustering
Subject: Re: fence in xen

Are you saying that if you manually destroy the guest, then start it up it works?

I don't think your problem is with fencing; I think it's that the two guests are not joining correctly. It seems like the fencing part is working.

Do the logs in /var/log/messages show that one node successfully fenced the other? What is the output of group_tool on both nodes after they have come up? This should help you debug it.

I don't think it's relevant, but this item from the FAQ may help:

http://sources.redhat.com/cluster/wiki/FAQ/Fencing#fence_stuck

Joel
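For reference, the group_tool output above shows the fence and clvmd groups on clu5 stuck in JOIN_STOP_WAIT, with node 2 listed twice, while clu6 never completes its own join. A minimal sketch of checking that the two guests agree on membership and can actually exchange cluster traffic; the interface name eth0 and the default openais port 5405 are assumptions, adjust them to your setup:

# on each guest: membership as cman sees it
cman_tool status
cman_tool nodes

# on one guest: watch for cluster traffic from the peer
# (eth0 and udp port 5405 are assumed here)
tcpdump -i eth0 -n udp port 5405

If the peers never see each other's traffic here, the problem is in the interconnect rather than in fencing itself.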
On Wed, Sep 22, 2010 at 7:08 PM, Rakovec Jost <Jost.Rakovec@xxxxxx> wrote:

Hi,

anybody any idea? Please help!

Now I can fence the node, but after booting it can't connect to the cluster.

On dom0:

fence_xvmd -LX -I xenbr0 -U xen:/// -fdddddddddddddd
ipv4_connect: Connecting to client
ipv4_connect: Success; fd = 12
Rebooting domain oelcl21...
[REBOOT] Calling virDomainDestroy(0x99cede0)
libvir: Xen error : Domain not found: xenUnifiedDomainLookupByName
[[ XML Domain Info ]]
<domain type='xen' id='41'>
  <name>oelcl21</name>
  <uuid>07e31b27-1ff1-4754-4f58-221e8d2057d6</uuid>
  <memory>1048576</memory>
  <currentMemory>1048576</currentMemory>
  <vcpu>2</vcpu>
  <bootloader>/usr/bin/pygrub</bootloader>
  <os>
    <type>linux</type>
  </os>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <disk type='block' device='disk'>
      <driver name='phy'/>
      <source dev='/dev/vg_datastore/oelcl21'/>
      <target dev='xvda' bus='xen'/>
    </disk>
    <disk type='block' device='disk'>
      <driver name='phy'/>
      <source dev='/dev/vg_datastore/skupni1'/>
      <target dev='xvdb' bus='xen'/>
      <shareable/>
    </disk>
    <interface type='bridge'>
      <mac address='00:16:3e:7c:60:aa'/>
      <source bridge='xenbr0'/>
      <script path='/etc/xen/scripts/vif-bridge'/>
      <target dev='vif41.0'/>
    </interface>
    <console type='pty' tty='/dev/pts/2'>
      <source path='/dev/pts/2'/>
      <target port='0'/>
    </console>
  </devices>
</domain>
[[ XML END ]]
Calling virDomainCreateLinux()..

On domU (node1):

fence_xvm -H oelcl21 -ddd

clustat on node1:

[root@oelcl11 ~]# clustat
Cluster Status for cluster2 @ Wed Sep 22 11:04:49 2010
Member Status: Quorate

 Member Name                  ID   Status
 ------ ----                  ---- ------
 oelcl11                         1 Online, Local, rgmanager
 oelcl21                         2 Online, rgmanager

 Service Name                 Owner (Last)                 State
 ------- ----                 ----- ------                 -----
 service:web                  oelcl11                      started
[root@oelcl11 ~]#

But node2 waits for 300 s and can't connect:

Starting daemons... done
Starting fencing...
Sep 22 10:41:06 oelcl21 kernel: eth0: no IPv6 routers present
done
                                                           [  OK  ]

[root@oelcl21 ~]# clustat
Cluster Status for cluster2 @ Wed Sep 22 11:04:19 2010
Member Status: Quorate

 Member Name                  ID   Status
 ------ ----                  ---- ------
 oelcl11                         1 Online
 oelcl21                         2 Online, Local
[root@oelcl21 ~]#

br
jost

________________________________________
From: linux-cluster-bounces@xxxxxxxxxx [linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Rakovec Jost [Jost.Rakovec@xxxxxx]
Sent: Monday, September 13, 2010 9:31 AM
To: linux clustering
Subject: Re: fence in xen

Hi,

Q: must fence_xvmd also run in the domU? I ask because I notice that if I run this on the host while fence_xvmd is running:

[root@oelcl1 ~]# fence_xvm -H oelcl2 -ddd -o null
Debugging threshold is now 3
-- args @ 0x7fffe3f71fb0 --
  args->addr = 225.0.0.12
  args->domain = oelcl2
  args->key_file = /etc/cluster/fence_xvm.key
  args->op = 0
  args->hash = 2
  args->auth = 2
  args->port = 1229
  args->ifindex = 0
  args->family = 2
  args->timeout = 30
  args->retr_time = 20
  args->flags = 0
  args->debug = 3
-- end args --
Reading in key file /etc/cluster/fence_xvm.key into 0x7fffe3f70f60 (4096 max size)
Actual key length = 4096 bytes
Sending to 225.0.0.12 via 127.0.0.1
Sending to 225.0.0.12 via 10.9.131.80
Sending to 225.0.0.12 via 10.9.131.83
Sending to 225.0.0.12 via 192.168.122.1
Waiting for connection from XVM host daemon.
Issuing TCP challenge
Responding to TCP challenge
TCP Exchange + Authentication done...
Waiting for return value from XVM host
Remote: Operation was successful

but if I try an actual fence (reboot), I get:

[root@oelcl1 ~]# fence_xvm -H oelc2
Remote: Operation was successful
[root@oelcl1 ~]#

and host2 does not reboot.
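One thing worth checking when fence_xvm reports "Operation was successful" but the guest keeps running is whether the name passed with -H exactly matches a Xen domain name known on dom0. A quick sketch; the xen:/// URI is the one used with fence_xvmd above, and xm/virsh are assumed to be available on the dom0:

# on dom0: list the domains fence_xvmd can act on
xm list
virsh -c xen:/// list --all

# the -H argument and the domain= attribute in cluster.conf
# should match one of the names listed here exactly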
If fence_xvmd is not running on the host, I get a timeout instead:

[root@oelcl1 sysconfig]# fence_xvm -H oelcl2 -ddd -o null
Debugging threshold is now 3
-- args @ 0x7fff1a6b5580 --
  args->addr = 225.0.0.12
  args->domain = oelcl2
  args->key_file = /etc/cluster/fence_xvm.key
  args->op = 0
  args->hash = 2
  args->auth = 2
  args->port = 1229
  args->ifindex = 0
  args->family = 2
  args->timeout = 30
  args->retr_time = 20
  args->flags = 0
  args->debug = 3
-- end args --
Reading in key file /etc/cluster/fence_xvm.key into 0x7fff1a6b4530 (4096 max size)
Actual key length = 4096 bytes
Sending to 225.0.0.12 via 127.0.0.1
Sending to 225.0.0.12 via 10.9.131.80
Waiting for connection from XVM host daemon.
Sending to 225.0.0.12 via 127.0.0.1
Sending to 225.0.0.12 via 10.9.131.80
Waiting for connection from XVM host daemon.

Q: How can I check whether multicast is OK?

Q: On which network interface must fence_xvmd run on dom0? I notice that the hosts (domU) have a virbr0 interface:

virbr0    Link encap:Ethernet  HWaddr 00:00:00:00:00:00
          inet addr:192.168.122.1  Bcast:192.168.122.255  Mask:255.255.255.0
          inet6 addr: fe80::200:ff:fe00:0/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:40 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 b)  TX bytes:7212 (7.0 KiB)

and there is also a virbr0 on dom0. On dom0 (vm5) I tried both interfaces:

[root@vm5 ~]# fence_xvmd -fdd -I xenbr0
-- args @ 0xbfd26234 --
  args->addr = 225.0.0.12
  args->domain = (null)
  args->key_file = /etc/cluster/fence_xvm.key
  args->op = 2
  args->hash = 2
  args->auth = 2
  args->port = 1229
  args->ifindex = 7
  args->family = 2
  args->timeout = 30
  args->retr_time = 20
  args->flags = 1
  args->debug = 2
-- end args --
Opened ckpt vm_states
My Node ID = 1
Domain                   UUID                                 Owner State
------                   ----                                 ----- -----
Domain-0                 00000000-0000-0000-0000-000000000000 00001 00001
oelcl1                   2a53022c-5836-68f0-4514-02a5a0b07e81 00001 00002
oelcl2                   dd268dd4-f012-e0f7-7c77-aa8a58e1e6ab 00001 00002
oelcman                  09c783bd-9107-0916-ebbf-bd27bcc8babe 00001 00002
Storing oelcl1
Storing oelcl2

[root@vm5 ~]# fence_xvmd -fdd -I virbr0
-- args @ 0xbfd26234 --
  args->addr = 225.0.0.12
  args->domain = (null)
  args->key_file = /etc/cluster/fence_xvm.key
  args->op = 2
  args->hash = 2
  args->auth = 2
  args->port = 1229
  args->ifindex = 7
  args->family = 2
  args->timeout = 30
  args->retr_time = 20
  args->flags = 1
  args->debug = 2
-- end args --
Opened ckpt vm_states
My Node ID = 1
Domain                   UUID                                 Owner State
------                   ----                                 ----- -----
Domain-0                 00000000-0000-0000-0000-000000000000 00001 00001
oelcl1                   2a53022c-5836-68f0-4514-02a5a0b07e81 00001 00002
oelcl2                   dd268dd4-f012-e0f7-7c77-aa8a58e1e6ab 00001 00002
oelcman                  09c783bd-9107-0916-ebbf-bd27bcc8babe 00001 00002
Storing oelcl1
Storing oelcl2

No matter which interface I use, fencing is not done.

thx
br
jost
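To the multicast question above: one way to see whether the fence_xvm request ever reaches dom0 is to watch for the multicast traffic while sending a harmless request from a guest. A sketch using the address and port shown in the debug output (225.0.0.12, UDP 1229); xenbr0 is the bridge the guests are attached to in the domain XML above:

# on dom0, with fence_xvmd running:
tcpdump -i xenbr0 -n udp port 1229 and host 225.0.0.12

# on a domU, in another terminal:
fence_xvm -H oelcl2 -o null -ddd

If no packets appear on dom0, the multicast is being dropped between guest and bridge (firewall rules, bridge setup, or the wrong -I interface), and fencing cannot work regardless of the rest of the configuration.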
________________________________________
From: linux-cluster-bounces@xxxxxxxxxx [linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Rakovec Jost [Jost.Rakovec@xxxxxx]
Sent: Saturday, September 11, 2010 6:36 PM
To: linux-cluster@xxxxxxxxxx
Subject: fence in xen

Hi list!

I have a question about fence_xvm. The situation: one physical server running Xen (dom0) with two domU guests. The cluster between the domUs works fine (reboot, relocate). I'm using Red Hat 5.5.

The problem is fencing from dom0 with "fence_xvm -H oelcl2": the domU is destroyed, but when it is booted back it can't join the cluster. The domU takes a very long time to boot (FENCED_START_TIMEOUT=300), and on the console I get, after node2 is up:

node2:

INFO: task clurgmgrd:2127 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
clurgmgrd     D 0000000000000010     0  2127   2126                (NOTLB)
 ffff88006f08dda8 0000000000000286 ffff88007cc0b810 0000000000000000
 0000000000000003 ffff880072009860 ffff880072f6b0c0 00000000000455ec
 ffff880072009a48 ffffffff802649d7
Call Trace:
 [<ffffffff802649d7>] _read_lock_irq+0x9/0x19
 [<ffffffff8021420e>] filemap_nopage+0x193/0x360
 [<ffffffff80263a7e>] __mutex_lock_slowpath+0x60/0x9b
 [<ffffffff80263ac8>] .text.lock.mutex+0xf/0x14
 [<ffffffff88424b64>] :dlm:dlm_new_lockspace+0x2c/0x860
 [<ffffffff80222b08>] __up_read+0x19/0x7f
 [<ffffffff802d0abb>] __kmalloc+0x8f/0x9f
 [<ffffffff8842b6fa>] :dlm:device_write+0x438/0x5e5
 [<ffffffff80217377>] vfs_write+0xce/0x174
 [<ffffffff80217bc4>] sys_write+0x45/0x6e
 [<ffffffff802602f9>] tracesys+0xab/0xb6

During boot on node2:

Starting clvmd: dlm: Using TCP for communications
clvmd startup timed out
                                                           [FAILED]

node2:

[root@oelcl2 init.d]# clustat
Cluster Status for cluster1 @ Sat Sep 11 18:11:21 2010
Member Status: Quorate

 Member Name                  ID   Status
 ------ ----                  ---- ------
 oelcl1                          1 Online
 oelcl2                          2 Online, Local
[root@oelcl2 init.d]#

On the first node:

[root@oelcl1 ~]# clustat
Cluster Status for cluster1 @ Sat Sep 11 18:12:07 2010
Member Status: Quorate

 Member Name                  ID   Status
 ------ ----                  ---- ------
 oelcl1                          1 Online, Local, rgmanager
 oelcl2                          2 Online, rgmanager

 Service Name                 Owner (Last)                 State
 ------- ----                 ----- ------                 -----
 service:webby                oelcl1                       started
[root@oelcl1 ~]#

Then I have to destroy both domUs and create them again to get node2 working.

I have followed the how-tos at
https://access.redhat.com/kb/docs/DOC-5937
and
http://sources.redhat.com/cluster/wiki/VMClusterCookbook

Cluster config on dom0:

<?xml version="1.0"?>
<cluster alias="vmcluster" config_version="1" name="vmcluster">
        <clusternodes>
                <clusternode name="vm5" nodeid="1" votes="1"/>
        </clusternodes>
        <cman/>
        <fencedevices/>
        <rm/>
        <fence_xvmd/>
</cluster>

Cluster config on domU:

<?xml version="1.0"?>
<cluster alias="cluster1" config_version="49" name="cluster1">
        <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="4"/>
        <clusternodes>
                <clusternode name="oelcl1.name.com" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device domain="oelcl1" name="xenfence1"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="oelcl2.name.com" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device domain="oelcl2" name="xenfence1"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices>
                <fencedevice agent="fence_xvm" name="xenfence1"/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="prefer_node1" nofailback="0" ordered="1" restricted="1">
                                <failoverdomainnode name="oelcl1.name.com" priority="1"/>
                                <failoverdomainnode name="oelcl2.name.com" priority="2"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <ip address="xx.xx.xx.xx" monitor_link="1"/>
                        <fs device="/dev/xvdb1" force_fsck="0" force_unmount="0" fsid="8669" fstype="ext3" mountpoint="/var/www/html" name="docroot" self_fence="0"/>
                        <script file="/etc/init.d/httpd" name="apache_s"/>
                </resources>
                <service autostart="1" domain="prefer_node1" exclusive="0" name="webby" recovery="relocate">
                        <ip ref="xx.xx.xx.xx"/>
                        <fs ref="docroot"/>
                        <script ref="apache_s"/>
                </service>
        </rm>
</cluster>
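The configs above rely on fence_xvm and fence_xvmd sharing the key file shown in the debug output (/etc/cluster/fence_xvm.key, 4096 bytes); the same key has to be present on dom0 and on every guest node. A sketch of generating and distributing it, using the host names from this thread:

# on dom0 (vm5): create a 4096-byte random key
dd if=/dev/urandom of=/etc/cluster/fence_xvm.key bs=4096 count=1
chmod 600 /etc/cluster/fence_xvm.key

# copy the identical key to each guest
scp /etc/cluster/fence_xvm.key root@oelcl1:/etc/cluster/
scp /etc/cluster/fence_xvm.key root@oelcl2:/etc/cluster/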
Fence processes on dom0:

[root@vm5 cluster]# ps -ef | grep fenc
root     18690     1  0 17:40 ?        00:00:00 /sbin/fenced
root     18720     1  0 17:40 ?        00:00:00 /sbin/fence_xvmd -I xenbr0
root     22633 14524  0 18:21 pts/3    00:00:00 grep fenc
[root@vm5 cluster]#

and on domU:

[root@oelcl1 ~]# ps -ef | grep fen
root      1523     1  0 17:41 ?        00:00:00 /sbin/fenced
root     13695  2902  0 18:22 pts/0    00:00:00 grep fen
[root@oelcl1 ~]#

Does somebody have an idea why fencing doesn't work?

thx
br
jost

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster