Re: Problem with service migration with xen domU on different dom0 with redhat 5.4

Problem solved:

In this case I had several network-related problems; now I have both migration and failover working. Something very useful was getting multicast communication working on virbr0 and choosing an explicit multicast address. Another important factor was that in this version fence_xvmd did not recognize the Xen domain names as usual: I had to add -U xen:///, and only with that option did the fence actually get executed.
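
For anyone debugging something similar, a quick way to confirm that the fence_xvm multicast traffic actually reaches the bridge is to watch it with tcpdump on the dom0 (a minimal check, assuming tcpdump is installed; adjust the interface and group to your own setup):

# on the dom0, watch for traffic to the multicast group used by fence_xvm / fence_xvmd
tcpdump -i virbr0 -n host 224.0.0.1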

In rc.local of dom0 I put:

/sbin/fence_xvmd -LX -I virbr0 -U xen:/// -a 224.0.0.1
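
For completeness, roughly how the two keys can be generated and distributed (the paths follow my cluster.conf below; as far as I can tell fence_xvmd reads /etc/cluster/fence_xvm.key by default unless you start it with -k, so the dom0-side copy has to live there or be pointed to explicitly):

# on dom0 #1: create a random shared secret for its fence_xvmd
dd if=/dev/urandom of=/etc/cluster/fence_xvm.key bs=4k count=1
# copy the same secret to both domUs under the name cluster.conf expects
scp /etc/cluster/fence_xvm.key vmapache1.foo.com:/etc/cluster/host-1.key
scp /etc/cluster/fence_xvm.key vmapache2.foo.com:/etc/cluster/host-1.key
# repeat on dom0 #2, distributing its key to the domUs as /etc/cluster/host-2.key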

In cluster.conf I specified the address 224.0.0.1 and also generated two different fence_xvm.key files, one per host. My cluster.conf is:

<?xml version="1.0"?>
<cluster alias="clusterapache01" config_version="87" name="clusterapache01">
    <clusternodes>
        <clusternode name="vmapache1.foo.com" nodeid="1" votes="1">
            <fence>
                <method name="1">
                    <device domain="vmapache1" name="xenfence1"/>
                </method>
            </fence>
            <multicast addr="224.0.0.1"/>
        </clusternode>
        <clusternode name="vmapache2.foo.com" nodeid="2" votes="1">
            <fence>
                <method name="1">
                    <device domain="vmapache2" name="xenfence2"/>
                </method>
            </fence>
            <multicast addr="224.0.0.1"/>
        </clusternode>
    </clusternodes>
    <cman expected_votes="3">
        <multicast addr="224.0.0.1"/>
    </cman>
    <rm log_level="7">
        <failoverdomains>
            <failoverdomain name="prefer_node1" nofailback="1" ordered="1" restricted="1">
                <failoverdomainnode name="vmapache1.foo.com" priority="1"/>
                <failoverdomainnode name="vmapache2.foo.com" priority="2"/>
            </failoverdomain>
        </failoverdomains>
        <resources>
            <ip address="172.19.52.120" monitor_link="1"/>
            <apache config_file="conf/httpd.conf" name="web1" server_root="/etc/httpd" shutdown_wait="0"/>
            <script file="/etc/init.d/httpd" name="httpd"/>
        </resources>
        <service autostart="1" domain="prefer_node1" exclusive="1" name="web-scs" recovery="relocate">
            <ip ref="172.19.52.120"/>
            <script ref="httpd"/>
        </service>
    </rm>
    <totem consensus="4800" join="60" token="10000" token_retransmits_before_loss_const="20"/>
    <fencedevices>
        <fencedevice agent="fence_xvm" key_file="/etc/cluster/host-1.key" name="xenfence1"/>
        <fencedevice agent="fence_xvm" key_file="/etc/cluster/host-2.key" name="xenfence2"/>
    </fencedevices>
    <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
    <quorumd device="/dev/sda1" interval="2" min_score="1" tko="10" votes="1">
        <heuristic interval="2" program="ping -c1 -t1 172.19.52.119" score="1"/>
    </quorumd>
    <fence_xvmd/>
</cluster>
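
With this in place I could also test the fence path by hand from one domU against the other. Something like the following should get an immediate answer from the dom0's fence_xvmd instead of a timeout (the null operation, if your fence_xvm build has it, only checks the round trip; be aware that the default operation is reboot):

# from vmapache2, ask dom0 #1's fence_xvmd about vmapache1 without rebooting it
fence_xvm -a 224.0.0.1 -k /etc/cluster/host-1.key -H vmapache1 -o null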

Now the only thing left I would like to do is add fabric fencing as a backup for when a dom0 goes down. Does anyone have experience using 3Com or D-Link switches for fabric fencing? A rough sketch of the structure I have in mind is below.
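
As I understand it, fenced tries the <method> blocks for a node in order, so a backup would just be a second method per node plus a matching fence device. The agent name and its attributes below are only placeholders (fence_ifmib is a generic SNMP agent, but I have not checked whether it ships in this release or can drive a 3Com/D-Link switch):

        <clusternode name="vmapache1.foo.com" nodeid="1" votes="1">
            <fence>
                <method name="1">
                    <device domain="vmapache1" name="xenfence1"/>
                </method>
                <!-- hypothetical second method, tried only if method 1 fails -->
                <method name="2">
                    <device name="switchfence1" port="10"/>
                </method>
            </fence>
        </clusternode>
        ...
        <!-- placeholder device: agent name, attributes and SWITCH-IP are assumptions -->
        <fencedevice agent="fence_ifmib" community="private" ipaddr="SWITCH-IP" name="switchfence1"/>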

Best Regards,



Carlos Vermejo Ruiz

----- Original Message -----
From: "Carlos VERMEJO RUIZ" <cvermejo@xxxxxxxxxxxxxxxxxxxxxxx>
To: linux-cluster@xxxxxxxxxx
Sent: Monday, May 10, 2010 22:42:42
Subject: Re: Problem with service migration with xen domU on different dom0 with redhat 5.4

I just came back from a trip and made some changes to my cluster.conf, but now I am getting a clearer error:

May 10 20:27:23 vmapache2 ccsd[1550]: Error while processing disconnect: Invalid request descriptor
May 10 20:27:23 vmapache2 fenced[1620]: fence "vmapache1.foo.com" failed

I also got more information telling me that the cluster services on node 1 are down; when I restart rgmanager it starts working again.

More details:

[root@vmapache2 ~]# service rgmanager status
clurgmgrd (pid 1866) is running...
[root@vmapache2 ~]# cman_tool status
Version: 6.2.0
Config Version: 60
Cluster Name: clusterapache01
Cluster Id: 38965
Cluster Member: Yes
Cluster Generation: 300
Membership state: Cluster-Member
Nodes: 2
Expected votes: 3
Quorum device votes: 1
Total votes: 3
Quorum: 2
Active subsystems: 10
Flags: Dirty
Ports Bound: 0 11 177
Node name: vmapache2.foo.com
Node ID: 2
Multicast addresses: 225.0.0.1
Node addresses: 172.19.168.122
[root@vmapache2 ~]#
 
/var/log/messages:
 
May 10 20:27:07 vmapache2 openais[1562]: [CLM  ] got nodejoin message 172.19.168.121
May 10 20:27:07 vmapache2 openais[1562]: [CLM  ] got nodejoin message 172.19.168.122
May 10 20:27:07 vmapache2 openais[1562]: [CPG  ] got joinlist message from node 2
May 10 20:27:23 vmapache2 fenced[1620]: agent "fence_xvm" reports: Timed out waiting for response
May 10 20:27:23 vmapache2 ccsd[1550]: Attempt to close an unopened CCS descriptor (35940).
May 10 20:27:23 vmapache2 ccsd[1550]: Error while processing disconnect: Invalid request descriptor
May 10 20:27:23 vmapache2 fenced[1620]: fence "vmapache1.foo.com" failed
May 10 20:27:29 vmapache2 kernel: dlm: connecting to 1
May 10 20:27:29 vmapache2 kernel: dlm: got connection from 1
May 10 20:27:41 vmapache2 clurgmgrd[1867]: <info> State change: vmapache1.foo.com UP
 
 
[root@vmapache2 ~]# tail -n 100 /var/log/messages
May 10 20:24:25 vmapache2 openais[1562]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
May 10 20:24:25 vmapache2 openais[1562]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
May 10 20:24:25 vmapache2 openais[1562]: [TOTEM] entering GATHER state from 2.
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] entering GATHER state from 0.
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] Creating commit token because I am the rep.
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] Saving state aru 49 high seq received 49
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] Storing new sequence id for ring 128
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] entering COMMIT state.
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] entering RECOVERY state.
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] position [0] member 172.19.168.122:
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] previous ring seq 292 rep 172.19.168.121
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] aru 49 high delivered 49 received flag 1
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] Did not need to originate any messages in recovery.
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] Sending initial ORF token
May 10 20:24:30 vmapache2 openais[1562]: [CLM  ] CLM CONFIGURATION CHANGE
May 10 20:24:30 vmapache2 openais[1562]: [CLM  ] New Configuration:
May 10 20:24:30 vmapache2 fenced[1620]: vmapache1.foo.com not a cluster member after 0 sec post_fail_delay
May 10 20:24:30 vmapache2 kernel: dlm: closing connection to node 1
May 10 20:24:30 vmapache2 clurgmgrd[1867]: <info> State change: vmapache1.foo.com DOWN
May 10 20:24:30 vmapache2 openais[1562]: [CLM  ]        r(0) ip(172.19.168.122)
May 10 20:24:30 vmapache2 fenced[1620]: fencing node "vmapache1.foo.com"
May 10 20:24:30 vmapache2 openais[1562]: [CLM  ] Members Left:
May 10 20:24:30 vmapache2 openais[1562]: [CLM  ]        r(0) ip(172.19.168.121)
May 10 20:24:30 vmapache2 openais[1562]: [CLM  ] Members Joined:
May 10 20:24:30 vmapache2 openais[1562]: [CLM  ] CLM CONFIGURATION CHANGE
May 10 20:24:30 vmapache2 openais[1562]: [CLM  ] New Configuration:
May 10 20:24:30 vmapache2 openais[1562]: [CLM  ]        r(0) ip(172.19.168.122)
May 10 20:24:30 vmapache2 openais[1562]: [CLM  ] Members Left:
May 10 20:24:30 vmapache2 openais[1562]: [CLM  ] Members Joined:
May 10 20:24:30 vmapache2 openais[1562]: [SYNC ] This node is within the primary component and will provide service.
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] entering OPERATIONAL state.
May 10 20:24:30 vmapache2 openais[1562]: [CLM  ] got nodejoin message 172.19.168.122
May 10 20:24:30 vmapache2 openais[1562]: [CPG  ] got joinlist message from node 2
May 10 20:24:35 vmapache2 clurgmgrd[1867]: <info> Waiting for node #1 to be fenced
May 10 20:24:47 vmapache2 qdiskd[1604]: <info> Assuming master role
May 10 20:24:49 vmapache2 openais[1562]: [CMAN ] lost contact with quorum device
May 10 20:24:49 vmapache2 openais[1562]: [CMAN ] quorum lost, blocking activity
May 10 20:24:49 vmapache2 clurgmgrd[1867]: <emerg> #1: Quorum Dissolved
May 10 20:24:49 vmapache2 qdiskd[1604]: <notice> Writing eviction notice for node 1
May 10 20:24:49 vmapache2 openais[1562]: [CMAN ] quorum regained, resuming activity
May 10 20:24:49 vmapache2 clurgmgrd: [1867]: <info> Stopping Service apache:web1
May 10 20:24:49 vmapache2 clurgmgrd: [1867]: <err> Checking Existence Of File /var/run/cluster/apache/apache:web1.pid [apache:web1] > Failed - File Doesn't Exist
May 10 20:24:49 vmapache2 clurgmgrd: [1867]: <info> Stopping Service apache:web1 > Succeed
May 10 20:24:49 vmapache2 clurgmgrd[1867]: <notice> Quorum Regained
May 10 20:24:49 vmapache2 clurgmgrd[1867]: <info> State change: Local UP
May 10 20:24:51 vmapache2 qdiskd[1604]: <notice> Node 1 evicted
May 10 20:25:00 vmapache2 fenced[1620]: agent "fence_xvm" reports: Timed out waiting for response
May 10 20:25:00 vmapache2 ccsd[1550]: Attempt to close an unopened CCS descriptor (32130).
May 10 20:25:00 vmapache2 ccsd[1550]: Error while processing disconnect: Invalid request descriptor
May 10 20:25:00 vmapache2 fenced[1620]: fence "vmapache1.foo.com" failed
May 10 20:25:05 vmapache2 fenced[1620]: fencing node "vmapache1.foo.com"
May 10 20:25:36 vmapache2 fenced[1620]: agent "fence_xvm" reports: Timed out waiting for response
May 10 20:25:36 vmapache2 ccsd[1550]: Attempt to close an unopened CCS descriptor (33270).
May 10 20:25:36 vmapache2 ccsd[1550]: Error while processing disconnect: Invalid request descriptor
May 10 20:25:36 vmapache2 fenced[1620]: fence "vmapache1.foo.com" failed
May 10 20:25:41 vmapache2 fenced[1620]: fencing node "vmapache1.foo.com"
May 10 20:26:11 vmapache2 fenced[1620]: agent "fence_xvm" reports: Timed out waiting for response
May 10 20:26:11 vmapache2 fenced[1620]: fence "vmapache1.foo.com" failed
May 10 20:26:16 vmapache2 fenced[1620]: fencing node "vmapache1.foo.com"
May 10 20:26:47 vmapache2 fenced[1620]: agent "fence_xvm" reports: Timed out waiting for response
May 10 20:26:47 vmapache2 ccsd[1550]: Attempt to close an unopened CCS descriptor (35010).
May 10 20:26:47 vmapache2 ccsd[1550]: Error while processing disconnect: Invalid request descriptor
May 10 20:26:47 vmapache2 fenced[1620]: fence "vmapache1.foo.com" failed
May 10 20:26:52 vmapache2 fenced[1620]: fencing node "vmapache1.foo.com"
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] entering GATHER state from 11.
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] Saving state aru 10 high seq received 10
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] Storing new sequence id for ring 12c
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] entering COMMIT state.
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] entering RECOVERY state.
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] position [0] member 172.19.168.121:
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] previous ring seq 296 rep 172.19.168.121
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] aru a high delivered a received flag 1
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] position [1] member 172.19.168.122:
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] previous ring seq 296 rep 172.19.168.122
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] aru 10 high delivered 10 received flag 1
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] Did not need to originate any messages in recovery.
May 10 20:27:07 vmapache2 openais[1562]: [CLM  ] CLM CONFIGURATION CHANGE
May 10 20:27:07 vmapache2 openais[1562]: [CLM  ] New Configuration:
May 10 20:27:07 vmapache2 openais[1562]: [CLM  ]        r(0) ip(172.19.168.122)
May 10 20:27:07 vmapache2 openais[1562]: [CLM  ] Members Left:
May 10 20:27:07 vmapache2 openais[1562]: [CLM  ] Members Joined:
May 10 20:27:07 vmapache2 openais[1562]: [CLM  ] CLM CONFIGURATION CHANGE
May 10 20:27:07 vmapache2 openais[1562]: [CLM  ] New Configuration:
May 10 20:27:07 vmapache2 openais[1562]: [CLM  ]        r(0) ip(172.19.168.121)
May 10 20:27:07 vmapache2 openais[1562]: [CLM  ]        r(0) ip(172.19.168.122)
May 10 20:27:07 vmapache2 openais[1562]: [CLM  ] Members Left:
May 10 20:27:07 vmapache2 openais[1562]: [CLM  ] Members Joined:
May 10 20:27:07 vmapache2 openais[1562]: [CLM  ]        r(0) ip(172.19.168.121)
May 10 20:27:07 vmapache2 openais[1562]: [SYNC ] This node is within the primary component and will provide service.
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] entering OPERATIONAL state.
May 10 20:27:07 vmapache2 openais[1562]: [CLM  ] got nodejoin message 172.19.168.121
May 10 20:27:07 vmapache2 openais[1562]: [CLM  ] got nodejoin message 172.19.168.122
May 10 20:27:07 vmapache2 openais[1562]: [CPG  ] got joinlist message from node 2
May 10 20:27:23 vmapache2 fenced[1620]: agent "fence_xvm" reports: Timed out waiting for response
May 10 20:27:23 vmapache2 ccsd[1550]: Attempt to close an unopened CCS descriptor (35940).
May 10 20:27:23 vmapache2 ccsd[1550]: Error while processing disconnect: Invalid request descriptor
May 10 20:27:23 vmapache2 fenced[1620]: fence "vmapache1.foo.com" failed
May 10 20:27:29 vmapache2 kernel: dlm: connecting to 1
May 10 20:27:29 vmapache2 kernel: dlm: got connection from 1
May 10 20:27:41 vmapache2 clurgmgrd[1867]: <info> State change: vmapache1.foo.com UP

Here is my cluster.conf file:

<?xml version="1.0"?>
<cluster alias="clusterapache01" config_version="60" name="clusterapache01">
    <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="60"/>
    <clusternodes>
        <clusternode name="vmapache1.foo.com" nodeid="1" votes="1">
            <fence>
                <method name="1">
                    <device domain="vmapache1" name="xenfence1"/>
                </method>
            </fence>
            <multicast addr="225.0.0.1" interface="eth1"/>
        </clusternode>
        <clusternode name="vmapache2.foo.com" nodeid="2" votes="1">
            <fence>
                <method name="1">
                    <device domain="vmapache2" name="xenfence2"/>
                </method>
            </fence>
            <multicast addr="225.0.0.1" interface="eth1"/>
        </clusternode>
    </clusternodes>
    <cman expected_votes="3">
        <multicast addr="225.0.0.1"/>
    </cman>
    <fencedevices>
        <fencedevice agent="fence_xvm" key_file="/etc/cluster/fence_xvm-host1.key" name="xenfence1"/>
        <fencedevice agent="fence_xvm" key_file="/etc/cluster/fence_xvm-host2.key" name="xenfence2"/>
    </fencedevices>
    <rm log_level="7">
        <failoverdomains>
            <failoverdomain name="prefer_node1" nofailback="1" ordered="1" restricted="1">
                <failoverdomainnode name="vmapache1.foo.com" priority="1"/>
                <failoverdomainnode name="vmapache2.foo.com" priority="2"/>
            </failoverdomain>
        </failoverdomains>
        <resources>
            <ip address="172.19.52.120" monitor_link="1"/>
            <netfs export="/data" force_unmount="0" fstype="nfs4" host="172.19.50.114" mountpoint="/var/www/html" name="htdoc" options="rw,no_root_squash"/>
            <apache config_file="conf/httpd.conf" name="web1" server_root="/etc/httpd" shutdown_wait="0"/>
        </resources>
        <service autostart="1" domain="prefer_node1" exclusive="0" name="web-scs" recovery="relocate">
            <ip ref="172.19.52.120"/>
            <apache ref="web1"/>
        </service>
    </rm>
    <fence_xvmd/>
    <totem consensus="4800" join="60" token="10000" token_retransmits_before_loss_const="20"/>
    <quorumd device="/dev/sda1" interval="2" min_score="1" tko="10" votes="1">
        <heuristic interval="2" program="ping -c1 -t1 172.19.52.119" score="1"/>
    </quorumd>
</cluster>


Best Regards,



Carlos Vermejo Ruiz
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
