Re: Problem with service migration with xen domU on different dom0 with redhat 5.4

Problem solved:

In this case I had several network-related problems; now I have both migration and failover working. Something very useful was getting multicast communication working on virbr0 and choosing an explicit multicast address. Another important factor was that in this version fence_xvmd did not recognize the Xen domain names as usual: I had to add -U xen:///, and only with that option did the fence actually get executed.
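
For anyone debugging something similar, a quick way to confirm that the fence_xvm multicast traffic actually reaches the bridge is to watch it with tcpdump on the dom0 (a minimal check, assuming tcpdump is installed; adjust the interface and group to your own setup):

# on the dom0, watch for traffic to the multicast group used by fence_xvm / fence_xvmd
tcpdump -i virbr0 -n host 224.0.0.1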

In rc.local of dom0 I put:

/sbin/fence_xvmd -LX -I virbr0 -U xen:/// -a 224.0.0.1
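
For completeness, roughly how the two keys can be generated and distributed (the paths follow my cluster.conf below; as far as I can tell fence_xvmd reads /etc/cluster/fence_xvm.key by default unless you start it with -k, so the dom0-side copy has to live there or be pointed to explicitly):

# on dom0 #1: create a random shared secret for its fence_xvmd
dd if=/dev/urandom of=/etc/cluster/fence_xvm.key bs=4k count=1
# copy the same secret to both domUs under the name cluster.conf expects
scp /etc/cluster/fence_xvm.key vmapache1.foo.com:/etc/cluster/host-1.key
scp /etc/cluster/fence_xvm.key vmapache2.foo.com:/etc/cluster/host-1.key
# repeat on dom0 #2, distributing its key to the domUs as /etc/cluster/host-2.key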

In cluster.conf I specified the address 224.0.0.1 and also generated two different fence_xvm.key files, one per host. My cluster.conf is:

<?xml version="1.0"?>
<cluster alias="clusterapache01" config_version="87" name="clusterapache01">
    <clusternodes>
        <clusternode name="vmapache1.foo.com" nodeid="1" votes="1">
            <fence>
                <method name="1">
                    <device domain="vmapache1" name="xenfence1"/>
                </method>
            </fence>
            <multicast addr="224.0.0.1"/>
        </clusternode>
        <clusternode name="vmapache2.foo.com" nodeid="2" votes="1">
            <fence>
                <method name="1">
                    <device domain="vmapache2" name="xenfence2"/>
                </method>
            </fence>
            <multicast addr="224.0.0.1"/>
        </clusternode>
    </clusternodes>
    <cman expected_votes="3">
        <multicast addr="224.0.0.1"/>
    </cman>
    <rm log_level="7">
        <failoverdomains>
            <failoverdomain name="prefer_node1" nofailback="1" ordered="1" restricted="1">
                <failoverdomainnode name="vmapache1.foo.com" priority="1"/>
                <failoverdomainnode name="vmapache2.foo.com" priority="2"/>
            </failoverdomain>
        </failoverdomains>
        <resources>
            <ip address="172.19.52.120" monitor_link="1"/>
            <apache config_file="conf/httpd.conf" name="web1" server_root="/etc/httpd" shutdown_wait="0"/>
            <script file="/etc/init.d/httpd" name="httpd"/>
        </resources>
        <service autostart="1" domain="prefer_node1" exclusive="1" name="web-scs" recovery="relocate">
            <ip ref="172.19.52.120"/>
            <script ref="httpd"/>
        </service>
    </rm>
    <totem consensus="4800" join="60" token="10000" token_retransmits_before_loss_const="20"/>
    <fencedevices>
        <fencedevice agent="fence_xvm" key_file="/etc/cluster/host-1.key" name="xenfence1"/>
        <fencedevice agent="fence_xvm" key_file="/etc/cluster/host-2.key" name="xenfence2"/>
    </fencedevices>
    <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
    <quorumd device="/dev/sda1" interval="2" min_score="1" tko="10" votes="1">
        <heuristic interval="2" program="ping -c1 -t1 172.19.52.119" score="1"/>
    </quorumd>
    <fence_xvmd/>
</cluster>
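
With this in place I could also test the fence path by hand from one domU against the other. Something like the following should get an immediate answer from the dom0's fence_xvmd instead of a timeout (the null operation, if your fence_xvm build has it, only checks the round trip; be aware that the default operation is reboot):

# from vmapache2, ask dom0 #1's fence_xvmd about vmapache1 without rebooting it
fence_xvm -a 224.0.0.1 -k /etc/cluster/host-1.key -H vmapache1 -o null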

Now the only thing left I would like to do is add fabric fencing as a backup for when a dom0 goes down. Does anyone have experience using 3Com or D-Link switches for fabric fencing? A rough sketch of the structure I have in mind is below.
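
As I understand it, fenced tries the <method> blocks for a node in order, so a backup would just be a second method per node plus a matching fence device. The agent name and its attributes below are only placeholders (fence_ifmib is a generic SNMP agent, but I have not checked whether it ships in this release or can drive a 3Com/D-Link switch):

        <clusternode name="vmapache1.foo.com" nodeid="1" votes="1">
            <fence>
                <method name="1">
                    <device domain="vmapache1" name="xenfence1"/>
                </method>
                <!-- hypothetical second method, tried only if method 1 fails -->
                <method name="2">
                    <device name="switchfence1" port="10"/>
                </method>
            </fence>
        </clusternode>
        ...
        <!-- placeholder device: agent name, attributes and SWITCH-IP are assumptions -->
        <fencedevice agent="fence_ifmib" community="private" ipaddr="SWITCH-IP" name="switchfence1"/>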

Best Regards,



Carlos Vermejo Ruiz

----- Original Message -----
From: "Carlos VERMEJO RUIZ" <cvermejo@xxxxxxxxxxxxxxxxxxxxxxx>
To: linux-cluster@xxxxxxxxxx
Sent: Monday, May 10, 2010 22:42:42
Subject: Re: Problem with service migration with xen domU on different dom0 with redhat 5.4

I just came back from a trip and made some changes to my cluster.conf, but now I am getting a clearer error:

May 10 20:27:23 vmapache2 ccsd[1550]: Error while processing disconnect: Invalid request descriptor
May 10 20:27:23 vmapache2 fenced[1620]: fence "vmapache1.foo.com" failed

I also got more information telling me that the cluster services on node 1 are down; when I restart rgmanager it starts working again.

More details:

[root@vmapache2 ~]# service rgmanager status
clurgmgrd (pid 1866) is running...
[root@vmapache2 ~]# cman_tool status
Version: 6.2.0
Config Version: 60
Cluster Name: clusterapache01
Cluster Id: 38965
Cluster Member: Yes
Cluster Generation: 300
Membership state: Cluster-Member
Nodes: 2
Expected votes: 3
Quorum device votes: 1
Total votes: 3
Quorum: 2
Active subsystems: 10
Flags: Dirty
Ports Bound: 0 11 177
Node name: vmapache2.foo.com
Node ID: 2
Multicast addresses: 225.0.0.1
Node addresses: 172.19.168.122
[root@vmapache2 ~]#
 
/var/log/messages:
 
May 10 20:27:07 vmapache2 openais[1562]: [CLM  ] got nodejoin message 172.19.168.121
May 10 20:27:07 vmapache2 openais[1562]: [CLM  ] got nodejoin message 172.19.168.122
May 10 20:27:07 vmapache2 openais[1562]: [CPG  ] got joinlist message from node 2
May 10 20:27:23 vmapache2 fenced[1620]: agent "fence_xvm" reports: Timed out waiting for response
May 10 20:27:23 vmapache2 ccsd[1550]: Attempt to close an unopened CCS descriptor (35940).
May 10 20:27:23 vmapache2 ccsd[1550]: Error while processing disconnect: Invalid request descriptor
May 10 20:27:23 vmapache2 fenced[1620]: fence "vmapache1.foo.com" failed
May 10 20:27:29 vmapache2 kernel: dlm: connecting to 1
May 10 20:27:29 vmapache2 kernel: dlm: got connection from 1
May 10 20:27:41 vmapache2 clurgmgrd[1867]: <info> State change: vmapache1.foo.com UP
 
 
[root@vmapache2 ~]# tail -n 100 /var/log/messages
May 10 20:24:25 vmapache2 openais[1562]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
May 10 20:24:25 vmapache2 openais[1562]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
May 10 20:24:25 vmapache2 openais[1562]: [TOTEM] entering GATHER state from 2.
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] entering GATHER state from 0.
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] Creating commit token because I am the rep.
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] Saving state aru 49 high seq received 49
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] Storing new sequence id for ring 128
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] entering COMMIT state.
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] entering RECOVERY state.
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] position [0] member 172.19.168.122:
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] previous ring seq 292 rep 172.19.168.121
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] aru 49 high delivered 49 received flag 1
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] Did not need to originate any messages in recovery.
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] Sending initial ORF token
May 10 20:24:30 vmapache2 openais[1562]: [CLM  ] CLM CONFIGURATION CHANGE
May 10 20:24:30 vmapache2 openais[1562]: [CLM  ] New Configuration:
May 10 20:24:30 vmapache2 fenced[1620]: vmapache1.foo.com not a cluster member after 0 sec post_fail_delay
May 10 20:24:30 vmapache2 kernel: dlm: closing connection to node 1
May 10 20:24:30 vmapache2 clurgmgrd[1867]: <info> State change: vmapache1.foo.com DOWN
May 10 20:24:30 vmapache2 openais[1562]: [CLM  ]        r(0) ip(172.19.168.122)
May 10 20:24:30 vmapache2 fenced[1620]: fencing node "vmapache1.foo.com"
May 10 20:24:30 vmapache2 openais[1562]: [CLM  ] Members Left:
May 10 20:24:30 vmapache2 openais[1562]: [CLM  ]        r(0) ip(172.19.168.121)
May 10 20:24:30 vmapache2 openais[1562]: [CLM  ] Members Joined:
May 10 20:24:30 vmapache2 openais[1562]: [CLM  ] CLM CONFIGURATION CHANGE
May 10 20:24:30 vmapache2 openais[1562]: [CLM  ] New Configuration:
May 10 20:24:30 vmapache2 openais[1562]: [CLM  ]        r(0) ip(172.19.168.122)
May 10 20:24:30 vmapache2 openais[1562]: [CLM  ] Members Left:
May 10 20:24:30 vmapache2 openais[1562]: [CLM  ] Members Joined:
May 10 20:24:30 vmapache2 openais[1562]: [SYNC ] This node is within the primary component and will provide service.
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] entering OPERATIONAL state.
May 10 20:24:30 vmapache2 openais[1562]: [CLM  ] got nodejoin message 172.19.168.122
May 10 20:24:30 vmapache2 openais[1562]: [CPG  ] got joinlist message from node 2
May 10 20:24:35 vmapache2 clurgmgrd[1867]: <info> Waiting for node #1 to be fenced
May 10 20:24:47 vmapache2 qdiskd[1604]: <info> Assuming master role
May 10 20:24:49 vmapache2 openais[1562]: [CMAN ] lost contact with quorum device
May 10 20:24:49 vmapache2 openais[1562]: [CMAN ] quorum lost, blocking activity
May 10 20:24:49 vmapache2 clurgmgrd[1867]: <emerg> #1: Quorum Dissolved
May 10 20:24:49 vmapache2 qdiskd[1604]: <notice> Writing eviction notice for node 1
May 10 20:24:49 vmapache2 openais[1562]: [CMAN ] quorum regained, resuming activity
May 10 20:24:49 vmapache2 clurgmgrd: [1867]: <info> Stopping Service apache:web1
May 10 20:24:49 vmapache2 clurgmgrd: [1867]: <err> Checking Existence Of File /var/run/cluster/apache/apache:web1.pid [apache:web1] > Failed - File Doesn't Exist
May 10 20:24:49 vmapache2 clurgmgrd: [1867]: <info> Stopping Service apache:web1 > Succeed
May 10 20:24:49 vmapache2 clurgmgrd[1867]: <notice> Quorum Regained
May 10 20:24:49 vmapache2 clurgmgrd[1867]: <info> State change: Local UP
May 10 20:24:51 vmapache2 qdiskd[1604]: <notice> Node 1 evicted
May 10 20:25:00 vmapache2 fenced[1620]: agent "fence_xvm" reports: Timed out waiting for response
May 10 20:25:00 vmapache2 ccsd[1550]: Attempt to close an unopened CCS descriptor (32130).
May 10 20:25:00 vmapache2 ccsd[1550]: Error while processing disconnect: Invalid request descriptor
May 10 20:25:00 vmapache2 fenced[1620]: fence "vmapache1.foo.com" failed
May 10 20:25:05 vmapache2 fenced[1620]: fencing node "vmapache1.foo.com"
May 10 20:25:36 vmapache2 fenced[1620]: agent "fence_xvm" reports: Timed out waiting for response
May 10 20:25:36 vmapache2 ccsd[1550]: Attempt to close an unopened CCS descriptor (33270).
May 10 20:25:36 vmapache2 ccsd[1550]: Error while processing disconnect: Invalid request descriptor
May 10 20:25:36 vmapache2 fenced[1620]: fence "vmapache1.foo.com" failed
May 10 20:25:41 vmapache2 fenced[1620]: fencing node "vmapache1.foo.com"
May 10 20:26:11 vmapache2 fenced[1620]: agent "fence_xvm" reports: Timed out waiting for response
May 10 20:26:11 vmapache2 fenced[1620]: fence "vmapache1.foo.com" failed
May 10 20:26:16 vmapache2 fenced[1620]: fencing node "vmapache1.foo.com"
May 10 20:26:47 vmapache2 fenced[1620]: agent "fence_xvm" reports: Timed out waiting for response
May 10 20:26:47 vmapache2 ccsd[1550]: Attempt to close an unopened CCS descriptor (35010).
May 10 20:26:47 vmapache2 ccsd[1550]: Error while processing disconnect: Invalid request descriptor
May 10 20:26:47 vmapache2 fenced[1620]: fence "vmapache1.foo.com" failed
May 10 20:26:52 vmapache2 fenced[1620]: fencing node "vmapache1.foo.com"
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] entering GATHER state from 11.
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] Saving state aru 10 high seq received 10
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] Storing new sequence id for ring 12c
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] entering COMMIT state.
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] entering RECOVERY state.
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] position [0] member 172.19.168.121:
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] previous ring seq 296 rep 172.19.168.121
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] aru a high delivered a received flag 1
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] position [1] member 172.19.168.122:
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] previous ring seq 296 rep 172.19.168.122
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] aru 10 high delivered 10 received flag 1
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] Did not need to originate any messages in recovery.
May 10 20:27:07 vmapache2 openais[1562]: [CLM  ] CLM CONFIGURATION CHANGE
May 10 20:27:07 vmapache2 openais[1562]: [CLM  ] New Configuration:
May 10 20:27:07 vmapache2 openais[1562]: [CLM  ]        r(0) ip(172.19.168.122)
May 10 20:27:07 vmapache2 openais[1562]: [CLM  ] Members Left:
May 10 20:27:07 vmapache2 openais[1562]: [CLM  ] Members Joined:
May 10 20:27:07 vmapache2 openais[1562]: [CLM  ] CLM CONFIGURATION CHANGE
May 10 20:27:07 vmapache2 openais[1562]: [CLM  ] New Configuration:
May 10 20:27:07 vmapache2 openais[1562]: [CLM  ]        r(0) ip(172.19.168.121)
May 10 20:27:07 vmapache2 openais[1562]: [CLM  ]        r(0) ip(172.19.168.122)
May 10 20:27:07 vmapache2 openais[1562]: [CLM  ] Members Left:
May 10 20:27:07 vmapache2 openais[1562]: [CLM  ] Members Joined:
May 10 20:27:07 vmapache2 openais[1562]: [CLM  ]        r(0) ip(172.19.168.121)
May 10 20:27:07 vmapache2 openais[1562]: [SYNC ] This node is within the primary component and will provide service.
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] entering OPERATIONAL state.
May 10 20:27:07 vmapache2 openais[1562]: [CLM  ] got nodejoin message 172.19.168.121
May 10 20:27:07 vmapache2 openais[1562]: [CLM  ] got nodejoin message 172.19.168.122
May 10 20:27:07 vmapache2 openais[1562]: [CPG  ] got joinlist message from node 2
May 10 20:27:23 vmapache2 fenced[1620]: agent "fence_xvm" reports: Timed out waiting for response
May 10 20:27:23 vmapache2 ccsd[1550]: Attempt to close an unopened CCS descriptor (35940).
May 10 20:27:23 vmapache2 ccsd[1550]: Error while processing disconnect: Invalid request descriptor
May 10 20:27:23 vmapache2 fenced[1620]: fence "vmapache1.foo.com" failed
May 10 20:27:29 vmapache2 kernel: dlm: connecting to 1
May 10 20:27:29 vmapache2 kernel: dlm: got connection from 1
May 10 20:27:41 vmapache2 clurgmgrd[1867]: <info> State change: vmapache1.foo.com UP

Here is my cluster.conf file:

<?xml version="1.0"?>
<cluster alias="clusterapache01" config_version="60" name="clusterapache01">
    <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="60"/>
    <clusternodes>
        <clusternode name="vmapache1.foo.com" nodeid="1" votes="1">
            <fence>
                <method name="1">
                    <device domain="vmapache1" name="xenfence1"/>
                </method>
            </fence>
            <multicast addr="225.0.0.1" interface="eth1"/>
        </clusternode>
        <clusternode name="vmapache2.foo.com" nodeid="2" votes="1">
            <fence>
                <method name="1">
                    <device domain="vmapache2" name="xenfence2"/>
                </method>
            </fence>
            <multicast addr="225.0.0.1" interface="eth1"/>
        </clusternode>
    </clusternodes>
    <cman expected_votes="3">
        <multicast addr="225.0.0.1"/>
    </cman>
    <fencedevices>
        <fencedevice agent="fence_xvm" key_file="/etc/cluster/fence_xvm-host1.key" name="xenfence1"/>
        <fencedevice agent="fence_xvm" key_file="/etc/cluster/fence_xvm-host2.key" name="xenfence2"/>
    </fencedevices>
    <rm log_level="7">
        <failoverdomains>
            <failoverdomain name="prefer_node1" nofailback="1" ordered="1" restricted="1">
                <failoverdomainnode name="vmapache1.foo.com" priority="1"/>
                <failoverdomainnode name="vmapache2.foo.com" priority="2"/>
            </failoverdomain>
        </failoverdomains>
        <resources>
            <ip address="172.19.52.120" monitor_link="1"/>
            <netfs export="/data" force_unmount="0" fstype="nfs4" host="172.19.50.114" mountpoint="/var/www/html" name="htdoc" options="rw,no_root_squash"/>
            <apache config_file="conf/httpd.conf" name="web1" server_root="/etc/httpd" shutdown_wait="0"/>
        </resources>
        <service autostart="1" domain="prefer_node1" exclusive="0" name="web-scs" recovery="relocate">
            <ip ref="172.19.52.120"/>
            <apache ref="web1"/>
        </service>
    </rm>
    <fence_xvmd/>
    <totem consensus="4800" join="60" token="10000" token_retransmits_before_loss_const="20"/>
    <quorumd device="/dev/sda1" interval="2" min_score="1" tko="10" votes="1">
        <heuristic interval="2" program="ping -c1 -t1 172.19.52.119" score="1"/>
    </quorumd>
</cluster>


Best Regards,



Carlos Vermejo Ruiz
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
