I'm running a libvirtd setup on top of KVM/QEMU and my experience is similar to yours. I have to force power off the VMs to be able to reboot them, I lose quorum from time to time, and so on. I have also noticed poor GFS2 performance inside such a setup, and I'm starting to think it has something to do with virtualization and that there is something about the cluster manager we simply don't know, probably some tweaking that is not yet in the docs. I'm using SL6, by the way, which is very, very close to RHEL 6. Unfortunately, I don't have the time to test with CentOS 5 on the VMs, or with the most recent Fedora. Could it be something specific to RHEL 6?
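One thing worth checking in a virtualized setup, since cman/corosync heartbeats go over multicast, is whether multicast actually makes it between the guests (IGMP snooping on the virtual switch is a common culprit). A rough sketch with omping, run on all three nodes at once; only 10.10.16.11 is taken from the cman_tool output below, the other two addresses are guesses:

# run this simultaneously on every node, assuming omping is installed
omping 10.10.16.11 10.10.16.12 10.10.16.13

If the multicast replies drop out or show heavy loss while the unicast ones keep flowing, that would go a long way toward explaining the lost quorum.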
On Thu, Mar 17, 2011 at 10:29 PM, Richard Allen <ra@xxxxx> wrote:
I have a simple test cluster up and running (RHEL 6 HA) on three VMware guests. Each VMware guest has 3 vNICs.
After booting a node, I often get a dead rgmanager:
[root@syseng1-vm ~]# service rgmanager status
rgmanager dead but pid file exists
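The "dead but pid file exists" status just means the init script found a pid file with no matching process behind it. A quick way to confirm (pid file path assumed here, it may differ on your build):

[root@syseng1-vm ~]# cat /var/run/rgmanager.pid          # pid left behind by the failed start
[root@syseng1-vm ~]# ps -p $(cat /var/run/rgmanager.pid) # no process listed = rgmanager really is gone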
The cluster is otherwise OK:
[root@syseng1-vm ~]# clustat
Cluster Status for RHEL6Test @ Thu Mar 17 16:10:38 2011
Member Status: Quorate
 Member Name                             ID   Status
 ------ ----                             ---- ------
 syseng1-vm                                 1 Online, Local
 syseng2-vm                                 2 Online
 syseng3-vm                                 3 Online
There is a service running on node 2, but clustat shows no information about it.
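Service state in clustat comes from rgmanager itself, so a node whose rgmanager is down only sees membership. Running clustat on a node where rgmanager is alive should still list the service, which at least narrows it down to the local rgmanager rather than the service:

[root@syseng2-vm ~]# clustat    # same command, just from node 2 where rgmanager is running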
[root@syseng1-vm ~]# cman_tool status
Version: 6.2.0
Config Version: 9
Cluster Name: RHEL6Test
Cluster Id: 36258
Cluster Member: Yes
Cluster Generation: 88
Membership state: Cluster-Member
Nodes: 3
Expected votes: 3
Total votes: 3
Node votes: 1
Quorum: 2
Active subsystems: 1
Flags:
Ports Bound: 0
Node name: syseng1-[CENSORED]
Node ID: 1
Multicast addresses: 239.192.141.48
Node addresses: 10.10.16.11
The syslog has some info:
Mar 17 15:47:55 syseng1-vm rgmanager[2463]: Quorum formed
Mar 17 15:47:55 syseng1-vm kernel: dlm: no local IP address has been set
Mar 17 15:47:55 syseng1-vm kernel: dlm: cannot start dlm lowcomms -107
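-107 is ENOTCONN; those dlm messages usually mean the kernel dlm was asked to join a lockspace before dlm_controld had pushed the node addresses into configfs. A rough way to see how far dlm_controld got (configfs layout assumed from the kernel dlm interface):

[root@syseng1-vm ~]# ps -C dlm_controld -o pid,args            # is it running at all?
[root@syseng1-vm ~]# ls /sys/kernel/config/dlm/cluster/comms/  # should have one entry per node id once comms are set up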
The fix is always the same:
[root@syseng1-vm ~]# service cman restart
Stopping cluster:
   Leaving fence domain...                                 [  OK  ]
   Stopping gfs_controld...                                [  OK  ]
   Stopping dlm_controld...                                [  OK  ]
   Stopping fenced...                                      [  OK  ]
   Stopping cman...                                        [  OK  ]
   Waiting for corosync to shutdown:                       [  OK  ]
   Unloading kernel modules...                             [  OK  ]
   Unmounting configfs...                                  [  OK  ]
Starting cluster:
   Checking Network Manager...                             [  OK  ]
   Global setup...                                         [  OK  ]
   Loading kernel modules...                               [  OK  ]
   Mounting configfs...                                    [  OK  ]
   Starting cman...                                        [  OK  ]
   Waiting for quorum...                                   [  OK  ]
   Starting fenced...                                      [  OK  ]
   Starting dlm_controld...                                [  OK  ]
   Starting gfs_controld...                                [  OK  ]
   Unfencing self...                                       [  OK  ]
   Joining fence domain...                                 [  OK  ]
[root@syseng1-vm ~]# service rgmanager restart
Stopping Cluster Service Manager:                          [  OK  ]
Starting Cluster Service Manager:                          [  OK  ]
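Given that a manual cman restart plus rgmanager restart always clears it, it looks a lot like a start-ordering problem at boot, with rgmanager coming up before cman/dlm_controld are fully ready. Worth a quick look at the init links (runlevel 3 assumed here; use whichever one you boot into):

[root@syseng1-vm ~]# chkconfig --list | grep -E 'cman|rgmanager'
[root@syseng1-vm ~]# ls /etc/rc3.d/ | grep -E 'cman|rgmanager'   # the S numbers show which one starts first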
[root@syseng1-vm ~]# clustat
Cluster Status for RHEL6Test @ Thu Mar 17 16:22:01 2011
Member Status: Quorate
 Member Name                             ID   Status
 ------ ----                             ---- ------
 syseng1-vm                                 1 Online, Local, rgmanager
 syseng2-vm                                 2 Online, rgmanager
 syseng3-vm                                 3 Online

 Service Name                   Owner (Last)                   State
 ------- ----                   ----- ------                   -----
 service:TestDB                 syseng2-vm                     started
Sometimes restarting rgmanager hangs and the node needs to be rebooted.
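If it hangs again, it may be worth grabbing a little state before forcing the reboot, to see whether it is rgmanager itself or a stuck fence/dlm group underneath it. A minimal sketch using the stock cluster tools:

[root@syseng1-vm ~]# group_tool ls     # any group stuck mid-transition?
[root@syseng1-vm ~]# fence_tool ls     # fence domain membership
[root@syseng1-vm ~]# clustat -x        # XML dump of whatever clustat can still reach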
My cluster.conf:
<?xml version="1.0"?>
<cluster config_version="9" name="RHEL6Test">
  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="syseng1-vm" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="syseng1-vm"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="syseng2-vm" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="syseng2-vm"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="syseng3-vm" nodeid="3" votes="1">
      <fence>
        <method name="1">
          <device name="syseng3-vm"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman/>
  <fencedevices>
    <fencedevice agent="fence_vmware" ipaddr="vcenter-lab" login="Administrator" name="syseng1-vm" passwd="[CENSORED]" port="syseng1-vm"/>
    <fencedevice agent="fence_vmware" ipaddr="vcenter-lab" login="Administrator" name="syseng2-vm" passwd="[CENSORED]" port="syseng2-vm"/>
    <fencedevice agent="fence_vmware" ipaddr="vcenter-lab" login="Administrator" name="syseng3-vm" passwd="[CENSORED]" port="syseng3-vm"/>
  </fencedevices>
  <rm>
    <failoverdomains>
      <failoverdomain name="AllNodes" nofailback="0" ordered="0" restricted="0">
        <failoverdomainnode name="syseng1-vm" priority="1"/>
        <failoverdomainnode name="syseng2-vm" priority="1"/>
        <failoverdomainnode name="syseng3-vm" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      <ip address="10.10.16.234" monitor_link="on" sleeptime="10"/>
      <fs device="/dev/vgpg/pgsql" fsid="62946" mountpoint="/opt/rg" name="SharedDisk"/>
      <script file="/etc/rc.d/init.d/postgresql" name="postgresql"/>
    </resources>
    <service autostart="1" domain="AllNodes" exclusive="0" name="TestDB" recovery="relocate">
      <ip ref="10.10.16.234"/>
      <fs ref="SharedDisk"/>
      <script ref="postgresql"/>
    </service>
  </rm>
</cluster>
Anyone have any ideas about what is going on?
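Nothing in the cluster.conf jumps out at me, but two quick sanity checks might be worth running; this is just a sketch with the stock tools, and note that the second command really power-cycles the target guest through fence_vmware (the node name is only an example):

[root@syseng1-vm ~]# ccs_config_validate      # checks /etc/cluster/cluster.conf against the schema
[root@syseng1-vm ~]# fence_node syseng2-vm    # WARNING: actually fences (power-cycles) that node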
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster