Re: DLM problem

On Thu, Mar 17, 2011 at 10:29 PM, Richard Allen <ra@xxxxx> wrote:
I have a simple test cluster up and running (RHEL 6 HA) on three VMware guests. Each guest has three vNICs.

After booting a node, I often get a dead rgmanager:

[root@syseng1-vm ~]# service rgmanager status
rgmanager dead but pid file exists

Cluster is otherwise OK

[root@syseng1-vm ~]# clustat
Cluster Status for RHEL6Test @ Thu Mar 17 16:10:38 2011
Member Status: Quorate

 Member Name                                   ID   Status
 ------ ----                                   ---- ------
 syseng1-vm                                       1 Online, Local
 syseng2-vm                                       2 Online
 syseng3-vm                                       3 Online

There is a service running on node 2, but clustat shows no information about it.



[root@syseng1-vm ~]# cman_tool status
Version: 6.2.0
Config Version: 9
Cluster Name: RHEL6Test
Cluster Id: 36258
Cluster Member: Yes
Cluster Generation: 88
Membership state: Cluster-Member
Nodes: 3
Expected votes: 3
Total votes: 3
Node votes: 1
Quorum: 2
Active subsystems: 1
Flags:
Ports Bound: 0
Node name: syseng1-[CENSORED]
Node ID: 1
Multicast addresses: 239.192.141.48
Node addresses: 10.10.16.11


The syslog has some info:

Mar 17 15:47:55 syseng1-vm rgmanager[2463]: Quorum formed
Mar 17 15:47:55 syseng1-vm kernel: dlm: no local IP address has been set
Mar 17 15:47:55 syseng1-vm kernel: dlm: cannot start dlm lowcomms -107
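My reading of this is that rgmanager asks the kernel DLM to start before dlm_controld has written this node's address into configfs, so lowcomms has no local IP to bind to (107 looks like ENOTCONN). A quick way to see whether the comms entries exist at that point (just a sketch; the configfs paths are my assumption of the usual RHEL 6 layout):

# configfs must be mounted for dlm_controld to publish node addresses
mount | grep configfs

# dlm_controld normally creates one directory per cluster node here and
# marks this node with local=1; if the directories are missing, the
# "no local IP address has been set" error is what I'd expect
ls /sys/kernel/config/dlm/cluster/comms/
cat /sys/kernel/config/dlm/cluster/comms/*/local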


The fix is always the same:

[root@syseng1-vm ~]# service cman restart
Stopping cluster:
   Leaving fence domain...                              [  OK  ]
   Stopping gfs_controld...                             [  OK  ]
   Stopping dlm_controld...                             [  OK  ]
   Stopping fenced...                                   [  OK  ]
   Stopping cman...                                     [  OK  ]
   Waiting for corosync to shutdown:                    [  OK  ]
   Unloading kernel modules...                          [  OK  ]
   Unmounting configfs...                               [  OK  ]
Starting cluster:
   Checking Network Manager...                          [  OK  ]
   Global setup...                                      [  OK  ]
   Loading kernel modules...                            [  OK  ]
   Mounting configfs...                                 [  OK  ]
   Starting cman...                                     [  OK  ]
   Waiting for quorum...                                [  OK  ]
   Starting fenced...                                   [  OK  ]
   Starting dlm_controld...                             [  OK  ]
   Starting gfs_controld...                             [  OK  ]
   Unfencing self...                                    [  OK  ]
   Joining fence domain...                              [  OK  ]

[root@syseng1-vm ~]# service rgmanager restart
Stopping Cluster Service Manager:                       [  OK  ]
Starting Cluster Service Manager:                       [  OK  ]



[root@syseng1-vm ~]# clustat
Cluster Status for RHEL6Test @ Thu Mar 17 16:22:01 2011
Member Status: Quorate

 Member Name                                   ID   Status
 ------ ----                                   ---- ------
 syseng1-vm                                       1 Online, Local, rgmanager
 syseng2-vm                                       2 Online, rgmanager
 syseng3-vm                                       3 Online

 Service Name                   Owner (Last)                   State
 ------- ----                   ----- ------                   -----
 service:TestDB                 syseng2-vm                     started



Sometimes restarting rgmanager hangs and the node needs to be rebooted.
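In case it is a boot-time ordering problem, a quick way to confirm that rgmanager is not being started before cman (which brings up dlm_controld) is just the standard SysV tooling:

# cman should have a lower S-number than rgmanager in the runlevel
chkconfig --list cman
chkconfig --list rgmanager
ls /etc/rc3.d/ | grep -E 'cman|rgmanager'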


I'm running a libvirtd setup on top of KVM/QEMU and my experience is similar to yours. I have to force power off the VMs to be able to reboot them, I lose quorum from time to time, and so on. I've also noticed poor gfs2 performance inside such a setup, and I'm starting to think it has something to do with virtualization and that there is something about the cluster manager we simply don't know about, probably some tweaking that is not yet in the docs. I'm using SL6, by the way, which is very, very close to RHEL 6. Unfortunately I don't have the time to test with CentOS 5 on the VMs, or with the most recent Fedora. Perhaps it is something specific to RHEL 6?
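One thing I still want to rule out on my side is multicast inside the hypervisor, since corosync depends on it for the totem ring. A rough way to check it (assuming the stock corosync 1.x tools are installed; treat this as a sketch, not a recipe):

# ring status as corosync sees it (healthy rings report "no faults")
corosync-cfgtool -s

# members corosync currently knows about
corosync-objctl | grep member

# if omping is available, it exercises multicast between the guests
# (run it on all three nodes at the same time)
omping syseng1-vm syseng2-vm syseng3-vm
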
my cluster.conf:


<?xml version="1.0"?>
<cluster config_version="9" name="RHEL6Test">
  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="syseng1-vm" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="syseng1-vm"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="syseng2-vm" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="syseng2-vm"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="syseng3-vm" nodeid="3" votes="1">
      <fence>
        <method name="1">
          <device name="syseng3-vm"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman/>
  <fencedevices>
    <fencedevice agent="fence_vmware" ipaddr="vcenter-lab" login="Administrator" name="syseng1-vm" passwd="[CENSORED]" port="syseng1-vm"/>
    <fencedevice agent="fence_vmware" ipaddr="vcenter-lab" login="Administrator" name="syseng2-vm" passwd="[CENSORED]" port="syseng2-vm"/>
    <fencedevice agent="fence_vmware" ipaddr="vcenter-lab" login="Administrator" name="syseng3-vm" passwd="[CENSORED]" port="syseng3-vm"/>
  </fencedevices>
  <rm>
    <failoverdomains>
      <failoverdomain name="AllNodes" nofailback="0" ordered="0" restricted="0">
        <failoverdomainnode name="syseng1-vm" priority="1"/>
        <failoverdomainnode name="syseng2-vm" priority="1"/>
        <failoverdomainnode name="syseng3-vm" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      <ip address="10.10.16.234" monitor_link="on" sleeptime="10"/>
      <fs device="/dev/vgpg/pgsql" fsid="62946" mountpoint="/opt/rg" name="SharedDisk"/>
      <script file="/etc/rc.d/init.d/postgresql" name="postgresql"/>
    </resources>
    <service autostart="1" domain="AllNodes" exclusive="0" name="TestDB" recovery="relocate">
      <ip ref="10.10.16.234"/>
      <fs ref="SharedDisk"/>
      <script ref="postgresql"/>
    </service>
  </rm>
</cluster>
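For anyone who wants to sanity-check the config itself, the stock RHEL 6 tools I know of are below (a sketch; rg_test ships with rgmanager and only parses the resource tree, it does not start anything):

# validate cluster.conf against the schema shipped with cman
ccs_config_validate

# evaluate the <rm> resource tree the way rgmanager would, without starting services
rg_test test /etc/cluster/cluster.conf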



Anyone have any ideas on what is going on?


--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
