I have a simple test cluster up and running (RHEL 6 HA) on three vmware
guests. Each vmware guest has 3 vnic's.
After booting a node, I often get a dead rgmanager:
[root@syseng1-vm ~]# service rgmanager status
rgmanager dead but pid file exists
Cluster is otherwise OK
[root@syseng1-vm ~]# clustat
Cluster Status for RHEL6Test @ Thu Mar 17 16:10:38 2011
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
syseng1-vm 1 Online, Local
syseng2-vm 2 Online
syseng3-vm 3 Online
There is a service running on node2 but clustat has no info on that.
[root@syseng1-vm ~]# cman_tool status
Version: 6.2.0
Config Version: 9
Cluster Name: RHEL6Test
Cluster Id: 36258
Cluster Member: Yes
Cluster Generation: 88
Membership state: Cluster-Member
Nodes: 3
Expected votes: 3
Total votes: 3
Node votes: 1
Quorum: 2
Active subsystems: 1
Flags:
Ports Bound: 0
Node name: syseng1-[CENSORED]
Node ID: 1
Multicast addresses: 239.192.141.48
Node addresses: 10.10.16.11
The syslog has some info:
Mar 17 15:47:55 syseng1-vm rgmanager[2463]: Quorum formed
Mar 17 15:47:55 syseng1-vm kernel: dlm: no local IP address has been set
Mar 17 15:47:55 syseng1-vm kernel: dlm: cannot start dlm lowcomms -107
The fix is always the same:
[root@syseng1-vm ~]# service cman restart
Stopping cluster:
Leaving fence domain... [ OK ]
Stopping gfs_controld... [ OK ]
Stopping dlm_controld... [ OK ]
Stopping fenced... [ OK ]
Stopping cman... [ OK ]
Waiting for corosync to shutdown: [ OK ]
Unloading kernel modules... [ OK ]
Unmounting configfs... [ OK ]
Starting cluster:
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... [ OK ]
Waiting for quorum... [ OK ]
Starting fenced... [ OK ]
Starting dlm_controld... [ OK ]
Starting gfs_controld... [ OK ]
Unfencing self... [ OK ]
Joining fence domain... [ OK ]
[root@syseng1-vm ~]# service rgmanager restart
Stopping Cluster Service Manager: [ OK ]
Starting Cluster Service Manager: [ OK ]
[root@syseng1-vm ~]# clustat
Cluster Status for RHEL6Test @ Thu Mar 17 16:22:01 2011
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
syseng1-vm 1 Online, Local, rgmanager
syseng2-vm 2 Online, rgmanager
syseng3-vm 3 Online
Service Name Owner
(Last) State
------- ---- -----
------ -----
service:TestDB syseng2-vm
started
Sometimes restarting rgmanager hangs and the node needs to be rebooted.
my cluster.conf:
<?xml version="1.0"?>
<cluster config_version="9" name="RHEL6Test">
<fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
<clusternodes>
<clusternode name="syseng1-vm" nodeid="1" votes="1">
<fence>
<method name="1">
<device name="syseng1-vm"/>
</method>
</fence>
</clusternode>
<clusternode name="syseng2-vm" nodeid="2" votes="1">
<fence>
<method name="1">
<device name="syseng2-vm"/>
</method>
</fence>
</clusternode>
<clusternode name="syseng3-vm" nodeid="3" votes="1">
<fence>
<method name="1">
<device name="syseng3-vm"/>
</method>
</fence>
</clusternode>
</clusternodes>
<cman/>
<fencedevices>
<fencedevice agent="fence_vmware" ipaddr="vcenter-lab"
login="Administrator" name="syseng1-vm" passwd="[CENSORED]"
port="syseng1-vm"/>
<fencedevice agent="fence_vmware" ipaddr="vcenter-lab"
login="Administrator" name="syseng2-vm" passwd="[CENSORED]"
port="syseng2-vm"/>
<fencedevice agent="fence_vmware" ipaddr="vcenter-lab"
login="Administrator" name="syseng3-vm" passwd="[CENSORED]"
port="syseng3-vm"/>
</fencedevices>
<rm>
<failoverdomains>
<failoverdomain name="AllNodes" nofailback="0" ordered="0" restricted="0">
<failoverdomainnode name="syseng1-vm" priority="1"/>
<failoverdomainnode name="syseng2-vm" priority="1"/>
<failoverdomainnode name="syseng3-vm" priority="1"/>
</failoverdomain>
</failoverdomains>
<resources>
<ip address="10.10.16.234" monitor_link="on" sleeptime="10"/>
<fs device="/dev/vgpg/pgsql" fsid="62946" mountpoint="/opt/rg"
name="SharedDisk"/>
<script file="/etc/rc.d/init.d/postgresql" name="postgresql"/>
</resources>
<service autostart="1" domain="AllNodes" exclusive="0" name="TestDB"
recovery="relocate">
<ip ref="10.10.16.234"/>
<fs ref="SharedDisk"/>
<script ref="postgresql"/>
</service>
</rm>
</cluster>
Anyone have any ideas in what is going on?
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster