Dear all:

I am new to this list and to cluster technology. Nevertheless, I managed to set up a cluster based on CentOS 5 with two nodes, and it worked very well for several months. Even several CentOS update rounds (all within version 5) went flawlessly. The cluster runs three paravirtualized Xen-based virtual machines stored on an iSCSI storage vault, and even failover and failback worked perfectly. Cluster control/management is handled by a separate standalone PC running Conga. Both cluster nodes and the adminpc run CentOS 5.

After another CentOS update round in October, the cluster wouldn't start anymore. We got that part solved (cman wouldn't start, but manually updating to a newer openais package, 0.80.6, let us overcome it), but now the virtual machines always get started on all nodes simultaneously. Furthermore, something in the Conga setup also seems to have broken: the Conga web interface on the separate adminpc can still be accessed, but it fails when probing storage (broken ricci/luci communication?). This never happened before the upgrade, and we changed neither hardware nor software configuration during the update. Unfortunately, I no longer have access to the testing system (but we *did* do a lot of testing before putting the system into production use).

I would appreciate it if more experienced people could review our configuration and point out any errors or improvements.

The setup: the cluster has two nodes (station1, station2) and one standalone PC for administration running Conga (adminpc). The nodes are standard Dell 1950 servers. The main storage location is a Dell storage vault, accessed via iSCSI and mounted on both nodes as /rootfs/. The file system is GFS2. The vault also provides a quorum partition. Fencing is handled via the included DRAC remote access boards. There are three paravirtualized Xen-based virtual machines (vm_mailserver, vm_ldapserver, vm_adminserver). Their container files are located at /rootfs/vmadminserver etc. The VMs are supposed to start distributed across station1 (vm_mailserver) and station2 (vm_ldapserver, vm_adminserver).
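For reference, this is roughly how the state described above can be inspected on each node with the standard cluster and Xen tools (a minimal sketch only; run as root, hostnames/VM names as above):

===quote status commands===
# rgmanager's view: member nodes, quorum disk, and where each vm: service runs
clustat

# cman's view: quorum state, expected votes, membership
cman_tool status
cman_tool nodes

# Xen's own view of the domains actually running on *this* node
xm list
===unquote===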
Software versions (identical on both nodes):

kernel 2.6.18-164.el5xen
openais-0.80.6-8.el5
cman-2.0.115-1.el5
rgmanager-2.0.52-1.el5.centos
xen-3.0.3-80.el5-3.3
xen-libs-3.0.3-80.el5-3.3
luci-0.12.1-7.3.el5.centos.1
ricci-0.12.1-7.3.el5.centos.1
gfs2-utils-0.1.62-1.el5

The cluster.conf that worked before the CentOS update (and no longer does) was:

===quote nonworking cluster.conf===
<?xml version="1.0"?>
<cluster alias="example_cluster_1" config_version="81" name="example_cluster_1">
  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="30"/>
  <clusternodes>
    <clusternode name="station1.example.com" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="station1_fenced"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="station2.example.com" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="station2_fenced"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman expected_votes="3" two_node="0"/>
  <fencedevices>
    <fencedevice agent="fence_ipmilan" ipaddr="172.16.10.91" login="ipmi_admin" name="station1_fenced" operation="off" passwd="secret"/>
    <fencedevice agent="fence_ipmilan" ipaddr="172.16.10.92" login="ipmi_admin" name="station2_fenced" operation="off" passwd="secret"/>
  </fencedevices>
  <rm>
    <failoverdomains>
      <failoverdomain name="bias-station1" nofailback="0" ordered="0" restricted="0">
        <failoverdomainnode name="station1.example.com" priority="1"/>
      </failoverdomain>
      <failoverdomain name="bias-station2" nofailback="0" ordered="0" restricted="0">
        <failoverdomainnode name="station2.example.com" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <resources/>
    <vm autostart="1" domain="bias-station1" exclusive="0" migrate="live" name="vm_mailserver" path="/rootfs" recovery="restart"/>
    <vm autostart="1" domain="bias-station2" exclusive="0" migrate="live" name="vm_ldapserver" path="/rootfs" recovery="restart"/>
    <vm autostart="1" domain="bias-station2" exclusive="0" migrate="live" name="vm_adminserver" path="/rootfs" recovery="restart"/>
  </rm>
  <quorumd interval="3" label="xen_qdisk" min_score="1" tko="23" votes="1"/>
</cluster>
===unquote nonworking cluster.conf===

As explained, this configuration worked flawlessly for 10 months. Only after the CentOS update did it start the virtual machines simultaneously on both station1 *and* station2, instead of distributed as per the <vm .../> directives. We temporarily worked around this problem by changing the autostart parameter to <vm autostart="0" ...>. At least this brought our cluster back up, but we lost the desired automatic restart should a system hang, and failover also doesn't seem to work anymore. I read several messages on this list where users seem to have had a similar problem. It seems to me as if I had missed the use_virsh="0" statement.
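If I understand the list archives correctly, newer rgmanager builds default to managing <vm> resources through virsh rather than xm, which would explain the changed behaviour; use_virsh="0" is supposed to force the old xm path. Before touching the production config I would check along these lines (a sketch only, based on my understanding of the tools; please correct me if this is the wrong approach):

===quote checks===
# compare what libvirt and Xen itself know about the domains;
# if virsh does not list them under the names used in cluster.conf,
# that would fit the use_virsh theory
virsh list --all
xm list

# let rgmanager parse a candidate cluster.conf off-line before deploying it
rg_test test /etc/cluster/cluster.conf
===unquote===

My understanding is also that a changed cluster.conf needs an incremented config_version and should be propagated with ccs_tool update /etc/cluster/cluster.conf (or cman_tool version -r) rather than by copying the file around; I would be glad to have that confirmed as well.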
Hence my question: is the following a valid cluster.conf for such a setup (distributed VMs, automatic start, failover/failback)?

===quote===
<?xml version="1.0"?>
<cluster alias="example_cluster_1" config_version="81" name="example_cluster_1">
  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="30"/>
  <clusternodes>
    <clusternode name="station1.example.com" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="station1_fenced"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="station2.example.com" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="station2_fenced"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman expected_votes="3" two_node="0"/>
  <fencedevices>
    <fencedevice agent="fence_ipmilan" ipaddr="172.16.10.91" login="ipmi_admin" name="station1_fenced" operation="off" passwd="secret"/>
    <fencedevice agent="fence_ipmilan" ipaddr="172.16.10.92" login="ipmi_admin" name="station2_fenced" operation="off" passwd="secret"/>
  </fencedevices>
  <rm>
    <failoverdomains>
      <failoverdomain name="bias-station1" nofailback="0" ordered="0" restricted="0">
        <failoverdomainnode name="station1.example.com" priority="1"/>
      </failoverdomain>
      <failoverdomain name="bias-station2" nofailback="0" ordered="0" restricted="0">
        <failoverdomainnode name="station2.example.com" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <resources/>
    <vm autostart="1" use_virsh="0" domain="bias-station1" exclusive="0" migrate="live" name="vm_mailserver" path="/rootfs" recovery="restart"/>
    <vm autostart="1" use_virsh="0" domain="bias-station2" exclusive="0" migrate="live" name="vm_ldapserver" path="/rootfs" recovery="restart"/>
    <vm autostart="1" use_virsh="0" domain="bias-station2" exclusive="0" migrate="live" name="vm_adminserver" path="/rootfs" recovery="restart"/>
  </rm>
  <quorumd interval="3" label="xen_qdisk" min_score="1" tko="23" votes="1"/>
</cluster>
===unquote===

I am open to further updates/testing and will gladly provide additional details if needed. But as this setup also contains production systems, I want to avoid any fundamental mistakes/oversights. Needless to say, I would appreciate any feedback/suggestions!

Regards,
Wolf

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster