Re: Cluster of XEN guests unstable when rebooting a node under CS5.1


 



Good! That seems to be the right solution. My answers/comments are below.

Thanks, Paolo


On Wed, 2007-12-12 at 19:23 +0100, Paolo Marini wrote:
I reiterate the request for help, hoping someone has encountered (and hopefully solved) the same issues.

I am building a cluster of Xen guests whose root file systems reside in files on a GFS filesystem (backed by iSCSI).

Each cluster node mounts a GFS file system residing on an iSCSI device.

For performance reasons, both the iSCSI device and the physical nodes (themselves part of a cluster) use two Gigabit Ethernet links with bonding and LACP. On the physical machines, I had to insert a sleep 30 into the /etc/init.d/iscsi script before the iSCSI login, to wait for the bond interface to come up; otherwise the iSCSI devices are not seen and no GFS mount is possible.
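For reference, a minimal sketch of that workaround, assuming the stock CentOS 5 /etc/init.d/iscsi layout (the exact contents of start() vary by release):

   # /etc/init.d/iscsi -- start() function (sketch, not the full script)
   start() {
       # Workaround: bond0 can take a while to finish LACP negotiation,
       # and without this delay the iSCSI targets are unreachable when
       # the login below runs.
       sleep 30

       # The stock script then brings up iscsid and logs in to all
       # targets marked automatic.
       service iscsid start
       iscsiadm -m node --loginall=automatic
   }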

Moving on to the cluster of Xen guests: they work fine, and I am able to migrate each one to a different physical node without problems on the guest.

When I reboot or fence one of the guests, the guest cluster breaks, i.e. quorum is dissolved, and I have to fence ALL the nodes and reboot them in order for the cluster to restart.

How many guests - and what are you using for fencing?

I am using 5 guests: 4 are in a cluster and the remaining one is a management node (Nagios etc.). I am using fence_xvm for fencing, and it is correctly configured and working. Each physical node is a Dell PE860 with 4 GB of RAM, one quad-core Xeon, and 3 network interfaces; two are used for bonding and the third is reserved for IPMI (which I use for fencing the physical nodes).
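For anyone setting this up, a minimal sketch of what the fence_xvm entries look like in the guest cluster's cluster.conf (node and device names here are placeholders):

   <clusternode name="guest1" nodeid="1">
       <fence>
           <method name="1">
               <!-- "domain" is the Xen domain name of this guest -->
               <device name="xvm" domain="guest1"/>
           </method>
       </fence>
   </clusternode>

   <fencedevices>
       <fencedevice name="xvm" agent="fence_xvm"/>
   </fencedevices>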

The guests each have two network interfaces configured (eth0 and the alias eth0:0): one for private communication between the nodes and to the iSCSI device, the other for public access to the nodes. I am not using VLANs.
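As an aside, the alias interface on each guest is just a standard RHEL 5 style alias file, along these lines (addresses here are hypothetical):

   # /etc/sysconfig/network-scripts/ifcfg-eth0:0 (sketch)
   DEVICE=eth0:0
   ONBOOT=yes
   BOOTPROTO=none
   IPADDR=192.168.10.11
   NETMASK=255.255.255.0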
Does it have to do with the Xen bridge going up and down for longer than the heartbeat timeout?

Not sure - it shouldn't be that big of a deal.  If you think that's the
problem, try adding:

   <totem token="30000"/>

to the VM cluster's cluster.conf.

-- Lon

It seems much more stable; further tests will confirm this. So far, an xm destroy on a guest causes the whole cluster of guests to stay up, detect the missing guest, and fence it successfully. The machine restarts and rejoins the cluster.
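For completeness, a minimal sketch of where that element sits, at the top level of the guest cluster's cluster.conf (the cluster name and version here are placeholders):

   <?xml version="1.0"?>
   <cluster name="vmcluster" config_version="2">
       <!-- Raise the totem token timeout to 30 seconds so that a slow
            Xen bridge transition does not dissolve quorum. -->
       <totem token="30000"/>
       <clusternodes>
           <!-- guest node and fence definitions go here -->
       </clusternodes>
       <fencedevices>
           <!-- fence_xvm device definition goes here -->
       </fencedevices>
   </cluster>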


--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster



