Virt-store use case - HA failure issue - suggestions needed

Vince Loschiavo <vloschiavo@xxxxxxxxx> · Thu, 31 Jul 2014 09:22:16 -0700

I'm currently testing Gluster 3.5.1 in a two-server QEMU/KVM environment on CentOS 6.5:
Two servers (KVM07 & KVM08), with a two-brick replicated volume (one brick per server).

I've tuned the volume per the documentation here: http://gluster.org/documentation/use_cases/Virt-store-usecase/

I have the gluster volume FUSE-mounted on KVM07 and KVM08 and am using it to store raw disk images.

KVM is using the FUSE-mounted volume as a "dir: Filesystem Directory" storage pool.

With dynamic_ownership = 0 set in /etc/libvirt/qemu.conf and the image files chown-ed to qemu:qemu, live migration works great.

Problem:
If I need to take down one of these servers for maintenance, I live migrate the VMs to the other server, then run
service gluster stop
and kill all the remaining gluster and brick processes.
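
The sequence above can be sketched as a dry-run script. This is only an illustration of the steps as described, not a recommended procedure: the VM names, the qemu+ssh migration URI and the process names passed to pkill are assumptions, and the service name is copied as written above.

```python
import shlex


def maintenance_commands(vms, dest_host):
    """Return, in order, the commands for the procedure described above."""
    commands = []
    # Step 1: live migrate each VM to the server that stays up.
    for vm in vms:
        commands.append(
            ["virsh", "migrate", "--live", vm, f"qemu+ssh://{dest_host}/system"]
        )
    # Step 2: stop the gluster service (service name as written in the post).
    commands.append(["service", "gluster", "stop"])
    # Step 3: kill leftover brick (glusterfsd) and other glusterfs processes.
    # These process names are assumptions; check `ps` on the actual host.
    commands.append(["pkill", "glusterfsd"])
    commands.append(["pkill", "glusterfs"])
    return commands


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    # "vm01" and "vm02" are hypothetical VM names.
    for cmd in maintenance_commands(["vm01", "vm02"], "KVM08"):
        print(shlex.join(cmd))
```

Printing the commands rather than running them keeps the sketch safe to try on a machine that has neither libvirt nor Gluster installed.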

At this point, the VMs die. The FUSE mount recovers and remains attached to the volume via the other server, but the VM disk images are not fully synced.

This causes the VMs to go into a read-only file system state, then kernel panic. Rebooting or restarting the VMs just causes more kernel panics. This effectively brings down the two-node cluster.

Bringing the gluster node, bricks, etc. back up prompts a self-heal. Once the self-heal has completed, the VMs can boot normally.
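
Whether a self-heal has finished can be checked from the command line before the VMs are restarted. A minimal sketch, assuming that `gluster volume heal <VOLNAME> info` prints one `Number of entries: N` line per brick; the exact output format may differ between Gluster versions, so treat the parsing as an assumption to verify.

```python
import re
import subprocess


def pending_heal_entries(heal_info_output):
    """Sum the per-brick 'Number of entries: N' counts in heal info output."""
    counts = re.findall(
        r"^Number of entries:\s*(\d+)", heal_info_output, re.MULTILINE
    )
    return sum(int(n) for n in counts)


def heal_info(volume):
    """Run `gluster volume heal <volume> info` and return its text output."""
    result = subprocess.run(
        ["gluster", "volume", "heal", volume, "info"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

With a hypothetical volume named vmstore, pending_heal_entries(heal_info("vmstore")) returning 0 would suggest that no entries are still waiting to be healed.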

Question: Is there a better way to accomplish HA with live, running VM images? The goal is to be able to bring down either server in the pair and perform maintenance without interrupting the VMs.

I assume my shutdown process is flawed, but I haven't been able to find a better one.

Any suggestions are welcome.

-- 
-Vince Loschiavo

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-users