On 10/22/2013 02:42 AM, José A. Lausuch Sales wrote:
> Hi,
>
> we are currently evaluating GlusterFS for a production environment.
> Our focus is on the high-availability features of GlusterFS. However,
> our tests have not worked out well. Hence I am seeking feedback from you.
>
> In our planned production environment, Gluster should provide shared
> storage for VM disk images. So, our very basic initial test setup is
> as follows:
>
> We are using two servers, each providing a single brick of a
> replicated gluster volume (Gluster 3.4.1). A third server runs a
> test VM (Ubuntu 13.04 on QEMU 1.3.0 and libvirt 1.0.3) which uses a
> disk image file stored on the gluster volume as a block device
> (/dev/vdb). For testing purposes, the root file system of this VM
> (/dev/vda) is a disk image NOT stored on the gluster volume.
>
> To test the high-availability features of gluster under load, we run
> FIO inside the VM directly on the vdb block device (see configuration
> below). Up to now, we have tested reading only. The test procedure is
> as follows:
>
> 1. We start FIO inside the VM and observe by means of "top" which of
> the two servers receives the read requests (i.e., increased CPU load
> of the glusterfsd process). Let's say that Server1 shows the CPU load
> from glusterfsd.
>
> 2. While FIO is running, we take down the network of Server1 and
> observe whether Server2 takes over.

You're bringing server1 down by taking down the NIC (assuming from #5).
This does take down the connection, but it does so without closing the
TCP connection. This does represent a worst-case scenario, though; see
http://joejulian.name/blog/keeping-your-vms-from-going-read-only-when-encountering-a-ping-timeout-in-glusterfs/

> 3. This "fail over" works (almost 100% of the time): we see the CPU
> load from glusterfsd on Server2. As expected, Server1 does not have
> any load because it is "offline".
>
> 4. After a while we bring up the NIC on Server1 again. In this step we
> realized that the expected behavior is that when bringing up this NIC,
> this server should take over again (something like active-passive
> behavior), but this happens only 5-10% of the time. The CPU load is
> still on Server2.

I'm not sure I would have that expectation. The second server will have
taken over the open FD, and the reads should keep coming from there.
The reads for a given fd come from whichever server is first to respond
to the lookup().

> 5. After some time, we bring down the NIC on Server2, expecting that
> Server1 takes over. This second "fail over" crashes. The VM complains
> about I/O errors which can only be resolved by restarting the VM and
> sometimes even by removing and creating the volume again.
>
> After some tests, we realized that if we restart the glusterd daemon
> (/etc/init.d/glusterd restart) on Server1 after step 3 or before step
> 4, Server1 takes over automatically without bringing down Server2 or
> anything like that.

Check the logs for glusterd
(/var/log/glusterfs/etc-glusterfs-glusterd.vol.log) for clues. Perhaps
the /way/ you're taking down the NIC is exposing some bug. Perhaps
instead of taking it down, use iptables or just killall glusterfsd.
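For example, something along these lines (a rough sketch only; "yourvol" is a
placeholder for your volume name, and double-check the brick ports with
"gluster volume status" before blocking them):

# simulate the unplugged cable without touching the NIC: silently drop
# gluster traffic on Server1 (24007 is glusterd, 49152+ are the 3.4 brick ports)
iptables -I INPUT -p tcp --dport 24007 -j DROP
iptables -I INPUT -p tcp --dport 49152:49251 -j DROP

# undo it when you want the server back:
iptables -D INPUT -p tcp --dport 24007 -j DROP
iptables -D INPUT -p tcp --dport 49152:49251 -j DROP

# or simulate a graceful failure instead; killing the brick closes its
# TCP connections and the client fails over right away:
killall glusterfsd

# optional: shorten the hang a client sees after a hard failure
# (network.ping-timeout defaults to 42 seconds):
gluster volume set yourvol network.ping-timeout 10

The iptables variant should behave just like the dead NIC (the client hangs
for ping-timeout before reads resume on the other brick), while the killall
variant fails over almost immediately.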
> We tested this using the normal FUSE mount and libgfapi. If using
> FUSE, the local mount sometimes becomes unavailable (ls shows no more
> files) if the failover fails.
>
> We have a few fundamental questions in this regard:
>
> i) Is Gluster supposed to handle such a scenario or are we making
> wrong assumptions? Because the only solution we found is to restart
> the daemon when a network outage occurs, but this is not acceptable in
> a real scenario with VMs running real applications.

I host my (raw and qcow2) VM images on a gluster volume. Since my
servers are not expected to hard-crash a lot, I take them down for
maintenance (kernel updates and such) gracefully, killing the processes
first (roughly the sequence sketched in the P.S. below). This closes
the TCP connections and everything just keeps humming along.

> ii) What is the recommended configuration in terms of caching (QEMU:
> cache=none/writethrough/writeback) and direct I/O (FIO and Gluster) to
> maximize the reliability of the failover process? We varied the
> parameters but could not find a working configuration. Do these
> parameters have an impact at all?

To the best of my knowledge, none of those should affect reliability.

> FIO test specification:
>
> [global]
> direct=1
> ioengine=libaio
> iodepth=4
> filename=/dev/vdb
> runtime=300
> numjobs=1
>
> [maxthroughput]
> rw=read
> bs=16k
>
> VM configuration:
>
> <domain type='kvm' id='6'>
>   <name>testvm</name>
>   <uuid>93877c03-605b-ed67-1ab2-2ba16b5fb6b5</uuid>
>   <memory unit='KiB'>2097152</memory>
>   <currentMemory unit='KiB'>2097152</currentMemory>
>   <vcpu placement='static'>1</vcpu>
>   <os>
>     <type arch='x86_64' machine='pc-1.1'>hvm</type>
>     <boot dev='hd'/>
>   </os>
>   <features>
>     <acpi/>
>     <apic/>
>     <pae/>
>   </features>
>   <clock offset='utc'/>
>   <on_poweroff>destroy</on_poweroff>
>   <on_reboot>restart</on_reboot>
>   <on_crash>restart</on_crash>
>   <devices>
>     <emulator>/usr/bin/kvm</emulator>
>     <disk type='block' device='disk'>
>       <driver name='qemu' type='raw' cache='writethrough'/>
>       <source dev='/mnt/local/io-perf.img'/>
>       <target dev='vda' bus='virtio'/>
>       <alias name='virtio-disk0'/>
>       <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
>     </disk>
>     <disk type='block' device='disk'>
>       <driver name='qemu' type='raw' cache='writethrough'/>
>       <source dev='/mnt/shared/io-perf-testdisk.img'/>
>       <target dev='vdb' bus='virtio'/>
>       <alias name='virtio-disk1'/>
>       <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
>     </disk>
>     <controller type='usb' index='0'>
>       <alias name='usb0'/>
>       <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
>     </controller>
>     <interface type='network'>
>       <mac address='52:54:00:36:5f:dd'/>
>       <source network='default'/>
>       <target dev='vnet0'/>
>       <model type='virtio'/>
>       <alias name='net0'/>
>       <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
>     </interface>
>     <input type='mouse' bus='ps2'/>
>     <graphics type='vnc' port='5900' autoport='yes' listen='127.0.0.1'>
>       <listen type='address' address='127.0.0.1'/>
>     </graphics>
>     <video>
>       <model type='cirrus' vram='9216' heads='1'/>
>       <alias name='video0'/>
>       <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
>     </video>
>     <memballoon model='virtio'>
>       <alias name='balloon0'/>
>       <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
>     </memballoon>
>   </devices>
>   <seclabel type='none'/>
> </domain>
>
> Thank you very much in advance,
> Jose Lausuch
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://supercolony.gluster.org/mailman/listinfo/gluster-users
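P.S. For what it's worth, my maintenance routine is roughly the following
sketch (service names and "yourvol" are placeholders; adjust for your distro
and volume):

# on the server going down: kill the brick first so its TCP connections are
# closed cleanly and clients simply keep running against the other replica
killall glusterfsd
service glusterd stop

# ...kernel update, reboot, whatever...

# when it's back up, glusterd restarts the brick; then watch self-heal
# catch the returning brick up with the changes it missed
service glusterd start
gluster volume heal yourvol info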