On Thu, Jun 07, 2012 at 02:36:26PM +0100, Brian Candler wrote:
> I'm interested in understanding this, especially the split-brain scenarios
> (better to understand them *before* you're stuck in a problem :-)
>
> BTW I'm in the process of building a 2-node 3.3 test cluster right now.

FYI, I have got KVM working with a glusterfs 3.3.0 replicated volume as the
image store. There are two nodes, both running as glusterfs storage and as
KVM hosts.

I built a 10.04 ubuntu image using vmbuilder, stored on the replicated
glusterfs volume:

vmbuilder kvm ubuntu --hostname lucidtest --mem 512 --debug --rootsize 20480 --dest /gluster/safe/images/lucidtest

I was able to fire it up (virsh start lucidtest), ssh into it, and then
live-migrate it to the other host:

brian@dev-storage1:~$ virsh migrate --live lucidtest qemu+ssh://dev-storage2/system
brian@dev-storage2's password:
brian@dev-storage1:~$ virsh list
 Id Name                 State
----------------------------------

brian@dev-storage1:~$

And I live-migrated it back again, all without the ssh session being
interrupted.

I then rebooted the second storage server. While it was rebooting I did some
work in the VM which grew its image. When the second storage server came
back, it resynchronised the image immediately and automatically. Here are
the relevant entries from /var/log/glusterfs/glustershd.log on the first
(non-rebooted) machine:

[2012-06-08 17:08:40.817893] E [socket.c:1715:socket_connect_finish] 0-safe-client-1: connection to 10.0.1.2:24009 failed (Connection timed out)
[2012-06-08 17:09:10.698272] I [client-handshake.c:1636:select_server_supported_programs] 0-safe-client-1: Using Program GlusterFS 3.3.0, Num (1298437), Version (330)
[2012-06-08 17:09:10.700197] I [client-handshake.c:1433:client_setvolume_cbk] 0-safe-client-1: Connected to 10.0.1.2:24009, attached to remote volume '/disk/storage2/safe'.
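As an aside, for the split-brain question in the quoted mail: 3.3's self-heal
state can also be inspected from the gluster CLI rather than the logs. A
sketch, assuming the volume is named 'safe' as in the paths above:

```shell
# Entries the self-heal daemon still has pending on this volume
gluster volume heal safe info

# Entries 3.3 has flagged as split-brain (hopefully none)
gluster volume heal safe info split-brain

# Trigger a full heal by hand rather than waiting for the periodic crawl
gluster volume heal safe full
```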
[2012-06-08 17:09:10.700234] I [client-handshake.c:1445:client_setvolume_cbk] 0-safe-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2012-06-08 17:09:10.701901] I [client-handshake.c:453:client_set_lk_version_cbk] 0-safe-client-1: Server lk version = 1
[2012-06-08 17:09:14.699571] I [afr-common.c:1189:afr_detect_self_heal_by_iatt] 0-safe-replicate-0: size differs for <gfid:1f080b06-46f1-468e-b21a-12bf4a7c81ff>
[2012-06-08 17:09:14.699616] I [afr-common.c:1340:afr_launch_self_heal] 0-safe-replicate-0: background data self-heal triggered. path: <gfid:1f080b06-46f1-468e-b21a-12bf4a7c81ff>, reason: lookup detected pending operations
[2012-06-08 17:09:18.230855] I [afr-self-heal-algorithm.c:122:sh_loop_driver_done] 0-safe-replicate-0: diff self-heal on <gfid:1f080b06-46f1-468e-b21a-12bf4a7c81ff>: completed. (19 blocks of 3299 were different (0.58%))
[2012-06-08 17:09:18.232520] I [afr-self-heal-common.c:2159:afr_self_heal_completion_cbk] 0-safe-replicate-0: background data self-heal completed on <gfid:1f080b06-46f1-468e-b21a-12bf4a7c81ff>

So at first glance this is extremely impressive. It's also very new and
shiny, and I wonder how many edge cases remain to be debugged in live use,
but I can't deny that it's very neat indeed!

Performance-wise:

(1) On the storage/VM host, which has the replicated volume mounted via
FUSE:

root@dev-storage1:~# dd if=/dev/zero of=/gluster/safe/test.zeros bs=1024k count=500
500+0 records in
500+0 records out
524288000 bytes (524 MB) copied, 2.7086 s, 194 MB/s

(The bricks have a 12-disk md RAID10 array, far-2 layout, and there's
probably scope for some performance tweaking here.)

(2) However, from within the VM guest, performance was very poor (2.2MB/s).
I tried my usual tuning options:

    <driver name='qemu' type='qcow2' io='native' cache='none'/>
    ...
    <target dev='vda' bus='virtio'/>
    <!-- delete <address type='drive' controller='0' bus='0' unit='0'/> -->

but glusterfs objected to the cache='none' option (possibly because this
opens the image file with O_DIRECT?):

# virsh start lucidtest
error: Failed to start domain lucidtest
error: internal error process exited while connecting to monitor: char device redirected to /dev/pts/0
kvm: -drive file=/gluster/safe/images/lucidtest/tmpaJqTD9.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=none,aio=native: could not open disk image /gluster/safe/images/lucidtest/tmpaJqTD9.qcow2: Invalid argument

The VM boots with io='native' and bus='virtio' (without cache='none'), but
performance is still very poor:

ubuntu@lucidtest:~$ dd if=/dev/zero of=/var/tmp/test.zeros bs=1024k count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 17.4095 s, 6.0 MB/s

This will need some further work.

The guest is lucid (10.04) only because for some reason I cannot get a 12.04
image built with vmbuilder to work (it spins at 100% CPU). This is not
related to glusterfs and is something I need to debug separately. Maybe a
12.04 guest will also run better.

Anyway, just thought it was worth a mention. Keep up the good work, guys!

Regards,

Brian.