Hi,

== Background ==

We are setting up GlusterFS on a compute cluster. Each node has two disk
partitions, /media/gluster1 and /media/gluster2, which are used for the
cluster storage. We are currently using builds from TLA (671 as of now).

I have a script that generates GlusterFS client configurations, creating
AFR instances over pairs of nodes in the cluster. A snippet from our
current configuration:

# Client definitions
volume client-cn2-1
  type protocol/client
  option transport-type tcp/client
  option remote-host cn2
  option remote-subvolume brick1
end-volume

volume client-cn2-2
  type protocol/client
  option transport-type tcp/client
  option remote-host cn2
  option remote-subvolume brick2
end-volume

volume client-cn3-1
  type protocol/client
  option transport-type tcp/client
  option remote-host cn3
  option remote-subvolume brick1
end-volume

volume client-cn3-2
  type protocol/client
  option transport-type tcp/client
  option remote-host cn3
  option remote-subvolume brick2
end-volume

### snip - you get the idea ###

# Generated AFR volumes
volume afr-cn2-cn3
  type cluster/afr
  subvolumes client-cn2-1 client-cn3-2
end-volume

volume afr-cn3-cn4
  type cluster/afr
  subvolumes client-cn3-1 client-cn4-2
end-volume

### and so on ###

volume unify
  type cluster/unify
  option scheduler rr
  option namespace namespace
  subvolumes afr-cn2-cn3 afr-cn3-cn4 afr-cn4-cn5 ...
end-volume

== Self healing program ==

I wrote a quick C program (medic) that uses the nftw function to open
every file in a directory tree and readlink every symlink. This seems
effective at forcing AFR to heal; a sketch of the idea is attached at
the end of this mail.

== Playing with AFR ==

We have a test cluster of 6 nodes set up. In this setup, cluster node 2
is involved in 'afr-cn2-cn3' and 'afr-cn7-cn2'.

I copy a large directory tree (such as /usr) onto the cluster
filesystem, then 'cripple' node cn2 by deleting the data from its
backends and restarting glusterfsd on that system, to emulate the
system going offline or losing data. (At this point, all the data is
still available on the filesystem.)

Running medic over the filesystem mount will now cause the data to be
copied back onto cn2's appropriate volumes, and all is well.

Opening every file on the filesystem seems a stupid waste of time if
you know which volumes have gone down (and when you have over 20 TB in
hundreds of thousands of files, that is a considerable waste of time),
so I looked into mounting parts of the client translator tree on
separate mount points and running medic over those.

# mkdir /tmp/glfs
# generate_client_conf > /tmp/glusterfs.vol
# glusterfs -f /tmp/glusterfs.vol -n afr-cn2-cn3 /tmp/glfs
# ls /tmp/glfs
home/
[Should be: home/  usr/]

A `cd /tmp/glfs/usr/` will succeed and usr/ will be self-healed, but its
contents will not be. Likewise, a `cat /tmp/glfs/usr/include/stdio.h`
will output the contents of the file and cause it to be self-healed.

Changing the order of the subvolumes of the 'afr-cn2-cn3' volume so that
the up-to-date client comes first causes the directory to be listed
correctly. This seems to me like a minor-ish bug in cluster/afr's
readdir functionality.

-- Sam Douglas
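
P.S. For anyone curious, the medic idea is roughly the following. This
is a minimal sketch rather than the exact program I run; it assumes the
tree root is passed as argv[1] and simply skips entries it cannot open.

/* medic-style self-heal walker (sketch).
 * Walks a directory tree with nftw(), open()s every regular file and
 * readlink()s every symlink, so that cluster/afr notices each entry
 * and self-heals it onto the out-of-date subvolume.
 */
#define _XOPEN_SOURCE 500   /* for nftw() */
#include <errno.h>
#include <fcntl.h>
#include <ftw.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static int visit(const char *path, const struct stat *sb,
                 int typeflag, struct FTW *ftwbuf)
{
    char buf[4096];

    (void)sb;
    (void)ftwbuf;

    if (typeflag == FTW_F) {
        /* Opening the file is enough to trigger AFR's self-heal. */
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            fprintf(stderr, "open %s: %s\n", path, strerror(errno));
        else
            close(fd);
    } else if (typeflag == FTW_SL) {
        /* readlink() the symlink so it is recreated on the bad subvolume. */
        if (readlink(path, buf, sizeof(buf) - 1) < 0)
            fprintf(stderr, "readlink %s: %s\n", path, strerror(errno));
    }

    return 0;   /* keep walking even if this entry failed */
}

int main(int argc, char *argv[])
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <directory>\n", argv[0]);
        return 1;
    }

    /* FTW_PHYS: do not follow symlinks, so they are reported as FTW_SL. */
    if (nftw(argv[1], visit, 64, FTW_PHYS) < 0) {
        perror("nftw");
        return 1;
    }

    return 0;
}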