I am having the same issue. I'm working on a diskless node cluster and figured the issue was related to that, since AFR seems to fail over nicely for everyone else... But it seems I am not alone, so what can I do to help troubleshoot?

I have two servers exporting one brick each, and a client mounting them both with AFR and no unify. Transport timeout settings don't seem to make a difference: the client just hangs if I power off a server or just stop glusterfsd. Nothing is logged on the server side.

I'll use a USB thumb drive for client-side logging, since any logs in the ramdisk obviously disappear after the reboot that fixes the hang... If I get any insight from this I'll report it ASAP.

Thanks,
Chris

> Real simple, two bricks on ext3 with user_xattr.
> It is storage for mailstore. The issue that I've been
> battling is that when one of the machines crashes, the other
> machine loses the mailstore with either the transport
> endpoint disconnect or the glusterfs filesystem is hung. You
> cannot do anything with it. 'ls' it, 'df' it, ... nothing.
> If I try to kill glusterfs/d it just gives me /glusterfsmount
> busy. The only recovery at this point is to reboot the good
> machine as well as the failed machine. So needing to do that
> is sort of defeating my purpose of creating this array. Is
> there no way that glusterfs can recover from the crash such
> that things are still good on the other bricks and mounts on
> other machines?
>
> Thanks,
> Gerry