RE: AFR: machine crash hangs other mounts or transport endpoint not connected

"Christopher Hawkins" <chawkins@xxxxxxxxxxxxxxxxxxxx> · Fri, 25 Apr 2008 21:31:00 -0400

I am having the same issue. I'm working on a diskless
node cluster and figured the issue was related to that
since AFR seems to fail over nicely for everyone else...
But it seems I am not alone, so what can I do to help troubleshoot?

I have two servers exporting a brick each, and a client mounting
them both with AFR and no unify. Transport timeout settings
don't seem to make a difference: the client just hangs if I power off
a server or simply stop glusterfsd on it. Nothing is logged on the
server side.
I'll use a USB thumb drive for client-side logging, since any logs in
the ramdisk obviously disappear after the reboot that fixes the hang...
If I get any insight from this I'll report it ASAP.
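
For the client-side logging, something along these lines might help
tell a hang apart from a plain error. This is only a rough sketch and
not part of GlusterFS: it assumes Python 3 is available on the client,
and the function names, the mount point /mnt/glusterfs and the log
path /mnt/usb/probe.log are made up for illustration. The stat runs in
a child process so that a hung mount cannot block the logger itself.

```python
import subprocess
import sys
import time


def probe_mount(path, timeout_s=5.0):
    """Stat `path` in a child process; return 'ok', 'error' or 'hung'."""
    # Hypothetical helper: the child does the stat, so a blocked
    # filesystem call stalls the child and not this script.
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        rc = child.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Best effort only: we do not wait again, because a child stuck
        # in the kernel may not go away until the mount recovers.
        child.kill()
        return "hung"
    # Any OSError in the child (for example "Transport endpoint is not
    # connected") makes it exit non-zero, which is reported as 'error'.
    return "ok" if rc == 0 else "error"


def log_probe(path, log_file, timeout_s=5.0):
    """Append one timestamped probe result to `log_file` and return it."""
    status = probe_mount(path, timeout_s)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(log_file, "a") as fh:
        fh.write(f"{stamp} {path} {status}\n")
    return status
```

Calling log_probe("/mnt/glusterfs", "/mnt/usb/probe.log") every few
seconds while stopping glusterfsd on one server should leave a record
on the thumb drive of when the mount went from 'ok' to 'error' or
'hung', which survives the reboot.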

Thanks,
Chris

> Real simple: two bricks on ext3 with user_xattr.
> It is storage for a mailstore. The issue that I've been
> battling is that when one of the machines crashes, the other
> machine loses the mailstore: either I get a transport
> endpoint disconnect or the glusterfs filesystem hangs. You
> cannot do anything with it. 'ls' it, 'df' it, ... nothing.
> If I try to kill glusterfs/d it just gives me /glusterfsmount
> busy. The only recovery at this point is to reboot the good
> machine as well as the failed machine. Needing to do that
> sort of defeats my purpose in creating this array. Is
> there no way that glusterfs can recover from the crash such
> that things are still good on the other bricks and mounts on
> other machines?
> 
> Thanks,
> Gerry