Re: NFS client hang on attempt to do async blocking posix lock enqueue

Jeff Layton <jlayton@xxxxxxxxxx> · Fri, 8 Feb 2008 15:54:14 -0500

On Fri, 8 Feb 2008 13:49:01 -0500 (EST)
"david m. richter" <richterd@xxxxxxxxxxxxxx> wrote:

> On Fri, 8 Feb 2008, J. Bruce Fields wrote:
> 
> > On Fri, Feb 08, 2008 at 07:15:02AM -0500, Jeff Layton wrote:
> > > On Thu, 7 Feb 2008 18:26:18 -0500
> > > "J. Bruce Fields" <bfields@xxxxxxxxxxxx> wrote:
> > > 
> > > > On Sun, Jan 20, 2008 at 09:58:59AM -0500, Oleg Drokin wrote:
> > > > > Hello!
> > > > >
> > > > > On Jan 18, 2008, at 6:07 PM, J. Bruce Fields wrote:
> > > > >
> > > > >> On Thu, Nov 29, 2007 at 02:41:57PM -0800, Marc Eshel wrote:
> > > > > >>> The problem seems to be that the client and server are on the
> > > > > >>> same machine. This test works fine, with or without an underlying
> > > > > >>> fs that supports locking, when the client and the server are on
> > > > > >>> different machines. Like you said, the server is trying to send
> > > > > >>> the grant message to the client, but for some reason it fails
> > > > > >>> when the client is on the same machine.
> > > > >> That *shouldn't* make a difference, so we need to take another look at
> > > > >> this--Oleg, this problem is still unfixed, right?
> > > > >
> > > > > > Yes, I just pulled your latest nfs tree and I can still reproduce the
> > > > > > problem.
> > > > 
> > > > OK, we have finally reproduced this problem here, and David's working on
> > > > debugging.  It does indeed seem to only be reproducible with client and
> > > > server on the same machine.  Thanks for the report....
> > > > 
> > > > --b.
> > > 
> > > It might be worth testing this both with and without the patchset I
> > > posted to linux-nfs recently to take care of the lockd hang. If
> > > lockd is stuck trying to rpc_ping itself then it probably would hang
> > > like this, wouldn't it?
> > 
> > Of course!  Yes, that fits.
> > 
> > --b.
> 
> 	right on, jeff, good catch and thanks for directing my attention 
> to your patches.
> 

Excellent! Glad that took care of it...

> 	i applied them on top of 2.6.23.1 and tested them on a cluster 
> exporting GFS2 over NFS, using oleg's reproducer code.  your patches fix 
> that lockd hang.
> 
> 	in a bit more detail, oleg's reproducer basically gets a 
> whole-file read lock, tests the lock, upgrades to a whole-file exclusive 
> lock, tests the lock, then unlocks.  the problem was that when getting 
> that exclusive lock things would hang.  this only happened when the client 
> and server were on the same machine, and i could reproduce it with NFS 
> exporting GFS2 but not NFS exporting EXT3.
> 
> 

Interesting. It's not clear to me why the underlying filesystem would make
any difference there. Though now that I look, it looks like fl_grant
really only gets called from dlm code, and that queues up the block for
an immediate grant callback attempt. So perhaps that's the reason.

-- 
Jeff Layton <jlayton@xxxxxxxxxx>