Re: On setting a lease across a cluster

"david m. richter" <richterd@xxxxxxxxxxxxxx> · Fri, 4 Jan 2008 13:55:36 -0500 (EST)

On Fri, 4 Jan 2008, Matthew Wilcox wrote:

> 
> Hi Bruce,
> 
> The current implementation of vfs_setlease/generic_setlease/etc is a
> bit quirky.  I've been thinking it over for the past couple of days,
> and I think we need to refactor it to work sensibly.
> 
> As you may have noticed, I've been mulling over getting rid of the
> BKL in fs/locks.c and the lease interface is particularly problematic.
> Here's one reason why:
> 
> fcntl_setlease
>    lock_kernel
>    vfs_setlease
>       lock_kernel
>       generic_setlease
>       unlock_kernel
>    fasync_helper
>    unlock_kernel
> 
> This is perfectly legitimate for the BKL of course, but other spinlocks
> can't be acquired recursively in this way.  At first I thought I'd just
> push the call to generic_setlease into __vfs_setlease and have two ways
> of calling it:
> 
> fcntl_setlease
>    lock_kernel
>    __vfs_setlease
>    fasync_helper
>    unlock_kernel
> 
> vfs_setlease
>   lock_kernel
>   __vfs_setlease
>   unlock_kernel
> 
> But generic_setlease allocates memory (GFP_KERNEL) under the lock, so
> that's bad for converting to spnlocks later.  Then I thought I'd move
> the spinlock acquisition into generic_setlease().  But I can't do that
> either as fcntl_setlease() needs to hold the spinlock around the call
> to generic_setlease() and fasync_helper().
> 
> Then I started to wonder about the current split of functionality between
> fcntl_setlease, vfs_setlease and generic_setlease.  The check for no
> other process having this file open should be common to all filesystems.
> And it should be honoured by NFS (it shouldn't hand out a delegation if
> a local user has the file open), so it needs to be in vfs_setlease().
> 
> Now I'm worrying about the mechanics of calling out to a filesystem to
> perform a lock.  Obviously, a network filesystem is going to have to
> sleep waiting for a response from the fileserver, so that precludes the
> use of a spinlock held over the call to f_op->setlease.  I'm not keen on
> the idea of EXPORT_SYMBOL_GPL() a spinlock -- verifying locking is quite
> hard enough without worrying what a filesystem might be doing with it.
> 
> So I think we need to refactor the interface, and I'd like to hear your
> thoughts on my ideas of how to handle this.
> 
> First, have clients of vfs_setlease call lease_alloc() and pass it in,
> rather than allocate it on the stack and have this horrible interface
> where we may pass back an allocated lock.  This allows us to not allocate
> memory (hence sleep) during generic_setlease().
> 
> Second, move some of the functionality of generic_setlease() to
> vfs_setlease(), as mentioned above.
> 
> Third, change the purpose of f_op->setlease.  We can rename it if you
> like to catch any out of tree users.  I'd propose using it like this:

	fwiw, i've done some work on extending the lease subsystem to help 
support the full range of requirements for NFSv4 file and directory 
delegations (e.g., breaking a lease when unlinking a file) and we ended up 
actually doing most of what you've just suggested here, which i take to be 
a good sign.

	most of my refactoring came out of trying to simplify locking and 
avoid holding locks too long (rather than specifically focusing on 
cluster-oriented stuff, but the goals dovetail) and your work on getting 
the BKL out of locks.c entirely is something i really like and look 
forward to.

	thanks,

	d
	.

> vfs_setlease()
>    if (f_op->setlease())
>       res = f_op->setlease()
>       if (res)
>          return res;
>    lock_kernel()
>    generic_setlease()
>    unlock_kernel()
> 
> fcntl_setlease
>    if (f_op->setlease())
>       res = f_op->setlease()
>       if (res)
>          return res;
>    lock_kernel
>    generic_setlease()
>    ... fasync ...
>    unlock_kernel
> 
> So 'f_op->setlease' now means "Try to get a lease from the fileserver".
> We can optimise this a bit to not even call setlease if, say, we already
> have a read lease and we're trying to get a second read lease.  But we
> have to record our local leases (that way they'll show up in /proc/locks).
> 
> I think the combination of these three ideas gives us a sane interface
> to the various setlease functions, and let us convert from lock_kernel()
> to spin_lock() later.  Have I missed something?  I don't often think
> about the needs of cluster filesystems, so I may misunderstand how they
> need this operation to work.
> 
> At some point, we need to revisit the logic for 'is this file open
> by another process' -- it's clearly buggy since it doesn't hold the
> inode_lock while checking i_count, so it could race against an __iget().
> 
> -- 
> Intel are signing my paycheques ... these opinions are still mine
> "Bill, look, we understand that you're interested in selling us this
> operating system, but compare it to ours.  We can't possibly take such
> a retrograde step."
> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html