Re: On setting a lease across a cluster

"J. Bruce Fields" <bfields@xxxxxxxxxxxx> · Fri, 4 Jan 2008 14:47:18 -0500

On Fri, Jan 04, 2008 at 01:55:36PM -0500, david m. richter wrote:
> On Fri, 4 Jan 2008, Matthew Wilcox wrote:
> 
> > 
> > Hi Bruce,
> > 
> > The current implementation of vfs_setlease/generic_setlease/etc is a
> > bit quirky.  I've been thinking it over for the past couple of days,
> > and I think we need to refactor it to work sensibly.

OK, great.

> > 
> > As you may have noticed, I've been mulling over getting rid of the
> > BKL in fs/locks.c and the lease interface is particularly problematic.
> > Here's one reason why:
> > 
> > fcntl_setlease
> >    lock_kernel
> >    vfs_setlease
> >       lock_kernel
> >       generic_setlease
> >       unlock_kernel
> >    fasync_helper
> >    unlock_kernel
> > 
> > This is perfectly legitimate for the BKL of course, but other spinlocks
> > can't be acquired recursively in this way.  At first I thought I'd just
> > push the call to generic_setlease into __vfs_setlease and have two ways
> > of calling it:
> > 
> > fcntl_setlease
> >    lock_kernel
> >    __vfs_setlease
> >    fasync_helper
> >    unlock_kernel
> > 
> > vfs_setlease
> >   lock_kernel
> >   __vfs_setlease
> >   unlock_kernel
> > 
> > But generic_setlease allocates memory (GFP_KERNEL) under the lock, so
> > that's bad for converting to spnlocks later.  Then I thought I'd move
> > the spinlock acquisition into generic_setlease().  But I can't do that
> > either as fcntl_setlease() needs to hold the spinlock around the call
> > to generic_setlease() and fasync_helper().
> > 
> > Then I started to wonder about the current split of functionality between
> > fcntl_setlease, vfs_setlease and generic_setlease.  The check for no
> > other process having this file open should be common to all filesystems.
> > And it should be honoured by NFS (it shouldn't hand out a delegation if
> > a local user has the file open), so it needs to be in vfs_setlease().

Of course, in the case of a (still unfortunately theoretical) cluster
filesystem implementation, those checks will be insufficient on their
own.

Though the write-lease check especially:

	if ((arg == F_WRLCK)
            && ((atomic_read(&dentry->d_count) > 1)
	                    || (atomic_read(&inode->i_count) > 1)))
		goto out;

seems like a terrible hack, and gives some annoying false positives:

	http://www.ussg.iu.edu/hypermail/linux/kernel/0711.1/2255.html

> > Now I'm worrying about the mechanics of calling out to a filesystem to
> > perform a lock.  Obviously, a network filesystem is going to have to
> > sleep waiting for a response from the fileserver, so that precludes the
> > use of a spinlock held over the call to f_op->setlease.  I'm not keen on
> > the idea of EXPORT_SYMBOL_GPL() a spinlock -- verifying locking is quite
> > hard enough without worrying what a filesystem might be doing with it.

I'm confused as to what the locks are actually protecting.

If they're mainly just protecting the inode's lock list, for example,
then we should let the filesystem call back to helpers in locks.c for
any manipulations of that list and do the locking in those helpers.

> > 
> > So I think we need to refactor the interface, and I'd like to hear your
> > thoughts on my ideas of how to handle this.
> > 
> > First, have clients of vfs_setlease call lease_alloc() and pass it in,
> > rather than allocate it on the stack and have this horrible interface
> > where we may pass back an allocated lock.  This allows us to not allocate
> > memory (hence sleep) during generic_setlease().

Makes sense.

> >
> > Second, move some of the functionality of generic_setlease() to
> > vfs_setlease(), as mentioned above.
> > 
> > Third, change the purpose of f_op->setlease.  We can rename it if you
> > like to catch any out of tree users.  I'd propose using it like this:
> 
> 	fwiw, i've done some work on extending the lease subsystem to help 
> support the full range of requirements for NFSv4 file and directory 
> delegations (e.g., breaking a lease when unlinking a file) and we ended up 
> actually doing most of what you've just suggested here, which i take to be 
> a good sign.

Let's post those patches, even if they're incomplete.

> 	most of my refactoring came out of trying to simplify locking and 
> avoid holding locks too long (rather than specifically focusing on 
> cluster-oriented stuff, but the goals dovetail) and your work on getting 
> the BKL out of locks.c entirely is something i really like and look 
> forward to.
> 
> 	thanks,
> 
> 	d
> 	.
>  
> > vfs_setlease()
> >    if (f_op->setlease())
> >       res = f_op->setlease()
> >       if (res)
> >          return res;
> >    lock_kernel()
> >    generic_setlease()
> >    unlock_kernel()

Why can't the filesystem call into generic_setlease() on its own?

Christoph Hellwig has suggested removing the second case and just doing

	.setlease = generic_setlease

for all filesystems that should support leases; see the final patch in

	git://linux-nfs.org/~bfields/linux.git server-cluster-lease-api

> > 
> > fcntl_setlease
> >    if (f_op->setlease())
> >       res = f_op->setlease()
> >       if (res)
> >          return res;
> >    lock_kernel
> >    generic_setlease()
> >    ... fasync ...
> >    unlock_kernel
> > 
> > So 'f_op->setlease' now means "Try to get a lease from the fileserver".
> > We can optimise this a bit to not even call setlease if, say, we already
> > have a read lease and we're trying to get a second read lease.  But we
> > have to record our local leases (that way they'll show up in /proc/locks).
> > 
> > I think the combination of these three ideas gives us a sane interface
> > to the various setlease functions, and let us convert from lock_kernel()
> > to spin_lock() later.  Have I missed something?  I don't often think
> > about the needs of cluster filesystems, so I may misunderstand how they
> > need this operation to work.
> > 
> > At some point, we need to revisit the logic for 'is this file open
> > by another process' -- it's clearly buggy since it doesn't hold the
> > inode_lock while checking i_count, so it could race against an __iget().

OK, that's something else I hadn't noticed, thanks.

--b.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html