On setting a lease across a cluster

Matthew Wilcox <matthew@xxxxxx> · Fri, 4 Jan 2008 11:14:15 -0700

Hi Bruce,

The current implementation of vfs_setlease/generic_setlease/etc is a
bit quirky.  I've been thinking it over for the past couple of days,
and I think we need to refactor it to work sensibly.

As you may have noticed, I've been mulling over getting rid of the
BKL in fs/locks.c and the lease interface is particularly problematic.
Here's one reason why:

fcntl_setlease
   lock_kernel
   vfs_setlease
      lock_kernel
      generic_setlease
      unlock_kernel
   fasync_helper
   unlock_kernel

This is perfectly legitimate for the BKL of course, but other spinlocks
can't be acquired recursively in this way.  At first I thought I'd just
push the call to generic_setlease into __vfs_setlease and have two ways
of calling it:

fcntl_setlease
   lock_kernel
   __vfs_setlease
   fasync_helper
   unlock_kernel

vfs_setlease
  lock_kernel
  __vfs_setlease
  unlock_kernel

But generic_setlease allocates memory (GFP_KERNEL) under the lock, so
that's bad for converting to spnlocks later.  Then I thought I'd move
the spinlock acquisition into generic_setlease().  But I can't do that
either as fcntl_setlease() needs to hold the spinlock around the call
to generic_setlease() and fasync_helper().

Then I started to wonder about the current split of functionality between
fcntl_setlease, vfs_setlease and generic_setlease.  The check for no
other process having this file open should be common to all filesystems.
And it should be honoured by NFS (it shouldn't hand out a delegation if
a local user has the file open), so it needs to be in vfs_setlease().

Now I'm worrying about the mechanics of calling out to a filesystem to
perform a lock.  Obviously, a network filesystem is going to have to
sleep waiting for a response from the fileserver, so that precludes the
use of a spinlock held over the call to f_op->setlease.  I'm not keen on
the idea of EXPORT_SYMBOL_GPL() a spinlock -- verifying locking is quite
hard enough without worrying what a filesystem might be doing with it.

So I think we need to refactor the interface, and I'd like to hear your
thoughts on my ideas of how to handle this.

First, have clients of vfs_setlease call lease_alloc() and pass it in,
rather than allocate it on the stack and have this horrible interface
where we may pass back an allocated lock.  This allows us to not allocate
memory (hence sleep) during generic_setlease().

Second, move some of the functionality of generic_setlease() to
vfs_setlease(), as mentioned above.

Third, change the purpose of f_op->setlease.  We can rename it if you
like to catch any out of tree users.  I'd propose using it like this:

vfs_setlease()
   if (f_op->setlease())
      res = f_op->setlease()
      if (res)
         return res;
   lock_kernel()
   generic_setlease()
   unlock_kernel()

fcntl_setlease
   if (f_op->setlease())
      res = f_op->setlease()
      if (res)
         return res;
   lock_kernel
   generic_setlease()
   ... fasync ...
   unlock_kernel

So 'f_op->setlease' now means "Try to get a lease from the fileserver".
We can optimise this a bit to not even call setlease if, say, we already
have a read lease and we're trying to get a second read lease.  But we
have to record our local leases (that way they'll show up in /proc/locks).

I think the combination of these three ideas gives us a sane interface
to the various setlease functions, and let us convert from lock_kernel()
to spin_lock() later.  Have I missed something?  I don't often think
about the needs of cluster filesystems, so I may misunderstand how they
need this operation to work.

At some point, we need to revisit the logic for 'is this file open
by another process' -- it's clearly buggy since it doesn't hold the
inode_lock while checking i_count, so it could race against an __iget().

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html