On Fri, 4 Jan 2008, Matthew Wilcox wrote: > > Hi Bruce, > > The current implementation of vfs_setlease/generic_setlease/etc is a > bit quirky. I've been thinking it over for the past couple of days, > and I think we need to refactor it to work sensibly. > > As you may have noticed, I've been mulling over getting rid of the > BKL in fs/locks.c and the lease interface is particularly problematic. > Here's one reason why: > > fcntl_setlease > lock_kernel > vfs_setlease > lock_kernel > generic_setlease > unlock_kernel > fasync_helper > unlock_kernel > > This is perfectly legitimate for the BKL of course, but other spinlocks > can't be acquired recursively in this way. At first I thought I'd just > push the call to generic_setlease into __vfs_setlease and have two ways > of calling it: > > fcntl_setlease > lock_kernel > __vfs_setlease > fasync_helper > unlock_kernel > > vfs_setlease > lock_kernel > __vfs_setlease > unlock_kernel > > But generic_setlease allocates memory (GFP_KERNEL) under the lock, so > that's bad for converting to spnlocks later. Then I thought I'd move > the spinlock acquisition into generic_setlease(). But I can't do that > either as fcntl_setlease() needs to hold the spinlock around the call > to generic_setlease() and fasync_helper(). > > Then I started to wonder about the current split of functionality between > fcntl_setlease, vfs_setlease and generic_setlease. The check for no > other process having this file open should be common to all filesystems. > And it should be honoured by NFS (it shouldn't hand out a delegation if > a local user has the file open), so it needs to be in vfs_setlease(). > > Now I'm worrying about the mechanics of calling out to a filesystem to > perform a lock. Obviously, a network filesystem is going to have to > sleep waiting for a response from the fileserver, so that precludes the > use of a spinlock held over the call to f_op->setlease. I'm not keen on > the idea of EXPORT_SYMBOL_GPL() a spinlock -- verifying locking is quite > hard enough without worrying what a filesystem might be doing with it. > > So I think we need to refactor the interface, and I'd like to hear your > thoughts on my ideas of how to handle this. > > First, have clients of vfs_setlease call lease_alloc() and pass it in, > rather than allocate it on the stack and have this horrible interface > where we may pass back an allocated lock. This allows us to not allocate > memory (hence sleep) during generic_setlease(). > > Second, move some of the functionality of generic_setlease() to > vfs_setlease(), as mentioned above. > > Third, change the purpose of f_op->setlease. We can rename it if you > like to catch any out of tree users. I'd propose using it like this: fwiw, i've done some work on extending the lease subsystem to help support the full range of requirements for NFSv4 file and directory delegations (e.g., breaking a lease when unlinking a file) and we ended up actually doing most of what you've just suggested here, which i take to be a good sign. most of my refactoring came out of trying to simplify locking and avoid holding locks too long (rather than specifically focusing on cluster-oriented stuff, but the goals dovetail) and your work on getting the BKL out of locks.c entirely is something i really like and look forward to. thanks, d . > vfs_setlease() > if (f_op->setlease()) > res = f_op->setlease() > if (res) > return res; > lock_kernel() > generic_setlease() > unlock_kernel() > > fcntl_setlease > if (f_op->setlease()) > res = f_op->setlease() > if (res) > return res; > lock_kernel > generic_setlease() > ... fasync ... > unlock_kernel > > So 'f_op->setlease' now means "Try to get a lease from the fileserver". > We can optimise this a bit to not even call setlease if, say, we already > have a read lease and we're trying to get a second read lease. But we > have to record our local leases (that way they'll show up in /proc/locks). > > I think the combination of these three ideas gives us a sane interface > to the various setlease functions, and let us convert from lock_kernel() > to spin_lock() later. Have I missed something? I don't often think > about the needs of cluster filesystems, so I may misunderstand how they > need this operation to work. > > At some point, we need to revisit the logic for 'is this file open > by another process' -- it's clearly buggy since it doesn't hold the > inode_lock while checking i_count, so it could race against an __iget(). > > -- > Intel are signing my paycheques ... these opinions are still mine > "Bill, look, we understand that you're interested in selling us this > operating system, but compare it to ours. We can't possibly take such > a retrograde step." > - > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html > - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html