On Fri, Oct 13, 2006 at 10:18:56AM +0200, Jens Axboe wrote: > > Taking the lowest-addressed lock first is the usual convention, if we really > > have to do that. I'm not aware of anywhere else where we pull this > > trick with i_mutex though. > > It is, but people had concerns with that approach when it was originally > done for pipe tee'ing. Things like lock_rename() look scary. Yeah, I'm not too thrilled with the double locking either. Eventually I think we'll need some _nolock variant of generic_file_splice_write() so that OCFS2 can order cluster locks within i_mutex (much like you see with ocfs2_file_aio_write(), etc). At the very least, I'd rather not have this stuff open coded everywhere. If we consolidate the behavior within a set of functions, we'll be able to avoid questions later about locking order, etc. I felt dirty just typing up this patch. Sigh. Let me know what you think. --Mark From: Mark Fasheh <mark.fasheh@xxxxxxxxxx> [PATCH] Take i_mutex in splice_from_pipe() The splice_actor may be calling ->prepare_write() and ->commit_write(). We want i_mutex on the inode being written to before calling those so that we don't race i_size changes. The double locking behavior is done elsewhere in splice.c, and if we eventually want _nolock variants of generic_file_splice_write(), fs modules might have to replicate the nasty locking code. We introduce inode_double_lock() and inode_double_unlock() to consolidate the locking rules into one set of functions. Signed-off-by: Mark Fasheh <mark.fasheh@xxxxxxxxxx> --- fs/splice.c | 24 +++++++++++------------- include/linux/fs.h | 29 +++++++++++++++++++++++++++++ 2 files changed, 40 insertions(+), 13 deletions(-) diff --git a/fs/splice.c b/fs/splice.c index 13e92dd..4d8a774 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -713,6 +713,7 @@ ssize_t splice_from_pipe(struct pipe_ino { int ret, do_wakeup, err; struct splice_desc sd; + struct inode *inode = out->f_mapping->host; ret = 0; do_wakeup = 0; @@ -722,8 +723,13 @@ ssize_t splice_from_pipe(struct pipe_ino sd.file = out; sd.pos = *ppos; - if (pipe->inode) - mutex_lock(&pipe->inode->i_mutex); + /* + * The actor worker might be calling ->prepare_write and + * ->commit_write. Most of the time, these expect i_mutex to + * be held. Since this may result in an ABBA deadlock with + * pipe->inode, we have to order lock acquiry here. + */ + inode_double_lock(inode, pipe->inode); for (;;) { if (pipe->nrbufs) { @@ -797,8 +803,7 @@ ssize_t splice_from_pipe(struct pipe_ino pipe_wait(pipe); } - if (pipe->inode) - mutex_unlock(&pipe->inode->i_mutex); + inode_double_unlock(inode, pipe->inode); if (do_wakeup) { smp_mb(); @@ -1400,13 +1405,7 @@ static int link_pipe(struct pipe_inode_i * grabbing by inode address. Otherwise two different processes * could deadlock (one doing tee from A -> B, the other from B -> A). */ - if (ipipe->inode < opipe->inode) { - mutex_lock_nested(&ipipe->inode->i_mutex, I_MUTEX_PARENT); - mutex_lock_nested(&opipe->inode->i_mutex, I_MUTEX_CHILD); - } else { - mutex_lock_nested(&opipe->inode->i_mutex, I_MUTEX_PARENT); - mutex_lock_nested(&ipipe->inode->i_mutex, I_MUTEX_CHILD); - } + inode_double_lock(ipipe->inode, opipe->inode); do { if (!opipe->readers) { @@ -1450,8 +1449,7 @@ static int link_pipe(struct pipe_inode_i i++; } while (len); - mutex_unlock(&ipipe->inode->i_mutex); - mutex_unlock(&opipe->inode->i_mutex); + inode_double_unlock(ipipe->inode, opipe->inode); /* * If we put data in the output pipe, wakeup any potential readers. diff --git a/include/linux/fs.h b/include/linux/fs.h index 34406ed..31d8548 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -624,6 +624,35 @@ enum inode_i_mutex_lock_class }; /* + * We rarely want to lock two inodes that do not have a parent/child + * relationship (such as directory, child inode) simultaneously. The + * vast majority of file systems should be able to get along fine + * without this. Do not use these functions except as a last resort. + */ +static inline void inode_double_lock(struct inode *inode1, struct inode *inode2) +{ + if (!inode2) { + mutex_lock(&inode1->i_mutex); + return; + } + + if (inode1 < inode2) { + mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT); + mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD); + } else { + mutex_lock_nested(&inode2->i_mutex, I_MUTEX_PARENT); + mutex_lock_nested(&inode1->i_mutex, I_MUTEX_CHILD); + } +} + +static inline void inode_double_unlock(struct inode *inode1, struct inode *inode2) +{ + mutex_unlock(&inode1->i_mutex); + if (inode2) + mutex_unlock(&inode2->i_mutex); +} + +/* * NOTE: in a 32bit arch with a preemptable kernel and * an UP compile the i_size_read/write must be atomic * with respect to the local cpu (unlike with preempt disabled), -- 1.4.2.3 - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html