From: Dave Chinner <dchinner@xxxxxxxxxx>

Currently we have an optimisation in place in unlock_new_inode() to avoid
taking the inode_lock. This uses memory barriers to ensure that the
clearing of I_NEW can be seen by all other CPUs. It also means the
wake_up_inode() call relies on a memory barrier to ensure that the
processes it is waking see any changes to the i_state field made prior to
the wakeup being issued.

This serialisation is also necessary to protect the inode against
lookup/freeing races once the inode_lock is removed. The current lookup
and wait code is atomic as it is all performed under the inode_lock, while
the modified code now splits the locks. Hence we can get the following
race:

Thread 1:                           Thread 2:
iput_final
  spin_lock(&inode->i_lock);
                                    hlist_bl_lock()
                                    find inode
                                    spin_lock(&inode->i_lock)
                                    ......
  inode->i_state = I_FREEING;
  spin_unlock(&inode->i_lock);
  remove_inode_hash()
    hlist_bl_lock()
    ......
                                    if (I_FREEING)
                                      hlist_bl_unlock()
                                    ......
    hlist_bl_del_init()
    hlist_bl_unlock()
  wake_up_inode()
                                    __wait_on_freeing_inode()
                                      put on waitqueue
                                      spin_unlock(&inode->i_lock);
                                      schedule()

Here Thread 1 issues the wakeup before Thread 2 has put itself on the
waitqueue, so Thread 2 sleeps forever.

To avoid this race, wake ups need to be serialised against the waiters,
and using inode->i_lock is the natural solution. The memory barrier
optimisation is no longer needed to avoid global inode_lock contention as
the i_state field is now protected by inode->i_lock. Hence we can revert
the code to a much simpler form and correctly serialise wake ups against
__wait_on_freeing_inode() by holding the inode->i_lock while we do the
wakeup.

Given the newfound simplicity of wake_up_inode() and the fact that we
need to change i_state in unlock_new_inode() before the wakeup, just open
code the wakeup in the couple of spots it is used and remove
wake_up_inode() entirely.
Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
---
 fs/inode.c |   41 ++++++++++++++++++-----------------------
 1 files changed, 18 insertions(+), 23 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 32bac58..ba514a1 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -175,15 +175,6 @@ int proc_nr_inodes(ctl_table *table, int write,
 }
 #endif
 
-static void wake_up_inode(struct inode *inode)
-{
-	/*
-	 * Prevent speculative execution through spin_unlock(&inode->i_lock);
-	 */
-	smp_mb();
-	wake_up_bit(&inode->i_state, __I_NEW);
-}
-
 /**
  * inode_init_always - perform inode structure intialisation
  * @sb: superblock inode belongs to
@@ -540,7 +531,9 @@ static void dispose_list(struct list_head *head)
 		__inode_sb_list_del(inode);
 		spin_unlock(&inode_lock);
 
-		wake_up_inode(inode);
+		spin_lock(&inode->i_lock);
+		wake_up_bit(&inode->i_state, __I_NEW);
+		spin_unlock(&inode->i_lock);
 		destroy_inode(inode);
 	}
 }
@@ -862,6 +855,13 @@ struct inode *new_inode(struct super_block *sb)
 }
 EXPORT_SYMBOL(new_inode);
 
+/**
+ * unlock_new_inode - clear the I_NEW state and wake up any waiters
+ * @inode: new inode to unlock
+ *
+ * Called when the inode is fully initialised to clear the new state of the
+ * inode and wake up anyone waiting for the inode to finish initialisation.
+ */
 void unlock_new_inode(struct inode *inode)
 {
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
@@ -881,19 +881,11 @@ void unlock_new_inode(struct inode *inode)
 		}
 	}
 #endif
-	/*
-	 * This is special! We do not need the spinlock when clearing I_NEW,
-	 * because we're guaranteed that nobody else tries to do anything about
-	 * the state of the inode when it is locked, as we just created it (so
-	 * there can be no old holders that haven't tested I_NEW).
-	 * However we must emit the memory barrier so that other CPUs reliably
-	 * see the clearing of I_NEW after the other inode initialisation has
-	 * completed.
-	 */
-	smp_mb();
+	spin_lock(&inode->i_lock);
 	WARN_ON(!(inode->i_state & I_NEW));
 	inode->i_state &= ~I_NEW;
-	wake_up_inode(inode);
+	wake_up_bit(&inode->i_state, __I_NEW);
+	spin_unlock(&inode->i_lock);
 }
 EXPORT_SYMBOL(unlock_new_inode);
 
@@ -1486,8 +1478,10 @@ static void iput_final(struct inode *inode)
 		spin_unlock(&inode_lock);
 		evict(inode);
 		remove_inode_hash(inode);
-		wake_up_inode(inode);
+		spin_lock(&inode->i_lock);
+		wake_up_bit(&inode->i_state, __I_NEW);
 		BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
+		spin_unlock(&inode->i_lock);
 		destroy_inode(inode);
 	}
 
@@ -1690,7 +1684,8 @@ EXPORT_SYMBOL(inode_wait);
  * to recheck inode state.
  *
  * It doesn't matter if I_NEW is not set initially, a call to
- * wake_up_inode() after removing from the hash list will DTRT.
+ * wake_up_bit(&inode->i_state, __I_NEW) with the i_lock held after
+ * removing from the hash list will DTRT.
  *
  * This is called with inode_lock held.
  *
-- 
1.7.1