On Wed, Sep 26, 2012 at 10:54:09AM +1000, Dave Chinner wrote:
> On Tue, Sep 25, 2012 at 04:59:55PM +0800, Guo Chao wrote:
> > On Mon, Sep 24, 2012 at 06:26:54PM +1000, Dave Chinner wrote:
> > > @@ -783,14 +783,19 @@ static void __wait_on_freeing_inode(struct inode *inode);
> > >  static struct inode *find_inode(struct super_block *sb,
> > >  				struct hlist_head *head,
> > >  				int (*test)(struct inode *, void *),
> > > -				void *data)
> > > +				void *data, bool locked)
> > >  {
> > >  	struct hlist_node *node;
> > >  	struct inode *inode = NULL;
> > >
> > >  repeat:
> > > -	hlist_for_each_entry(inode, node, head, i_hash) {
> > > +	rcu_read_lock();
> > > +	hlist_for_each_entry_rcu(inode, node, head, i_hash) {
> > >  		spin_lock(&inode->i_lock);
> > > +		if (inode_unhashed(inode)) {
> > > +			spin_unlock(&inode->i_lock);
> > > +			continue;
> > > +		}
> >
> > Is this check too early? If the unhashed inode happens to be the target
> > inode, we waste time continuing the traversal, and we do not wait on it.
>
> If the inode is unhashed, then it is already passing through evict()
> or has already passed through. If it has already passed through
> evict() then it is too late to call __wait_on_freeing_inode(), as the
> wakeup occurs in evict() immediately after the inode is removed
> from the hash. i.e.:
>
> 	remove_inode_hash(inode);
>
> 	spin_lock(&inode->i_lock);
> 	wake_up_bit(&inode->i_state, __I_NEW);
> 	BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
> 	spin_unlock(&inode->i_lock);
>
> i.e. if we get the case:
>
> Thread 1, RCU hash traversal		Thread 2, evicting foo
>
> rcu_read_lock()
> found inode foo
> 					remove_inode_hash(inode);
> 					spin_lock(&foo->i_lock);
> 					wake_up(I_NEW)
> 					spin_unlock(&foo->i_lock);
> 					destroy_inode()
> 					......
> spin_lock(foo->i_lock)
> match sb, ino
> I_FREEING
> rcu_read_unlock()
>
> <rcu grace period can expire at any time now,
>  so use after free is guaranteed at some point>
>
> wait_on_freeing_inode
> wait_on_bit(I_NEW)
>
> <will never get woken>
>
> Hence if the inode is unhashed, it doesn't matter what inode it is,
> it is never valid to use it any further, because it may have already
> been freed. The only reason we can safely access it here is that
> the RCU grace period will not expire until we call
> rcu_read_unlock().
>

Yeah, looks right.

> > > @@ -1078,8 +1098,7 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino)
> > >  		struct inode *old;
> > >
> > >  		spin_lock(&inode_hash_lock);
> > > -		/* We released the lock, so.. */
> > > -		old = find_inode_fast(sb, head, ino);
> > > +		old = find_inode_fast(sb, head, ino, true);
> > >  		if (!old) {
> > >  			inode->i_ino = ino;
> > >  			spin_lock(&inode->i_lock);
> >
> > Emmmm ... couldn't we use a memory barrier API instead of an irrelevant
> > spin lock on the newly allocated inode to publish I_NEW?
>
> Yes, we could.
>
> However, having multiple synchronisation methods for a single
> variable that should only be used in certain circumstances is
> something that is easy to misunderstand and get wrong. Memory
> barriers are much more subtle and harder to understand than spin
> locks, and every memory barrier needs to be commented to explain
> what the barrier is actually protecting against.
>
> In the case where a spin lock is guaranteed to be uncontended and
> the cache line hot in the CPU cache, it makes no sense to replace
> the spin lock with a memory barrier, especially when every other
> place we modify the i_state/i_hash fields we have to wrap them
> with i_lock....
>
> Simple code is good code - save the complexity for something that
> needs it.

Emmm, I doubt "it's simpler and needs no documentation".
I bet someday someone will stand up and ask "why take a spin lock on an
inode which apparently is not subject to any race condition?".

> I know that the per-sb inode lru lock is currently the hottest of the
> inode cache locks (performance limiting at somewhere in the range of
> 8-16way workloads on XFS), and I've got work in (slow) progress to
> address that. That work will also address the per-sb dentry LRU
> locks, which are the hottest dentry cache locks as well.

Glad to hear that.

Thank you for all your explanations, especially the historical ones.

Regards,
Guo Chao

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html