On Wed, Sep 26, 2012 at 10:54:09AM +1000, Dave Chinner wrote:
> On Tue, Sep 25, 2012 at 04:59:55PM +0800, Guo Chao wrote:
> > On Mon, Sep 24, 2012 at 06:26:54PM +1000, Dave Chinner wrote:
> > > @@ -783,14 +783,19 @@ static void __wait_on_freeing_inode(struct inode *inode);
> > >  static struct inode *find_inode(struct super_block *sb,
> > >  				struct hlist_head *head,
> > >  				int (*test)(struct inode *, void *),
> > > -				void *data)
> > > +				void *data, bool locked)
> > >  {
> > >  	struct hlist_node *node;
> > >  	struct inode *inode = NULL;
> > >
> > >  repeat:
> > > -	hlist_for_each_entry(inode, node, head, i_hash) {
> > > +	rcu_read_lock();
> > > +	hlist_for_each_entry_rcu(inode, node, head, i_hash) {
> > >  		spin_lock(&inode->i_lock);
> > > +		if (inode_unhashed(inode)) {
> > > +			spin_unlock(&inode->i_lock);
> > > +			continue;
> > > +		}
> >
> > Is this check too early? If the unhashed inode happens to be the target
> > inode, we waste time continuing the traversal, and we do not wait on it.
>
> If the inode is unhashed, then it is already passing through evict()
> or has already passed through. If it has already passed through
> evict() then it is too late to call __wait_on_freeing_inode(), as the
> wakeup occurs in evict() immediately after the inode is removed
> from the hash. i.e.:
>
> 	remove_inode_hash(inode);
>
> 	spin_lock(&inode->i_lock);
> 	wake_up_bit(&inode->i_state, __I_NEW);
> 	BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
> 	spin_unlock(&inode->i_lock);
>
> i.e. if we get the case:
>
> Thread 1, RCU hash traversal		Thread 2, evicting foo
>
> rcu_read_lock()
> found inode foo
> 					remove_inode_hash(inode);
> 					spin_lock(&foo->i_lock);
> 					wake_up(I_NEW)
> 					spin_unlock(&foo->i_lock);
> 					destroy_inode()
> 					......
> spin_lock(foo->i_lock)
> match sb, ino
> I_FREEING
> rcu_read_unlock()
>
> <rcu grace period can expire at any time now,
>  so use after free is guaranteed at some point>
>
> wait_on_freeing_inode
> wait_on_bit(I_NEW)
>
> <will never get woken>
>
> Hence if the inode is unhashed, it doesn't matter what inode it is,
> it is never valid to use it any further, because it may have already
> been freed. The only reason we can safely access it here is that
> the RCU grace period will not expire until we call
> rcu_read_unlock().
>

Yeah, looks right.

> > > @@ -1078,8 +1098,7 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino)
> > >  		struct inode *old;
> > >
> > >  		spin_lock(&inode_hash_lock);
> > > -		/* We released the lock, so.. */
> > > -		old = find_inode_fast(sb, head, ino);
> > > +		old = find_inode_fast(sb, head, ino, true);
> > >  		if (!old) {
> > >  			inode->i_ino = ino;
> > >  			spin_lock(&inode->i_lock);
> >
> > Emmmm ... couldn't we use a memory barrier API instead of an irrelevant
> > spin lock on the newly allocated inode to publish I_NEW?
>
> Yes, we could.
>
> However, having multiple synchronisation methods for a single
> variable that should only be used in certain circumstances is
> something that is easy to misunderstand and get wrong. Memory
> barriers are much more subtle and harder to understand than spin
> locks, and every memory barrier needs to be commented to explain
> what the barrier is actually protecting against.
>
> In the case where a spin lock is guaranteed to be uncontended and
> the cache line hot in the CPU cache, it makes no sense to replace
> the spin lock with a memory barrier, especially when every other
> place we modify the i_state/i_hash fields we have to wrap them
> with i_lock....
>
> Simple code is good code - save the complexity for something that
> needs it.

Emmm, I doubt "it's simpler and needs no documentation".
I bet someday someone will stand up and ask "why take a spin lock on an
inode which apparently is not subject to any race condition?".

> I know that the per-sb inode lru lock is currently the hottest of the
> inode cache locks (performance limiting at somewhere in the range of
> 8-16way workloads on XFS), and I've got work in (slow) progress to
> address that. That work will also address the per-sb dentry LRU
> locks, which are the hottest dentry cache locks as well.

Glad to hear that.

Thank you for all your explanations, especially the historical ones.

Regards,
Guo Chao

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html