On Wed, Dec 15, 2010 at 01:50:49PM +1100, Dave Chinner wrote:
> On Tue, Dec 14, 2010 at 05:05:36PM -0800, Paul E. McKenney wrote:
> > On Wed, Dec 15, 2010 at 10:00:47AM +1100, Dave Chinner wrote:
> > > On Tue, Dec 14, 2010 at 01:18:01PM -0800, Paul E. McKenney wrote:
> > > > On Mon, Dec 13, 2010 at 12:32:36PM +1100, Dave Chinner wrote:
> > > > > +	/*
> > > > > +	 * check for re-use of an inode within an RCU grace period due to the
> > > > > +	 * radix tree nodes not being updated yet. We monitor for this by
> > > > > +	 * setting the inode number to zero before freeing the inode structure.
> > > > > +	 * If the inode has been reallocated and set up, then the inode number
> > > > > +	 * will not match, so check for that, too.
> > > > > +	 */
> > > > >  	spin_lock(&ip->i_flags_lock);
> > > > > +	if (ip->i_ino != ino) {
> > > > > +		trace_xfs_iget_skip(ip);
> > > > > +		XFS_STATS_INC(xs_ig_frecycle);
> > > > > +		spin_unlock(&ip->i_flags_lock);
> > > > > +		rcu_read_unlock();
> > > > > +		/* Expire the grace period so we don't trip over it again. */
> > > > > +		synchronize_rcu();
> > > >
> > > > Hmmm...  Interesting.  Wouldn't the fact that we acquired the same lock
> > > > that was held after removing the inode guarantee that an immediate retry
> > > > would manage not to find this same inode again?
> > >
> > > That is what I'm not sure of. I was more worried about resolving the
> > > contents of the radix tree nodes, not so much the inode itself. If a
> > > new traversal will resolve the tree correctly (which is what you are
> > > implying), then synchronize_rcu() is not needed....
> >
> > Here is the sequence of events that I believe must be in place:
>
> s/vfs_inode/xfs_inode/g

Good point!

> > 1.	Some CPU removes the vfs_inode from the radix tree, presumably
> > 	while holding some lock whose identity does not matter here.
>
> Yes, the ->pag_ici_lock.

OK.

> > 2.	Before invoking call_rcu() on a given vfs_inode, this same
> > 	CPU clears the inode number while holding ->i_flags_lock.
>
> Not necessarily the same CPU - there are points where we take
> sleeping locks for synchronisation with any remaining users, and
> we don't use preempt_disable() to prevent a change of CPU on a
> preemptible kernel.
>
> > 	If the CPU in #1 and #2 might be different, then either
> > 	CPU #1 must hold ->i_flags_lock while removing the vfs_inode
> > 	from the radix tree, or I don't understand how this can work
> > 	even with the synchronize_rcu().
>
> I'm probably missing something, but why does the CPU we run
> call_rcu() to free the inode on matter w.r.t. which CPU it was
> deleted from the radix tree on?

I was oversimplifying.  What matters is that the item was deleted from
the radix tree unambiguously before it was passed to call_rcu().  There
are a number of ways to make this happen:

1.	Do the removal and the call_rcu() on the same CPU, in that
	order.

2.	Do the removal while holding a given lock, and do the
	call_rcu() under a later critical section for that same lock.

3.	Do the removal while holding lock A on CPU 1, then later
	acquire (and release) lock B on CPU 1, and then do the
	call_rcu() after a later acquisition of lock B on some other
	CPU.

There are a bunch of other variations on this theme, but the key
requirement is again that the call_rcu() happen unambiguously after
the removal.  Otherwise, how is the grace period supposed to guarantee
that all RCU readers that might be accessing the removed xfs_inode
really have completed?
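For concreteness, option 1 might look something like the sketch below.
This is only an illustration of the ordering pattern, not the actual
XFS code: the helper names, the i_rcu field, the inode cache, and the
plain-spinlock tree locking are all assumptions made for the example.

#include <linux/radix-tree.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
/* XFS-internal types (struct xfs_inode, struct xfs_perag) assumed. */

extern struct kmem_cache *xfs_inode_cachep;	/* cache name assumed */

static void xfs_inode_free_callback(struct rcu_head *head)
{
	struct xfs_inode *ip = container_of(head, struct xfs_inode, i_rcu);

	kmem_cache_free(xfs_inode_cachep, ip);
}

static void xfs_inode_free(struct xfs_perag *pag, struct xfs_inode *ip)
{
	/* 1. Remove the inode from the radix tree under the tree lock. */
	spin_lock(&pag->pag_ici_lock);
	radix_tree_delete(&pag->pag_ici_root,
			  XFS_INO_TO_AGINO(ip->i_mount, ip->i_ino));
	spin_unlock(&pag->pag_ici_lock);

	/*
	 * 2. Poison the inode number under ->i_flags_lock so that a
	 *    concurrent lookup still traversing stale radix tree nodes
	 *    can detect the reuse.
	 */
	spin_lock(&ip->i_flags_lock);
	ip->i_ino = 0;
	spin_unlock(&ip->i_flags_lock);

	/*
	 * 3. Only now hand the inode to RCU.  The call_rcu() happens
	 *    unambiguously after the removal on this CPU, so the grace
	 *    period covers every reader that could still find the inode.
	 */
	call_rcu(&ip->i_rcu, xfs_inode_free_callback);
}

The point is the ordering of steps 1-3, not the CPU they run on:
options 2 and 3 move the call_rcu() to another CPU, but preserve the
"unambiguously after" property through the lock-based ordering.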
> There is this comment in include/linux/radix-tree.h:
>
>  * It is still required that the caller manage the synchronization and lifetimes
>  * of the items. So if RCU lock-free lookups are used, typically this would mean
>  * that the items have their own locks, or are amenable to lock-free access; and
>  * that the items are freed by RCU (or only freed after having been deleted from
>  * the radix tree *and* a synchronize_rcu() grace period).
>
> There is nothing there that mentions the items need to be freed on
> the same CPU as they were removed from the radix tree, or that the
> item lock needs to be held when the item is removed from the tree.
> AFAICT, the XFS code is following these guidelines.

Well, the grace period (from either synchronize_rcu() or call_rcu())
does need to start unambiguously after the deletion from the radix
tree.  Should we upgrade the comment?

> FWIW, this is where I got the idea of using synchronize_rcu() to
> ensure a lookup retry wouldn't see the same freed inode. I'm
> thinking that simply re-running the lookup will give the same
> guarantee because of the memory barriers in the radix tree lookup
> code...

Maybe...  But we do need to be sure, right?

> > 3.	Some CPU, possibly different than that in #1 and #2 above,
> > 	executes the above code.  If locking works correctly, it
> > 	must see the earlier changes, so a subsequent access should
> > 	see the results of #1 above, so that it won't see the element
> > 	that was removed there.
>
> Isn't the item validity considered separately from the tree traversal
> consistency? i.e. the item lock (i.e. ->i_flags_lock) provides a
> safe item validity check via the unlock->lock memory barrier, whilst
> the radix tree uses rcu_dereference() to provide memory barriers
> against the modifications?

But the safe validity check assumes that the RCU grace period starts
unambiguously after the item has been removed.  Violate that
assumption, and all bets are off.

> > That said, I don't claim to understand either vfs or xfs very well,
> > so I would be arbitrarily deeply confused.
>
> We might both be confused ;)

Sounds like the most likely possibility, now that you mention it.  ;-)
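Just to make sure we are arguing about the same picture, the
lookup-side pattern as I understand it looks roughly like the sketch
below -- assumed function and field names again, not the patch itself.
The comment in the retry path marks exactly where the
bare-retry-vs-synchronize_rcu() question lives.

static struct xfs_inode *
xfs_iget_cache_lookup(struct xfs_mount *mp, struct xfs_perag *pag,
		      xfs_ino_t ino)
{
	struct xfs_inode *ip;

again:
	rcu_read_lock();
	ip = radix_tree_lookup(&pag->pag_ici_root,
			       XFS_INO_TO_AGINO(mp, ino));
	if (!ip) {
		rcu_read_unlock();
		return NULL;		/* cache miss */
	}

	spin_lock(&ip->i_flags_lock);
	if (ip->i_ino != ino) {
		/*
		 * The inode was freed (i_ino poisoned to zero) or has
		 * been recycled for a different inode number while we
		 * walked stale radix tree nodes.  Is this bare retry
		 * sufficient, or is a synchronize_rcu() needed first?
		 */
		spin_unlock(&ip->i_flags_lock);
		rcu_read_unlock();
		goto again;
	}
	/* ... take a reference while i_ino is known to match ... */
	spin_unlock(&ip->i_flags_lock);
	rcu_read_unlock();
	return ip;
}

							Thanx, Paul

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs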