On Wed, Dec 15, 2010 at 10:00:47AM +1100, Dave Chinner wrote: > On Tue, Dec 14, 2010 at 01:18:01PM -0800, Paul E. McKenney wrote: > > On Mon, Dec 13, 2010 at 12:32:36PM +1100, Dave Chinner wrote: > > > From: Dave Chinner <dchinner@xxxxxxxxxx> > > > > > > With delayed logging greatly increasing the sustained parallelism of inode > > > operations, the inode cache locking is showing significant read vs write > > > contention when inode reclaim runs at the same time as lookups. There is > > > also a lot more write lock acquistions than there are read locks (4:1 ratio) > > > so the read locking is not really buying us much in the way of parallelism. > > > > > > To avoid the read vs write contention, change the cache to use RCU locking on > > > the read side. To avoid needing to RCU free every single inode, use the built > > > in slab RCU freeing mechanism. This requires us to be able to detect lookups of > > > freed inodes, so enÑure that ever freed inode has an inode number of zero and > > > the XFS_IRECLAIM flag set. We already check the XFS_IRECLAIM flag in cache hit > > > lookup path, but also add a check for a zero inode number as well. > > > > > > We canthen convert all the read locking lockups to use RCU read side locking > > > and hence remove all read side locking. > > > > OK, so the xfs_inode uses straight RCU, and there fore cannot be freed and > > immediately reallocated, while the inode itself uses SLAB_DESTROY_BY_RCU, > > which does allow the inode to be freed and immediately reallocated, > > correct? > > No, the struct inode is embedded in the struct xfs_inode, so they > have the same lifecycle. i.e. we don't separately allocate and free > the struct inode. So it is all using straight RCU. > > > Some questions and comments below. > > > > Thanx, Paul > > > > > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> > > > Reviewed-by: Alex Elder <aelder@xxxxxxx> > > > --- > > > fs/xfs/linux-2.6/xfs_sync.c | 27 ++++++++++++++++----- > > > fs/xfs/xfs_iget.c | 50 +++++++++++++++++++++++++++++++---------- > > > fs/xfs/xfs_inode.c | 52 +++++++++++++++++++++++++++++++++---------- > > > 3 files changed, 98 insertions(+), 31 deletions(-) > > > > > > diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c > > > index afb0d7c..5ee02d7 100644 > > > --- a/fs/xfs/linux-2.6/xfs_sync.c > > > +++ b/fs/xfs/linux-2.6/xfs_sync.c > > > @@ -53,14 +53,20 @@ xfs_inode_ag_walk_grab( > > > { > > > struct inode *inode = VFS_I(ip); > > > > > > + /* check for stale RCU freed inode */ > > > + spin_lock(&ip->i_flags_lock); > > > + if (!ip->i_ino) > > > + goto out_unlock_noent; > > > + > > > + /* avoid new or reclaimable inodes. Leave for reclaim code to flush */ > > > + if (__xfs_iflags_test(ip, XFS_INEW | XFS_IRECLAIMABLE | XFS_IRECLAIM)) > > > + goto out_unlock_noent; > > > + spin_unlock(&ip->i_flags_lock); > > > + > > > > OK, this works because the xfs_inode cannot be freed and reallocated (which > > the RCU change should enforce). It is not clear to me that the above > > would work if using the SLAB_DESTROY_BY_RCU approach, at least not unless > > some higher-level checks can reliably catch an inode changing identity > > due to quick free and reallocation. > > In this case, I don't believe it matters if the inode has changed > identity - we are walking for writeback, so if we get a reallocated > inode we'll write it back if it is dirty. If it has not been > reallocated or still being initialised, the !ip->i_ino and > XFS_INEW|XFS_IRECLAIM checks are sufficient to avoid using the inode. > > I probably should add a comment to this effect, yes? One vote in favor from me. ;-) > > Also, all calls to xfs_inode_ag_walk_grab() must be in RCU read-side > > critical sections... I suggest a debug check for rcu_read_lock_held() to > > catch any call paths that might have slipped through. > > Yes, good idea. > > > At first glance, > > it appears that RCU is replacing some of ->pag_ici_lock, but I could > > easily be mistaken. > > Correct, it is replacing the read side of the ->pag_ici_lock. OK, good! > > > @@ -639,6 +649,9 @@ xfs_reclaim_inode_grab( > > > struct xfs_inode *ip, > > > int flags) > > > { > > > + /* check for stale RCU freed inode */ > > > + if (!ip->i_ino) > > > + return 1; > > > > Does this mean that we need to be under rcu_read_lock() here? If not, > > how do we keep the inode from really being freed out from under us? > > Hmmm, I think I've mismerged a patch somewhere along the line. In > this patch the reclaim tree walk is under the ->pag_ici_lock(), when > in fact it should be under the rcu_read_lock(). Good catch, Paul. > > As it is, being under the ->pag_ici_lock means that the tree is > consistent and we won't be seeing RCU freed inodes in the walk. > Hence the code is functioning correctly, just not as wasss intended. > > > (Again, if we do need to be under rcu_read_lock(), I highly recommend > > a debug check for rcu_read_lock_held().) > > Yup, which would have caught the merge screwup... ;-) > .... > > > > > > + /* > > > + * check for re-use of an inode within an RCU grace period due to the > > > + * radix tree nodes not being updated yet. We monitor for this by > > > + * setting the inode number to zero before freeing the inode structure. > > > + * If the inode has been reallocated and set up, then the inode number > > > + * will not match, so check for that, too. > > > + */ > > > spin_lock(&ip->i_flags_lock); > > > + if (ip->i_ino != ino) { > > > + trace_xfs_iget_skip(ip); > > > + XFS_STATS_INC(xs_ig_frecycle); > > > + spin_unlock(&ip->i_flags_lock); > > > + rcu_read_unlock(); > > > + /* Expire the grace period so we don't trip over it again. */ > > > + synchronize_rcu(); > > > > Hmmm... Interesting. Wouldn't the fact that we acquired the same lock > > that was held after removing the inode guarantee that an immediate retry > > would manage not to find this same inode again? > > That is what I'm not sure of. I was more worried about resolving the > contents of the radix tree nodes, not so much the inode itself. If a > new traversal will resolve the tree correctly (which is what you are > implying), then synchronize_rcu() is not needed.... Here is the sequence of events that I believe must be in place: 1. Some CPU removes the vfs_inode from the radix tree, presumably while holding some lock whose identity does not matter here. 2. Before invoking call_rcu() on a given vfs_inode, this same CPU clears the inode number while holding ->i_flags_lock. If the CPU in #1 and #2 might be different, then either CPU #1 must hold ->i_flags_lock while removing the vfs_inode from the radix tree, or I don't understand how this can work even with the synchronize_rcu(). 3. Some CPU, possibly different than that in #1 and #2 above, executes the above code. If locking works correctly, it must see the earlier changes, so a subsequent access should see the results of #1 above, so that it won't see the element that was removed there. That said, I don't claim to understand either vfs or xfs very well, so I would be arbitrarily deeply confused. > > If this is not the case, then readers finding it again will not be > > protected by the RCU grace period, right? > > > > In short, I don't understand why the synchronize_rcu() is needed. > > If it is somehow helping, that sounds to me like it is covering up > > a real bug that should be fixed separately. > > It isn't covering up a bug, it was more tryingt o be consistent with > the rest of the xfs_inode lookup failures - we back off and try > again later. If that is unnecessary resolve the RCU lookup race, > then it can be dropped. OK, please let me know whether my sequence of steps above makes sense. > > > @@ -397,7 +423,7 @@ xfs_iget( > > > xfs_agino_t agino; > > > > > > /* reject inode numbers outside existing AGs */ > > > - if (XFS_INO_TO_AGNO(mp, ino) >= mp->m_sb.sb_agcount) > > > + if (!ino || XFS_INO_TO_AGNO(mp, ino) >= mp->m_sb.sb_agcount) > > > > For the above check to be safe, don't we need to already be in an > > RCU read-side critical section? Or is something else protecting us? > > "ino" is the inode number used as the lookup key to find the struct > xfs_inode. This is ensuring we don't try to look up an inode number > of zero given it's new special meaning as a freed inode. Hence it > can be safely validated outside the RCU read-side critical sectioni > as it is a constant. Ah, OK, got it! Thanx, Paul > Thanks for the review, Paul. I'll fix up the issues you've pointed > out and retest. > > Cheers, > > Dave. > -- > Dave Chinner > david@xxxxxxxxxxxxx _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs