On Wed, Dec 15, 2010 at 01:50:49PM +1100, Dave Chinner wrote:
> On Tue, Dec 14, 2010 at 05:05:36PM -0800, Paul E. McKenney wrote:
> > On Wed, Dec 15, 2010 at 10:00:47AM +1100, Dave Chinner wrote:
> > > On Tue, Dec 14, 2010 at 01:18:01PM -0800, Paul E. McKenney wrote:
> > > > On Mon, Dec 13, 2010 at 12:32:36PM +1100, Dave Chinner wrote:
> > > > > +	/*
> > > > > +	 * check for re-use of an inode within an RCU grace period due to the
> > > > > +	 * radix tree nodes not being updated yet. We monitor for this by
> > > > > +	 * setting the inode number to zero before freeing the inode structure.
> > > > > +	 * If the inode has been reallocated and set up, then the inode number
> > > > > +	 * will not match, so check for that, too.
> > > > > +	 */
> > > > >  	spin_lock(&ip->i_flags_lock);
> > > > > +	if (ip->i_ino != ino) {
> > > > > +		trace_xfs_iget_skip(ip);
> > > > > +		XFS_STATS_INC(xs_ig_frecycle);
> > > > > +		spin_unlock(&ip->i_flags_lock);
> > > > > +		rcu_read_unlock();
> > > > > +		/* Expire the grace period so we don't trip over it again. */
> > > > > +		synchronize_rcu();
> > > >
> > > > Hmmm...  Interesting.  Wouldn't the fact that we acquired the same lock
> > > > that was held after removing the inode guarantee that an immediate retry
> > > > would manage not to find this same inode again?
> > >
> > > That is what I'm not sure of. I was more worried about resolving the
> > > contents of the radix tree nodes, not so much the inode itself. If a
> > > new traversal will resolve the tree correctly (which is what you are
> > > implying), then synchronize_rcu() is not needed....
> >
> > Here is the sequence of events that I believe must be in place:
>
> s/vfs_inode/xfs_inode/g

Good point!

> > 1.	Some CPU removes the vfs_inode from the radix tree, presumably
> > 	while holding some lock whose identity does not matter here.
>
> Yes, the ->pag_ici_lock.

OK.

> > 2.	Before invoking call_rcu() on a given vfs_inode, this same
> > 	CPU clears the inode number while holding ->i_flags_lock.
>
> Not necessarily the same CPU - there are points where we take
> sleeping locks for synchronisation with any remaining users, and
> we don't use preempt_disable() to prevent a change of CPU on a
> preemptible kernel.
>
> > 	If the CPU in #1 and #2 might be different, then either
> > 	CPU #1 must hold ->i_flags_lock while removing the vfs_inode
> > 	from the radix tree, or I don't understand how this can work
> > 	even with the synchronize_rcu().
>
> I'm probably missing something, but why does the CPU we run
> call_rcu() to free the inode on matter w.r.t. which CPU it was
> deleted from the radix tree on?

I was oversimplifying.  What matters is that the item was deleted from
the radix tree unambiguously before it was passed to call_rcu().  There
are a number of ways to make this happen:

1.	Do the removal and the call_rcu() on the same CPU, in that
	order.

2.	Do the removal while holding a given lock, and do the
	call_rcu() under a later critical section for that same lock.

3.	Do the removal while holding lock A on CPU 1, then later
	acquire (and release) lock B on CPU 1, and then do the
	call_rcu() after a later acquisition of lock B on some other
	CPU.

There are a bunch of other variations on this theme, but the key
requirement is again that the call_rcu() happen unambiguously after
the removal.  Otherwise, how is the grace period supposed to guarantee
that all RCU readers that might be accessing the removed xfs_inode
really have completed?
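For concreteness, option 1 might look something like the sketch below.
This is only an illustration of the ordering pattern, not the actual
XFS code: the helper names, the i_rcu field, the inode cache, and the
plain-spinlock tree locking are all assumptions made for the example.

#include <linux/radix-tree.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
/* XFS-internal types (struct xfs_inode, struct xfs_perag) assumed. */

extern struct kmem_cache *xfs_inode_cachep;	/* cache name assumed */

static void xfs_inode_free_callback(struct rcu_head *head)
{
	struct xfs_inode *ip = container_of(head, struct xfs_inode, i_rcu);

	kmem_cache_free(xfs_inode_cachep, ip);
}

static void xfs_inode_free(struct xfs_perag *pag, struct xfs_inode *ip)
{
	/* 1. Remove the inode from the radix tree under the tree lock. */
	spin_lock(&pag->pag_ici_lock);
	radix_tree_delete(&pag->pag_ici_root,
			  XFS_INO_TO_AGINO(ip->i_mount, ip->i_ino));
	spin_unlock(&pag->pag_ici_lock);

	/*
	 * 2. Poison the inode number under ->i_flags_lock so that a
	 *    concurrent lookup still traversing stale radix tree nodes
	 *    can detect the reuse.
	 */
	spin_lock(&ip->i_flags_lock);
	ip->i_ino = 0;
	spin_unlock(&ip->i_flags_lock);

	/*
	 * 3. Only now hand the inode to RCU.  The call_rcu() happens
	 *    unambiguously after the removal on this CPU, so the grace
	 *    period covers every reader that could still find the inode.
	 */
	call_rcu(&ip->i_rcu, xfs_inode_free_callback);
}

The point is the ordering of steps 1-3, not the CPU they run on:
options 2 and 3 move the call_rcu() to another CPU, but preserve the
"unambiguously after" property through the lock-based ordering.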
> There is this comment in include/linux/radix-tree.h:
>
>  * It is still required that the caller manage the synchronization and lifetimes
>  * of the items. So if RCU lock-free lookups are used, typically this would mean
>  * that the items have their own locks, or are amenable to lock-free access; and
>  * that the items are freed by RCU (or only freed after having been deleted from
>  * the radix tree *and* a synchronize_rcu() grace period).
>
> There is nothing there that mentions the items need to be freed on
> the same CPU as they were removed from the radix tree, or that the
> item lock needs to be held when the item is removed from the tree.
> AFAICT, the XFS code is following these guidelines.

Well, the grace period (from either synchronize_rcu() or call_rcu())
does need to start unambiguously after the deletion from the radix
tree.  Should we upgrade the comment?

> FWIW, this is where I got the idea of using synchronize_rcu() to
> ensure a lookup retry wouldn't see the same freed inode. I'm
> thinking that simply re-running the lookup will give the same
> guarantee because of the memory barriers in the radix tree lookup
> code...

Maybe...  But we do need to be sure, right?

> > 3.	Some CPU, possibly different than that in #1 and #2 above,
> > 	executes the above code.  If locking works correctly, it
> > 	must see the earlier changes, so a subsequent access should
> > 	see the results of #1 above, so that it won't see the element
> > 	that was removed there.
>
> Isn't the item validity considered separately from the tree traversal
> consistency? i.e. the item lock (i.e. ->i_flags_lock) provides a
> safe item validity check via the unlock->lock memory barrier, whilst
> the radix tree uses rcu_dereference() to provide memory barriers
> against the modifications?

But the safe validity check assumes that the RCU grace period starts
unambiguously after the item has been removed.  Violate that
assumption, and all bets are off.

> > That said, I don't claim to understand either vfs or xfs very well,
> > so I would be arbitrarily deeply confused.
>
> We might both be confused ;)

Sounds like the most likely possibility, now that you mention it.  ;-)
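Just to make sure we are arguing about the same picture, the
lookup-side pattern as I understand it looks roughly like the sketch
below -- assumed function and field names again, not the patch itself.
The comment in the retry path marks exactly where the
bare-retry-vs-synchronize_rcu() question lives.

static struct xfs_inode *
xfs_iget_cache_lookup(struct xfs_mount *mp, struct xfs_perag *pag,
		      xfs_ino_t ino)
{
	struct xfs_inode *ip;

again:
	rcu_read_lock();
	ip = radix_tree_lookup(&pag->pag_ici_root,
			       XFS_INO_TO_AGINO(mp, ino));
	if (!ip) {
		rcu_read_unlock();
		return NULL;		/* cache miss */
	}

	spin_lock(&ip->i_flags_lock);
	if (ip->i_ino != ino) {
		/*
		 * The inode was freed (i_ino poisoned to zero) or has
		 * been recycled for a different inode number while we
		 * walked stale radix tree nodes.  Is this bare retry
		 * sufficient, or is a synchronize_rcu() needed first?
		 */
		spin_unlock(&ip->i_flags_lock);
		rcu_read_unlock();
		goto again;
	}
	/* ... take a reference while i_ino is known to match ... */
	spin_unlock(&ip->i_flags_lock);
	rcu_read_unlock();
	return ip;
}

							Thanx, Paul

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs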