On Fri, Nov 15, 2019 at 09:16:02AM +1100, Dave Chinner wrote: > On Wed, Nov 06, 2019 at 05:18:46PM -0500, Brian Foster wrote: > > On Fri, Nov 01, 2019 at 10:46:18AM +1100, Dave Chinner wrote: > > > From: Dave Chinner <dchinner@xxxxxxxxxx> > > > > > > Looking up an unreferenced inode in the inode cache is a bit hairy. > > > We do this for inode invalidation and writeback clustering purposes, > > > which is all invisible to the VFS. Hence we can't take reference > > > counts to the inode and so must be very careful how we do it. > > > > > > There are several different places that all do the lookups and > > > checks slightly differently. Fundamentally, though, they are all > > > racy and inode reclaim has to block waiting for the inode lock if it > > > loses the race. This is not very optimal given all the work we;ve > > > already done to make reclaim non-blocking. > > > > > > We can make the reclaim process nonblocking with a couple of simple > > > changes. If we define the unreferenced lookup process in a way that > > > will either always grab an inode in a way that reclaim will notice > > > and skip, or will notice a reclaim has grabbed the inode so it can > > > skip the inode, then there is no need for reclaim to need to cycle > > > the inode ILOCK at all. > > > > > > Selecting an inode for reclaim is already non-blocking, so if the > > > ILOCK is held the inode will be skipped. If we ensure that reclaim > > > holds the ILOCK until the inode is freed, then we can do the same > > > thing in the unreferenced lookup to avoid inodes in reclaim. We can > > > do this simply by holding the ILOCK until the RCU grace period > > > expires and the inode freeing callback is run. As all unreferenced > > > lookups have to hold the rcu_read_lock(), we are guaranteed that > > > a reclaimed inode will be noticed as the trylock will fail. > > > > > ... > > > > > > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> > > > --- > > > fs/xfs/mrlock.h | 27 +++++++++ > > > fs/xfs/xfs_icache.c | 88 +++++++++++++++++++++-------- > > > fs/xfs/xfs_inode.c | 131 +++++++++++++++++++++----------------------- > > > 3 files changed, 153 insertions(+), 93 deletions(-) > > > > > > diff --git a/fs/xfs/mrlock.h b/fs/xfs/mrlock.h > > > index 79155eec341b..1752a2592bcc 100644 > > > --- a/fs/xfs/mrlock.h > > > +++ b/fs/xfs/mrlock.h > > ... > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > > index 11bf4768d491..45ee3b5cd873 100644 > > > --- a/fs/xfs/xfs_icache.c > > > +++ b/fs/xfs/xfs_icache.c > > > @@ -106,6 +106,7 @@ xfs_inode_free_callback( > > > ip->i_itemp = NULL; > > > } > > > > > > + mrunlock_excl_non_owner(&ip->i_lock); > > > kmem_zone_free(xfs_inode_zone, ip); > > > } > > > > > > @@ -132,6 +133,7 @@ xfs_inode_free( > > > * free state. The ip->i_flags_lock provides the barrier against lookup > > > * races. > > > */ > > > + mrupdate_non_owner(&ip->i_lock); > > > > Can we tie these into the proper locking interface using flags? For > > example, something like xfs_ilock(ip, XFS_ILOCK_EXCL|XFS_ILOCK_NONOWNER) > > or xfs_ilock(ip, XFS_ILOCK_EXCL_NONOWNER) perhaps? > > I'd prefer not to make this part of the common locking interface - > it's a one off special use case, not something we want to progate > elsewhere into the code. > What urks me about this is that it obfuscates rather than highlights that fact because I have no idea what mrtryupdate_non_owner() actually does without looking it up. We could easily name a flag XFS_ILOCK_PENDING_RECLAIM or something similarly ridiculous to make it blindingly obvious it should only be used in a special context. > Now that I think over it, I probably should have tagged this with > patch with [RFC]. I think we should just get rid of the mrlock > wrappers rather than add more, and that would simplify this a lot. > Yeah, FWIW I've been reviewing this patch as a WIP on top of all of the nonblocking bits as opposed to being some fundamental part of that work. That aside, I also agree that cleaning up these wrappers might address that concern because something like: /* open code ilock because ... */ down_write_trylock_non_owner(&ip->i_lock); ... is at least readable code. > > > > spin_lock(&ip->i_flags_lock); > > > ip->i_flags = XFS_IRECLAIM; > > > ip->i_ino = 0; > > > @@ -295,11 +297,24 @@ xfs_iget_cache_hit( > > > } > > > > > > /* > > > - * We need to set XFS_IRECLAIM to prevent xfs_reclaim_inode > > > - * from stomping over us while we recycle the inode. Remove it > > > - * from the LRU straight away so we can re-init the VFS inode. > > > + * Before we reinitialise the inode, we need to make sure > > > + * reclaim does not pull it out from underneath us. We already > > > + * hold the i_flags_lock, and because the XFS_IRECLAIM is not > > > + * set we know the inode is still on the LRU. However, the LRU > > > + * code may have just selected this inode to reclaim, so we need > > > + * to ensure we hold the i_flags_lock long enough for the > > > + * trylock in xfs_inode_reclaim_isolate() to fail. We do this by > > > + * removing the inode from the LRU, which will spin on the LRU > > > + * list locks until reclaim stops walking, at which point we > > > + * know there is no possible race between reclaim isolation and > > > + * this lookup. > > > + * > > > > Somewhat related to my question about the lru_lock on the earlier patch. > > *nod* > > The caveat here is that this is the slow path so spinning for a > while doesn't really matter. > > > > @@ -1022,19 +1076,7 @@ xfs_dispose_inode( > > > spin_unlock(&pag->pag_ici_lock); > > > xfs_perag_put(pag); > > > > > > - /* > > > - * Here we do an (almost) spurious inode lock in order to coordinate > > > - * with inode cache radix tree lookups. This is because the lookup > > > - * can reference the inodes in the cache without taking references. > > > - * > > > - * We make that OK here by ensuring that we wait until the inode is > > > - * unlocked after the lookup before we go ahead and free it. > > > - * > > > - * XXX: need to check this is still true. Not sure it is. > > > - */ > > > - xfs_ilock(ip, XFS_ILOCK_EXCL); > > > xfs_qm_dqdetach(ip); > > > - xfs_iunlock(ip, XFS_ILOCK_EXCL); > > > > Ok, so I'm staring at this a bit more and think I'm missing something. > > If we put aside the change to hold ilock until the inode is freed, we > > basically have the following (simplified) flow as the inode goes from > > isolation to disposal: > > > > ilock (isolate) > > iflock > > set XFS_IRECLAIM > > ifunlock (disposal) > > iunlock > > radix delete > > ilock cycle (drain) > > rcu free > > > > What we're trying to eliminate is the ilock cycle to drain any > > concurrent unreferenced lookups from accessing the inode once it is > > freed. The free itself is still RCU protected. > > > > Looking over at the ifree path, we now have something like this: > > > > rcu_read_lock() > > radix lookup > > check XFS_IRECLAIM > > ilock > > if XFS_ISTALE, skip > > set XFS_ISTALE > > rcu_read_unlock() > > iflock > > /* return locked down inode */ > > You missed a lock. > > rcu_read_lock() > radix lookup > >>> i_flags_lock > check XFS_IRECLAIM > ilock > if XFS_ISTALE, skip > set XFS_ISTALE > >>> i_flags_unlock > rcu_read_unlock() > iflock > > > Given that we set XFS_IRECLAIM under ilock, would we still need either > > the ilock cycle or to hold ilock through the RCU free if the ifree side > > (re)checked XFS_IRECLAIM after it has the ilock (but before it drops the > > rcu read lock)? > > We set XFS_IRECLAIM under the i_flags_lock. > > It is the combination of rcu_read_lock() and i_flags_lock() that > provides the RCU lookup state barriers - the ILOCK is not part of > that at all. > > The key point here is that once we've validated the inode we found > in the radix tree under the i_flags_lock, we then take the ILOCK, > thereby serialising the taking of the ILOCK here with the taking of > the ILOCK in the reclaim isolation code. > > i.e. all the reclaim state serialisation is actually based around > holding the i_flags_lock, not the ILOCK. > > Once we have grabbed the ILOCK under the i_flags_lock, we can > drop the i_flags_lock knowing that reclaim will not be able isolate > this inode and set XFS_IRECLAIM. > Hmm, Ok. I knew i_flags_lock was in there when I wrote this up. I intentionally left it out as a simplification because it wasn't clear to me that it was a critical part of the lookup. I'll keep this in mind the next time I walk through this code. > > ISTM we should either have a non-reclaim inode with > > ilock protection or a reclaim inode with RCU protection (so we can skip > > it before it frees), but I could easily be missing something here.. > > Heh. Yeah, it's a complex dance, and it's all based around how > RCU lookups and the i_flags_lock interact to provide coherent > detection of freed inodes. > > I have a nagging feeling that this whole ILOCK-held-to-rcu-free game > can be avoided. I need to walk myself through the lookup state > machine again and determine if ordering the XFS_IRECLAIM flag check > after greabbing the ILOCK is sufficient to prevent ifree/iflush > lookups from accessing the inode outside the rcu_read_lock() > context. > That's pretty much what I was wondering... > If so, most of this patch will go away.... > > > > + * attached to the buffer so we don't need to do anything more here. > > > */ > > > - if (ip != free_ip) { > > > - if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) { > > > - rcu_read_unlock(); > > > - delay(1); > > > - goto retry; > > > - } > > > - > > > - /* > > > - * Check the inode number again in case we're racing with > > > - * freeing in xfs_reclaim_inode(). See the comments in that > > > - * function for more information as to why the initial check is > > > - * not sufficient. > > > - */ > > > - if (ip->i_ino != inum) { > > > + if (__xfs_iflags_test(ip, XFS_ISTALE)) { > > > > Is there a correctness reason for why we move the stale check to under > > ilock (in both iflush/ifree)? > > It's under the i_flags_lock, and so I moved it up under the lookup > hold of the i_flags_lock so we don't need to cycle it again. > Yeah, but in both cases it looks like it moved to under the ilock as well, which comes after i_flags_lock. IOW, why grab ilock for stale inodes when we're just going to skip them? Brian > > > /* > > > - * We don't need to attach clean inodes or those only with unlogged > > > - * changes (which we throw away, anyway). > > > + * We don't need to attach clean inodes to the buffer - they are marked > > > + * stale in memory now and will need to be re-initialised by inode > > > + * allocation before they can be reused. > > > */ > > > if (!ip->i_itemp || xfs_inode_clean(ip)) { > > > ASSERT(ip != free_ip); > > > xfs_ifunlock(ip); > > > - xfs_iunlock(ip, XFS_ILOCK_EXCL); > > > + if (ip != free_ip) > > > + xfs_iunlock(ip, XFS_ILOCK_EXCL); > > > > There's an assert against this case just above, though I suppose there's > > nothing wrong with just keeping it and making the functional code more > > cautious. > > *nod* > > It follows Darrick's lead of making sure that production kernels > don't do something stupid because of some whacky corruption we > didn't expect to ever see. > > Cheers, > > Dave. > -- > Dave Chinner > david@xxxxxxxxxxxxx >