Re: [PATCH 28/30] xfs: rework xfs_iflush_cluster() dirty inode iteration

On Wed, Jun 10, 2020 at 08:01:39AM +1000, Dave Chinner wrote:
> On Tue, Jun 09, 2020 at 09:11:55AM -0400, Brian Foster wrote:
> > On Thu, Jun 04, 2020 at 05:46:04PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > > 
> > > Now that we have all the dirty inodes attached to the cluster
> > > buffer, we don't actually have to do radix tree lookups to find
> > > them. Sure, the radix tree is efficient, but walking a linked list
> > > of just the dirty inodes attached to the buffer is much better.
> > > 
> > > We are also no longer dependent on having a locked inode passed into
> > > the function to determine where to start the lookup. This means we
> > > can drop it from the function call and treat all inodes the same.
> > > 
> > > We also make xfs_iflush_cluster skip inodes marked with
> > > XFS_IRECLAIM. This way we avoid races with inodes that reclaim is
> > > actively referencing or are being re-initialised by inode lookup. If
> > > they are actually dirty, they'll get written by a future cluster
> > > flush....
> > > 
> > > We also add a shutdown check after obtaining the flush lock so that
> > > we catch inodes that are dirty in memory and may have inconsistent
> > > state due to the shutdown in progress. We abort these inodes
> > > directly, so they remove themselves from the buffer list and
> > > the AIL rather than having to wait for the buffer to be failed
> > > and its callbacks to run before they are processed correctly.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> > > Reviewed-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > > ---
> > >  fs/xfs/xfs_inode.c      | 148 ++++++++++++++++------------------------
> > >  fs/xfs/xfs_inode.h      |   2 +-
> > >  fs/xfs/xfs_inode_item.c |   2 +-
> > >  3 files changed, 62 insertions(+), 90 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > index 8566bd0f4334d..931a483d5b316 100644
> > > --- a/fs/xfs/xfs_inode.c
> > > +++ b/fs/xfs/xfs_inode.c
> > > @@ -3611,117 +3611,94 @@ xfs_iflush(
> > >   */
> > >  int
> > >  xfs_iflush_cluster(
> > > -	struct xfs_inode	*ip,
> > >  	struct xfs_buf		*bp)
> > >  {
> > ...
> > > +	/*
> > > +	 * We must use the safe variant here as on shutdown xfs_iflush_abort()
> > > +	 * can remove itself from the list.
> > > +	 */
> > > +	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
> > > +		iip = (struct xfs_inode_log_item *)lip;
> > > +		ip = iip->ili_inode;
> > >  
> > >  		/*
> > > -		 * because this is an RCU protected lookup, we could find a
> > > -		 * recently freed or even reallocated inode during the lookup.
> > > -		 * We need to check under the i_flags_lock for a valid inode
> > > -		 * here. Skip it if it is not valid or the wrong inode.
> > > +		 * Quick and dirty check to avoid locks if possible.
> > >  		 */
> > > -		spin_lock(&cip->i_flags_lock);
> > > -		if (!cip->i_ino ||
> > > -		    __xfs_iflags_test(cip, XFS_ISTALE)) {
> > > -			spin_unlock(&cip->i_flags_lock);
> > > +		if (__xfs_iflags_test(ip, XFS_IRECLAIM | XFS_IFLOCK))
> > > +			continue;
> > > +		if (xfs_ipincount(ip))
> > >  			continue;
> > > -		}
> > >  
> > >  		/*
> > > -		 * Once we fall off the end of the cluster, no point checking
> > > -		 * any more inodes in the list because they will also all be
> > > -		 * outside the cluster.
> > > +		 * The inode is still attached to the buffer, which means it is
> > > +		 * dirty but reclaim might try to grab it. Check carefully for
> > > +		 * that, and grab the ilock while still holding the i_flags_lock
> > > +		 * to guarantee reclaim will not be able to reclaim this inode
> > > +		 * once we drop the i_flags_lock.
> > >  		 */
> > > -		if ((XFS_INO_TO_AGINO(mp, cip->i_ino) & mask) != first_index) {
> > > -			spin_unlock(&cip->i_flags_lock);
> > > -			break;
> > > +		spin_lock(&ip->i_flags_lock);
> > > +		ASSERT(!__xfs_iflags_test(ip, XFS_ISTALE));
> > 
> > What prevents a race with staling an inode here?
> 
> xfs_buf_item_release() does not drop the buffer lock when stale
> buffers are committed. Hence the buffer is held locked until it is
> committed to the journal, and unpinning the buffer on journal IO
> completion runs xfs_iflush_done() -> xfs_iflush_abort() on all the
> stale attached inodes. At which point, they are removed from the
> buffer and the buffer is unlocked..
> 
> Hence we cannot run here with stale inodes attached to the buffer
> because the buffer is locked the entire time stale inodes are
> attached.
> 

Ok, so the earlier ISTALE check on the inode is really more of an
optimization, since if we saw the flag set we'd never have been able
to lock the buffer in the first place.

> > The push open codes an
> > unlocked (i.e. no memory barrier) check before it acquires the buffer
> > lock, so afaict it's technically possible to race and grab the buffer
> > immediately after the cluster was freed. If that could happen, it looks
> > like we'd also queue the buffer for write.
> 
> Not that I can tell, because we'll fail to get the buffer lock under
> the AIL lock until the stale buffer is unpinned, but the unpin
> occurs under the buffer lock and that removes the buffer from the
> AIL. Hence there's no way the AIL pushing can actually push an inode
> cluster buffer that has stale inodes attached to it...
> 

Makes sense.

> > That also raises the question.. between this patch and the earlier push
> > rework, why do we queue the buffer for write when nothing might have
> > been flushed by this call?
> 
> Largely because I never observed a failure to flush at least one
> inode so I didn't really consider it worth the additional complexity
> in error handling in the push code. I've really been concerned about
> the other end of the scale where we try to push the same buffer 30+
> times for each IO.
> 
> You are right in that it could happen - perhaps just returning EAGAIN
> if clcount == 0 at the end of the function is all that is necessary
> and translating that to XFS_ITEM_LOCKED in the push function would
> work.
> 

That sounds reasonable enough to cover that case.
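
Just to make sure I'm reading that right, I'd imagine something along
these lines (completely untested sketch; I'm assuming the clcount and
error locals at the end of xfs_iflush_cluster() and the buffer trylock
pattern in xfs_inode_item_push() from the earlier patches in the
series):

At the tail of xfs_iflush_cluster():

	/*
	 * Nothing flushed? Tell the caller to back off instead of
	 * queueing an unmodified buffer for write.
	 */
	if (!error && !clcount)
		return -EAGAIN;

... and in xfs_inode_item_push():

	error = xfs_iflush_cluster(bp);
	if (error == -EAGAIN) {
		/* nothing flushed, treat it like a lock failure */
		xfs_buf_relse(bp);
		rval = XFS_ITEM_LOCKED;
	} else if (!error) {
		if (!xfs_buf_delwri_queue(bp, buffer_list))
			rval = XFS_ITEM_FLUSHING;
		xfs_buf_relse(bp);
	}
	/* other errors handled as the push code already does */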

> > > -		 * check is not sufficient.
> > > +		 * If we are shut down, unpin and abort the inode now as there
> > > +		 * is no point in flushing it to the buffer just to get an IO
> > > +		 * completion to abort the buffer and remove it from the AIL.
> > >  		 */
> > > -		if (!cip->i_ino) {
> > > -			xfs_ifunlock(cip);
> > > -			xfs_iunlock(cip, XFS_ILOCK_SHARED);
> > > +		if (XFS_FORCED_SHUTDOWN(mp)) {
> > > +			xfs_iunpin_wait(ip);
> > 
> > Note that we have an unlocked check above that skips pinned inodes.
> 
> Right, but we could be racing with a transaction commit that pinned
> the inode and a shutdown. As the comment says: it's a quick and
> dirty check to avoid trying to get locks when we know that it is
> unlikely we can flush the inode without blocking. We still have to
> recheck that state once we have the ILOCK....
> 

Right, but that means we can just as easily skip the shutdown processing
(which waits for unpin) if a particular inode is pinned. So which is
supposed to happen in the shutdown case?

ISTM that either could happen. As a result this kind of looks like
random logic to me. IIUC we rely on potentially coming back through the
cluster flush path multiple times if inodes are pinned because of the
racy pin/shutdown checks. If that's the case, then why not push the
shutdown check to after the locked pin count check, skip the
xfs_iunpin_wait() call entirely and then proceed to abort the flush at
the same point we'd call xfs_iflush() if the fs weren't shutdown?
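
I.e., something along these lines (rough sketch, not tested; reusing
the helpers the patch already calls in its shutdown branch, just
without the xfs_iunpin_wait(), and glossing over the i_flags_lock /
ILOCK ordering):

		/* ... i_flags_lock/ILOCK/flush lock trylocks as in the
		 * patch ... */

		/* don't block on a log force to unpin a dirty inode */
		if (xfs_ipincount(ip)) {
			xfs_ifunlock(ip);
			xfs_iunlock(ip, XFS_ILOCK_SHARED);
			continue;
		}

		/*
		 * Shutdown check at the point we'd otherwise call
		 * xfs_iflush(). The inode is already unpinned, so no
		 * xfs_iunpin_wait() is needed and xfs_iflush_abort()
		 * takes it off the buffer list and out of the AIL.
		 */
		if (XFS_FORCED_SHUTDOWN(mp)) {
			xfs_iflush_abort(ip);
			xfs_iunlock(ip, XFS_ILOCK_SHARED);
			/* error set as in the existing shutdown branch */
			continue;
		}

		/* ... otherwise flush as normal via xfs_iflush() ... */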

FWIW, I'm also kind of wondering if we need this shutdown check at all
now that this path doesn't have to accommodate reclaim. We follow up
with failing the buffer in the exit path, but there's no guarantee there
aren't other inodes that are passed over due to the pin checks,
trylocks, etc. Can we just flush the inodes as normal and let xfsaild
fail the buffers on submission without additional shutdown checks?

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx
> 



