Re: [PATCH v2] xfs: fix unbalanced inode reclaim flush locking

Brian Foster <bfoster@xxxxxxxxxx> · Tue, 18 Oct 2016 11:12:08 -0400

On Tue, Oct 18, 2016 at 10:13:01AM -0400, Brian Foster wrote:
> On Tue, Oct 18, 2016 at 10:07:56AM +1100, Dave Chinner wrote:
> > On Mon, Oct 17, 2016 at 02:07:53PM -0400, Brian Foster wrote:
> > > Filesystem shutdown testing on an older distro kernel has uncovered an
> > > imbalanced locking pattern for the inode flush lock in
> > > xfs_reclaim_inode(). Specifically, there is a double unlock sequence
> > > between the call to xfs_iflush_abort() and xfs_reclaim_inode() at the
> > > "reclaim:" label.
> > > 
> > > This actually does not cause obvious problems on current kernels due to
> > > the current flush lock implementation. Older kernels use a counting
> > > based flush lock mechanism, however, which effectively breaks the lock
> > > indefinitely when an already unlocked flush lock is repeatedly unlocked.
> > > Though this only currently occurs on filesystem shutdown, it has
> > > reproduced the effect of elevating an fs shutdown to a system-wide crash
> > > or hang.
> > > 
> > > Because this problem exists on filesystem shutdown and thus only after
> > > unrelated catastrophic failure, issue the simple fix to reacquire the
> > > flush lock in xfs_reclaim_inode() before jumping to the reclaim code.
> > > Add an assert to xfs_ifunlock() to help prevent future occurrences of
> > > the same problem. Finally, update a couple places to bitwise-OR the
> > > reclaim flag to avoid smashing the flush lock in the process (which is
> > > based on an inode flag in current kernels). This avoids a (spurious)
> > > failure of the newly introduced xfs_ifunlock() assertion.
> > > 
> > > Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> > > Reported-by: Zorro Lang <zlang@xxxxxxxxxx>
> > > ---
> > > 
> > > v2:
> > > - Add comment in xfs_reclaim_inode() wrt to flush lock.
> > > - Fix XFS_IRECLAIM usage in xfs_inode_free().
> > > 
> > >  fs/xfs/xfs_icache.c |  9 +++++++--
> > >  fs/xfs/xfs_inode.h  | 11 ++++++-----
> > >  2 files changed, 13 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > index 14796b7..2317b74 100644
> > > --- a/fs/xfs/xfs_icache.c
> > > +++ b/fs/xfs/xfs_icache.c
> > > @@ -140,7 +140,7 @@ xfs_inode_free(
> > >  	 * races.
> > >  	 */
> > >  	spin_lock(&ip->i_flags_lock);
> > > -	ip->i_flags = XFS_IRECLAIM;
> > > +	ip->i_flags |= XFS_IRECLAIM;
> > >  	ip->i_ino = 0;
> > >  	spin_unlock(&ip->i_flags_lock);
> > >  
> > > @@ -981,7 +981,12 @@ restart:
> > >  
> > >  	if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
> > >  		xfs_iunpin_wait(ip);
> > > +		/*
> > > +		 * xfs_iflush_abort() drops the flush lock. Reacquire it as the
> > > +		 * reclaim code expects to drop the flush lock.
> > > +		 */
> > >  		xfs_iflush_abort(ip, false);
> > > +		xfs_iflock(ip);
> > >  		goto reclaim;
> > >  	}
> > >  	if (xfs_ipincount(ip)) {
> > > @@ -1044,7 +1049,7 @@ reclaim:
> > >  	 * skip.
> > >  	 */
> > >  	spin_lock(&ip->i_flags_lock);
> > > -	ip->i_flags = XFS_IRECLAIM;
> > > +	ip->i_flags |= XFS_IRECLAIM;
> > >  	ip->i_ino = 0;
> > >  	spin_unlock(&ip->i_flags_lock);
> > 
> > I'd prefer that we don't change this - an inode that has had it's
> > inode number cleared should also only have XFS_IRECLAIM set in it's
> > flags.
> > 
> 
> Ok, but IMO that means we should be doing something like this:
> 
> 	ASSERT(ip->i_flags == 0);
> 	ip->i_flags = XFS_IRECLAIM;
> 

Testing uncovered pretty quickly that I misread (i.e., reclaim should be
the only bit set after this point) and this assertion is bogus. In fact,
it looks like this could technically be something like the following:

	ASSERT(ip->i_flags & XFS_IRECLAIM);
	ip->i_flags &= XFS_IRECLAIM;

... because XFS_IRECLAIM is actually already set by
xfs_reclaim_inode_grab() for any inode that comes through here. But I'll
probably just tweak the assert to check !xfs_isiflocked(ip).

Brian

> > FWIW, there is nothing that should be waiting on the flush lock by
> > this point - we hold the inode locked exclusively, there are not
> > other references to the inode, and we've got a clean inode. As long
> > as we've cycled the flush lock inside the current XFS_ILOCK_EXCL
> > hold, then we can guarantee the inode is clean and nothing is
> > waiting on the flush lock as you have to hold the ILOCK before
> > grabbing the flush lock.
> > 
> > Hence it doesn't matter if we hold or don't hold the flush lock
> > at the point where we set ip->i_flags = XFS_IRECLAIM and ip->i_ino =
> > 0 - clearing the bit drops the flush lock if we hold it, and
> > clearing ip->i_ino means that any lookups inside the RCU grace
> > period (either from xfs_iflush_cluster or cache lookups) means
> > they'll drop the inode without doing anything with it.
> > 
> > Hence it think we can simple do something like:
> > 
> 
> Hmm, Ok. It does seem reasonable that we don't need the flush lock
> across the reclaim bits.
> 
> >  	}
> >  
> > -	xfs_iflock(ip);
> >  reclaim:
> >  	/*
> >  	 * Because we use RCU freeing we need to ensure the inode always appears
> >  	 * to be reclaimed with an invalid inode number when in the free state.
> >  	 * We do this as early as possible under the ILOCK and flush lock so
> >  	 * that xfs_iflush_cluster() can be guaranteed to detect races with us
> >  	 * here. By doing this, we guarantee that once xfs_iflush_cluster has
> >  	 * locked both the XFS_ILOCK and the flush lock that it will see either
> >  	 * a valid, flushable inode that will serialise correctly against the
> >  	 * locks below, or it will see a clean (and invalid) inode that it can
> >  	 * skip.
> > +	 *
> 
> In that case, the above needs to be tweaked to drop the flush lock bits
> as well.
> 
> > +	 * If we are currently holding the flush lock, then nothing else can
> > +	 * be waiting on it as we hold the XFS_ILOCK_EXCL. Resetting the i_flags
> > +	 * here effectively unlocks the flush lock if we currently hold it,
> > +	 * but it's a no-op if we don't hold it. This means we don't have to
> > +	 * add a lock cycle to the paths that drop the flush lock due to
> > +	 * IO completion or abort here.
> >  	 */
> 
> I really don't like the lazy overload of the i_flags assignment to
> "maybe" unlock the flush lock. As proven by this patch, this has the
> potential to hide bugs based on flag usage (whether associated with
> flush locking or some unforeseen flags added down the road). It also
> sets a landmine should we decide to change the flush lock mechanism
> (which we've obviously done at least once before) to something that
> doesn't rely on i_flags.
> 
> IOW, I'm simply saying that if the flush lock is not required here, then
> there is really no correctness requirement to "potentially" release it
> atomically (with respect to i_flags_lock) with the XFS_IRECLAIM
> assignment. Instead, I'd prefer to avoid the trickiness and just drop it
> in the one or two places earlier in the function that currently jump to
> 'reclaim:' with the flush lock held. 
> 
> >  	spin_lock(&ip->i_flags_lock);
> >  	ip->i_flags = XFS_IRECLAIM;
> >  	ip->i_ino = 0;
> >  	spin_unlock(&ip->i_flags_lock);
> >  
> > -	xfs_ifunlock(ip);
> >  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > 
> > The rest of the code (adding the assert to xfs_ifunlock()) looks
> > fine, but this code here is quite special because of the way it
> > controls behaviour of the inode lookups within the RCU grace
> > period...
> > 
> 
> I'm assuming you'd prefer to leave the XFS_IRECLAIM assignment in
> xfs_inode_free() as well (in which case, I'll move the xfs_isiflocked()
> assert to before the assignment and probably add a similar i_flags == 0
> assert)..?
> 
> Brian
> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@xxxxxxxxxxxxx
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html