On Wed, Jun 02, 2021 at 12:31:54PM -0400, Brian Foster wrote:
> On Wed, Jun 02, 2021 at 09:02:38AM -0700, Darrick J. Wong wrote:
> > On Wed, Jun 02, 2021 at 09:32:34AM -0400, Brian Foster wrote:
> > > On Thu, May 06, 2021 at 03:29:50PM -0400, Brian Foster wrote:
> > > > On Thu, May 06, 2021 at 12:56:11PM +1000, Dave Chinner wrote:
> > > > > On Wed, May 05, 2021 at 07:50:17AM -0400, Brian Foster wrote:
> > > > > > On Tue, May 04, 2021 at 09:25:39AM +1000, Dave Chinner wrote:
> > > > > > > On Mon, May 03, 2021 at 08:18:16AM -0400, Brian Foster wrote:
> > > > > > > i.e. the problem here is that we've dropped the bip->bli_refcount
> > > > > > > before we've locked the buffer and taken a reference to it for
> > > > > > > the fail path?
> > > > > > >
> > > > > > > OK, I see that xfs_buf_item_done() (called from ioend processing)
> > > > > > > simply frees the buf log item and doesn't care about the bli
> > > > > > > refcount at all. So the first ioend caller will free the buf log
> > > > > > > item regardless of whether there are other references to it at all.
> > > > > > >
> > > > > > > IOWs, once we unpin the buffer, the bli attached to the buffer and
> > > > > > > being tracked in the AIL has -zero- references to the bli and so it
> > > > > > > gets freed unconditionally on IO completion.
> > > > > > >
> > > > > > > That seems to be the problem here - the bli is not reference
> > > > > > > counted while it is in the AIL....
> > > > > > >
> > > > > >
> > > > > > I think it depends on how you look at it. As you point out, we've had
> > > > > > this odd bli reference count pattern for as long as I can remember where
> > > > >
> > > > > Yes, and it's been a constant source of use-after-free bugs in
> > > > > shutdown processing for as long as I can remember. I want to fix it
> > > > > so we don't have to keep band-aiding this code every time we change
> > > > > how some part of log item or stale inode/buffer processing works...
> > > > >
> > > >
> > > > IME, most of the bugs in this area tend to be shutdown/error related and
> > > > generally relate to the complexity of all the various states and
> > > > contexts for the different callbacks. That isn't a direct result
> > > > of this bli refcount behavior, as odd as it is, though that certainly
> > > > contributes to the overall complexity.
> > > >
> > > > >
> > > > > > >
> > > > > > > -	freed = atomic_dec_and_test(&bip->bli_refcount);
> > > > > > > -
> > > > > > > +	/*
> > > > > > > +	 * We can wake pin waiters safely now because we still hold the
> > > > > > > +	 * bli_refcount that was taken when the pin was gained.
> > > > > > > +	 */
> > > > > > >  	if (atomic_dec_and_test(&bp->b_pin_count))
> > > > > > >  		wake_up_all(&bp->b_waiters);
> > > > > > >
> > > > > > > -	if (freed && stale) {
> > > > > > > -		ASSERT(bip->bli_flags & XFS_BLI_STALE);
> > > > > > > -		ASSERT(xfs_buf_islocked(bp));
> > > > > > > -		ASSERT(bp->b_flags & XBF_STALE);
> > > > > > > -		ASSERT(bip->__bli_format.blf_flags & XFS_BLF_CANCEL);
> > > > > > > -
> > > > > > > -		trace_xfs_buf_item_unpin_stale(bip);
> > > > > > > -
> > > > > > > -		if (remove) {
> > > > > > > -			/*
> > > > > > > -			 * If we are in a transaction context, we have to
> > > > > > > -			 * remove the log item from the transaction as we are
> > > > > > > -			 * about to release our reference to the buffer. If we
> > > > > > > -			 * don't, the unlock that occurs later in
> > > > > > > -			 * xfs_trans_uncommit() will try to reference the
> > > > > > > -			 * buffer which we no longer have a hold on.
> > > > > > > -			 */
> > > > > > > -			if (!list_empty(&lip->li_trans))
> > > > > > > -				xfs_trans_del_item(lip);
> > > > > > > -
> > > > > > > -			/*
> > > > > > > -			 * Since the transaction no longer refers to the buffer,
> > > > > > > -			 * the buffer should no longer refer to the transaction.
> > > > > > > -			 */
> > > > > > > -			bp->b_transp = NULL;
> > > > > > > +	if (!stale) {
> > > > > > > +		if (!remove) {
> > > > > > > +			/* Nothing to do but drop the refcount the pin owned. */
> > > > > > > +			atomic_dec(&bip->bli_refcount);
> > > > > > > +			return;
> > > > > > >  		}
> > > > > >
> > > > > > Hmm.. this seems a bit wonky to me. This code historically acts on the
> > > > > > drop of the final reference to the bli.
> > > > >
> > > > > Yes, and that's the problem that needs fixing. The AIL needs a
> > > > > reference so that we aren't racing with writeback from the AIL to
> > > > > free the object because both sets of code run without actually
> > > > > holding an active reference to the BLI...
> > > > >
> > > > > > This is not critical for the
> > > > > > common (!stale && !remove) case because that's basically a no-op here
> > > > > > outside of dropping the reference, and it looks like the stale buffer
> > > > > > handling code further down continues to follow that model, but in this
> > > > > > branch it seems we're trying to be clever in how the reference is
> > > > > > managed and as a result can act on a bli that might actually have
> > > > > > additional references.
> > > > > >
> > > > > Who cares? If something else has active references, then we must not
> > > > > free the bli or buffer here, anyway. The lack of active references
> > > > > by active BLI users is why we keep getting use-after-free bugs in
> > > > > this code.....
> > > > >
> > > >
> > > > Well, this impacts more than just whether we free the buffer or not in
> > > > the abort case. This potentially runs the I/O failure sequence on a
> > > > pinned buffer, or blocks log I/O completion on a buffer lock that might
> > > > be held by a transaction.
> > > >
> > > > I don't know if these are immediate problems or not and this is all
> > > > abort/shutdown related, but re: my point above around shutdown issues,
> > > > I'd prefer to try and avoid these kinds of oddball quirks if we can vs.
> > > > just replace the historical quirks with new ones.
> > > >
> > > > > > If so, I don't think it's appropriate to run
> > > > > > through the error sequence that follows.
> > > > > > >
> > > > > > >  	/*
> > > > > > > -	 * If we get called here because of an IO error, we may or may
> > > > > > > -	 * not have the item on the AIL. xfs_trans_ail_delete() will
> > > > > > > -	 * take care of that situation. xfs_trans_ail_delete() drops
> > > > > > > -	 * the AIL lock.
> > > > > > > -	 */
> > > > > > > -	if (bip->bli_flags & XFS_BLI_STALE_INODE) {
> > > > > > > -		xfs_buf_item_done(bp);
> > > > > > > -		xfs_buf_inode_iodone(bp);
> > > > > > > -		ASSERT(list_empty(&bp->b_li_list));
> > > > > > > -	} else {
> > > > > > > -		xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR);
> > > > > > > -		xfs_buf_item_relse(bp);
> > > > > > > -		ASSERT(bp->b_log_item == NULL);
> > > > > > > -	}
> > > > > > > -	xfs_buf_relse(bp);
> > > > > > > -	} else if (freed && remove) {
> > > > > > > -		/*
> > > > > > > +	 * Fail the IO before we drop the bli refcount. This guarantees
> > > > > > > +	 * that a racing writeback completion also failing the buffer
> > > > > > > +	 * and running completion will not remove the last reference to
> > > > > > > +	 * the bli and free it from under us.
> > > > > > > +	 *
> > > > > > >  	 * The buffer must be locked and held by the caller to simulate
> > > > > > >  	 * an async I/O failure.
> > > > > > >  	 */
> > > > > > > @@ -559,7 +555,62 @@ xfs_buf_item_unpin(
> > > > > > >  	xfs_buf_hold(bp);
> > > > > > >  	bp->b_flags |= XBF_ASYNC;
> > > > > > >  	xfs_buf_ioend_fail(bp);
> > > > > > > +	xfs_buf_item_relse(bp);
> > > > > >
> > > > > > Did you mean for this to be xfs_buf_item_put() instead of _relse()? The
> > > > >
> > > > > Yes. I did say "untested", which implies the patch isn't complete or
> > > > > working. It's just a demonstration of how this reference counting might
> > > > > be done, not a complete, working solution. Patches are a much faster
> > > > > way of explaining the overall solution than plain text...
> > > > >
> > > >
> > > > Sure, I'm just trying to clarify intent. It's a refcount patch so
> > > > whether we drop a refcount or explicitly free an object is particularly
> > > > relevant. ;)
> > > >
> > > > > > >  	}
> > > > > > > +
> > > > > > > +	/*
> > > > > > > +	 * Stale buffer - only process it if this is the last reference to the
> > > > > > > +	 * BLI. If this is the last BLI reference, then the buffer will be
> > > > > > > +	 * locked and have two references - one from the transaction commit and
> > > > > > > +	 * one from the BLI - and we do not unlock and release the transaction
> > > > > > > +	 * reference until we've finished cleaning up the BLI.
> > > > > > > +	 */
> > > > > > > +	if (!atomic_dec_and_test(&bip->bli_refcount))
> > > > > > > +		return;
> > > > > > > +
> > > > > >
> > > > > > If the buffer is stale, will this ever be the last reference now that
> > > > > > _item_committed() bumps the refcount?
> > > > >
> > > > > If the AIL has a reference, then no.
> > > > >
> > > >
> > > > Ok, but how would we get here without an AIL reference? That seems
> > > > impossible to me based on your patch.
> > > >
> > > > > > This change also seems to have
> > > > > > ramifications for the code that follows, such as if a staled buffer is
> > > > > > already in the AIL (with a bli ref), would this code ever get to the
> > > > > > point of removing it?
> > > > >
> > > > > That's easy enough to handle - if the buffer is stale and we are in
> > > > > this code, we hold the buffer locked. Hence we can remove it from the
> > > > > AIL if it is in the AIL and drop that reference, too. Indeed, this
> > > > > code already does the AIL removal, so all this requires is a simple
> > > > > rearrangement of the logic in this function....
> > > > >
> > > >
> > > > Hmm, I don't think that's a correct assertion. If the buffer is stale
> > > > and we're in this code and we have the last reference to the bli, then
> > > > we know we hold the buffer locked. Otherwise, ISTM we can get here while
> > > > the transaction that staled the buffer might still own the lock.
> > > >
> > > > > The only difference is that we have to do this before we drop the
> > > > > current reference we have on the BLI, which is not what we do now
> > > > > and that's where all the problems lie.
> > > > >
> > > > > > All in all, I'll reiterate that I think it would be nice to fix up the
> > > > > > bli reference count handling in general, but I think the scope and
> > > > > > complexity of that work is significantly beyond what is reasonably
> > > > > > necessary to fix this bug.
> > > > >
> > > > > And so leaving the underlying problem in the code for the next set
> > > > > of changes someone does to trigger the problem in a different way.
> > > > > We've identified what the root cause is, so can we please spend
> > > > > the time to fix it properly?
> > > > >
> > > >
> > > > That is not what I'm saying at all. I'm saying that I'd prefer to fix
> > > > the bug first and then step back and evaluate the overall refcount
> > > > design independently because the latter is quite complex and there are
> > > > all kinds of subtle interactions that the RFC patch just glazes over (by
> > > > design). For example, how log recovery processes blis slightly
> > > > differently looks like yet another impedance mismatch from when the fs
> > > > is fully active.
> > > >
> > >
> > > Just an update on this particular bit...
> > >
> > > I've probably spent about a week and a half now working through some
> > > attempts to rework the bli refcount handling (going back to the drawing
> > > board at least a couple of times) and while I can get to something mostly
> > > functional (surviving an fstests run), the last steps to test/prove a
> > > solid/reliable implementation end up falling short. On extended testing
> > > I end up sorting through the same kind of shutdown/unmount hangs,
> > > use-after-frees, and extremely subtle interactions that I've spent many
> > > hours trying to stamp out over the past several years.
> > >
> > > Because of that, I don't think this is something worth pushing for in
> > > the short term. While the proposed idea sounds nice in theory, my
> > > takeaway in practice is that the current design is in place for a reason
> > > (i.e. the refcount historically looks like a transaction reference count
> > > used to provide unlocked/isolated access through the log subsystem, as
> > > opposed to a traditional memory object lifecycle reference count) and
> > > has quite a lot of incremental stability that has been baked in over
> > > time.
> > >
> > > I do still think this is (or should be :P) ultimately fixable, but I'm
> > > wondering if we're probably better served by exploring some of the
> > > historical warts and subtle rules/dependencies/hurdles of the current
> > > model to see if they can be removed or simplified to the point where the
> > > reference counting incrementally and naturally becomes a bit more
> > > straightforward. I probably need to reset my brain and think a little
> > > more about that...
> >
> > What do you want to do in the meantime?
> >
> > How about: elevate this patch from RFC to regular patch status (perhaps
> > with an XXX comment outlining the gap as a breadcrumb to remind future
> > us?) and merge that so that we at least fix the immediate UAF problem?
> >
>
> Well, I had already sent a v1 [1] of the original UAF fix independent
> of this work because I didn't want to bury the bug fix behind a
> broader rework. I just wanted to follow up here because I've been
> looking into the latter and this is where most of the discussion around
> the design rework took place.
>
> That said, I am currently looking at (yet another) alternative approach
> that might be more in the direction of broadly cleaning things up vs.
> stamping out an isolated problem. I just don't know how it will turn out
> quite yet, so I could go either way between waiting on that and possibly
> sending a v2, or going with v1 for now (assuming it's reviewed) and
> basing subsequent cleanups on that. My instinct is that leaving a
> record of an isolated fix is probably the proper thing to do regardless,
> but I don't feel that strongly about it.

Aha, you /did/ send a v1. I thought I had seen something like that.
Oh, I reviewed it too. Hm. Do you have any changes after v1.1, or do
you want me to pull that in as it is now?

--D

>
> Brian
>
> [1] https://lore.kernel.org/linux-xfs/20210511135257.878743-1-bfoster@xxxxxxxxxx/
>
> > I suspected that figuring out all the subtleties of the bli lifetimes
> > would be an intense effort.
> >
> > --D
> >
> > > >
> > > > Brian
> > > >
> > > > So I propose to rework my patch a bit into something that reflects the
> > > > intended direction (see below for an untested diff) and proceed from
> > > > there...
> > > >
> > > > Brian
> > > >
> > > > --- 8< ---
> > > >
> > > > diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> > > > index fb69879e4b2b..7ff31788512b 100644
> > > > --- a/fs/xfs/xfs_buf_item.c
> > > > +++ b/fs/xfs/xfs_buf_item.c
> > > > @@ -475,17 +475,8 @@ xfs_buf_item_pin(
> > > >  }
> > > >
> > > >  /*
> > > > - * This is called to unpin the buffer associated with the buf log
> > > > - * item which was previously pinned with a call to xfs_buf_item_pin().
> > > > - *
> > > > - * Also drop the reference to the buf item for the current transaction.
> > > > - * If the XFS_BLI_STALE flag is set and we are the last reference,
> > > > - * then free up the buf log item and unlock the buffer.
> > > > - *
> > > > - * If the remove flag is set we are called from uncommit in the
> > > > - * forced-shutdown path. If that is true and the reference count on
> > > > - * the log item is going to drop to zero we need to free the item's
> > > > - * descriptor in the transaction.
> > > > + * This is called to unpin the buffer associated with the buf log item which
> > > > + * was previously pinned with a call to xfs_buf_item_pin().
> > > >   */
> > > >  STATIC void
> > > >  xfs_buf_item_unpin(
> > > > @@ -502,12 +493,26 @@ xfs_buf_item_unpin(
> > > >
> > > >  	trace_xfs_buf_item_unpin(bip);
> > > >
> > > > +	/*
> > > > +	 * Drop the bli ref associated with the pin and grab the hold required
> > > > +	 * for the I/O simulation failure in the abort case. We have to do this
> > > > +	 * before the pin count drops because the AIL doesn't acquire a bli
> > > > +	 * reference. Therefore if the refcount drops to zero, the bli could
> > > > +	 * still be AIL resident and the buffer submitted for I/O (and freed on
> > > > +	 * completion) at any point before we return. This can be removed once
> > > > +	 * the AIL properly holds a reference on the bli.
> > > > +	 */
> > > >  	freed = atomic_dec_and_test(&bip->bli_refcount);
> > > > -
> > > > +	if (freed && !stale && remove)
> > > > +		xfs_buf_hold(bp);
> > > >  	if (atomic_dec_and_test(&bp->b_pin_count))
> > > >  		wake_up_all(&bp->b_waiters);
> > > >
> > > > -	if (freed && stale) {
> > > > +	/* nothing to do but drop the pin count if the bli is active */
> > > > +	if (!freed)
> > > > +		return;
> > > > +
> > > > +	if (stale) {
> > > >  		ASSERT(bip->bli_flags & XFS_BLI_STALE);
> > > >  		ASSERT(xfs_buf_islocked(bp));
> > > >  		ASSERT(bp->b_flags & XBF_STALE);
> > > > @@ -550,13 +555,13 @@ xfs_buf_item_unpin(
> > > >  		ASSERT(bp->b_log_item == NULL);
> > > >  	}
> > > >  	xfs_buf_relse(bp);
> > > > -	} else if (freed && remove) {
> > > > +	} else if (remove) {
> > > >  		/*
> > > >  		 * The buffer must be locked and held by the caller to simulate
> > > > -		 * an async I/O failure.
> > > > +		 * an async I/O failure. We acquired the hold for this case
> > > > +		 * before the buffer was unpinned.
> > > >  		 */
> > > >  		xfs_buf_lock(bp);
> > > > -		xfs_buf_hold(bp);
> > > >  		bp->b_flags |= XBF_ASYNC;
> > > >  		xfs_buf_ioend_fail(bp);
> > > >  	}
> > > >