On Thu, May 07, 2020 at 08:09:56AM -0700, Darrick J. Wong wrote:
> On Thu, May 07, 2020 at 05:53:06AM -0400, Brian Foster wrote:
> > On Wed, May 06, 2020 at 10:01:50AM -0700, Darrick J. Wong wrote:
> > > On Wed, May 06, 2020 at 09:56:56AM -0400, Brian Foster wrote:
> > > > On Tue, May 05, 2020 at 05:34:23PM -0700, Darrick J. Wong wrote:
> > > > > On Tue, May 05, 2020 at 10:11:38AM -0400, Brian Foster wrote:
> > > > > > On Mon, May 04, 2020 at 06:13:53PM -0700, Darrick J. Wong wrote:
> > > > > > > From: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > > > > > >
> > > > > > > In xfs_bui_item_recover, there exists a use-after-free bug with regard
> > > > > > > to the inode that is involved in the bmap replay operation.  If the
> > > > > > > mapping operation does not complete, we call xfs_bmap_unmap_extent to
> > > > > > > create a deferred op to finish the unmapping work, and we retain a
> > > > > > > pointer to the incore inode.
> > > > > > >
> > > > > > > Unfortunately, the very next thing we do is commit the transaction and
> > > > > > > drop the inode.  If reclaim tears down the inode before we try to
> > > > > > > finish the defer ops, we dereference garbage and blow up.  Therefore,
> > > > > > > create a way to join inodes to the defer ops freezer so that we can
> > > > > > > maintain the xfs_inode reference until we're done with the inode.
> > > > > > >
> > > > > > > Note: This imposes the requirement that there be enough memory to keep
> > > > > > > every incore inode in memory throughout recovery.
> > > > > > >
> > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > > > > > > ---
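
(To restate the bug in code: before this patch, the tail of
xfs_bui_item_recover() went roughly like this, per the removal hunk
further down --

        error = xlog_recover_trans_commit(tp, dffp);    /* dfops frozen into *dffp */
        xfs_iunlock(ip, XFS_ILOCK_EXCL);
        xfs_irele(ip);                  /* may drop the last reference */
        return error;

-- while the frozen dfops, e.g. a requeued unmap from a partial bunmapi,
still hold the bare ip pointer.  If reclaim tears the inode down first,
finishing those dfops dereferences freed memory.)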

> > > > > > Maybe I'm missing something, but I thought the discussion on the
> > > > > > previous version[1] landed on an approach where the intent would hold a
> > > > > > reference to the inode.  Wouldn't that break the dependency on the dfops
> > > > > > freeze/thaw mechanism?
> > > > >
> > > > > We did, but as I started pondering how to make this work in practice,
> > > > > I soon realized that it's not as simple as "don't irele the inode":
> > > > >
> > > > > (Part 1)
> > > > >
> > > > > It would be easy enough to add a flag to struct xfs_bmap_intent that
> > > > > means "when you're done, unlock and irele the inode".  You'd have to add
> > > > > this capability to any log item that deals with inodes, but only
> > > > > deferred bmap items need this.
> > > > >
> > > > > Log recovery looks like this:
> > > > >
> > > > > ----------------
> > > > > Transaction 0
> > > > > BUI1 unmap file1, offset1, startblock1, length == 200000
> > > > > ----------------
> > > > > <end>
> > > > >
> > > > > What happens if the first bunmapi call doesn't actually unmap
> > > > > everything?  We'll queue yet another bmap intent item (call it BUI2) to
> > > > > finish the work.  This means that we can't just drop the inode when
> > > > > we're done processing BUI1; we have to propagate the "unlock and irele"
> > > > > flag to BUI2.  Ok, so now there's this chaining behavior that wasn't
> > > > > there before.
> > > > >
> > > > > Then I thought about it some more, and wondered what would happen if the
> > > > > log contained two unfinished bmap items for the same inode?  If this
> > > > > happened, you'd deadlock because you've not released the ILOCK for the
> > > > > first unfinished bmap item when you want to acquire the ILOCK for the
> > > > > second unfinished bmap item.
> > > > >
> > > >
> > > > Hm, Ok.. so IIUC the lock is implicitly held across these map/unmap
> > > > dfops at runtime, but that's not necessarily the case during recovery
> > > > due to the dfops chaining behavior.  To me, this raises the question of
> > > > why we need to reorder/split these particular intents during recovery if
> > > > the original operation would have completed under a single inode lock
> > > > cycle.
> > >
> > > Log recovery /did/ function that way (no reordering) until
> > > 509955823cc9c, when it was discovered that we don't have a good way to
> > > reconstruct the dfops chain from the list of recovered log intent items.
> > >
> > > > The comments in xfs_log_recover.c don't really explain it.  From going
> > > > back to commit 509955823cc9c ("xfs: log recovery should replay deferred
> > > > ops in order"), it looks like the issue is that ordering gets screwed up
> > > > for operations that commit multiple intents per transaction and whose
> > > > completion creates new child intents.  At runtime, we'd process the
> > > > intents in the current transaction first and then get to the child
> > > > intents in the order they were created.  If we didn't have the
> > > > recovery-time dfops shuffling implemented in this patch, recovery would
> > > > process the first intent in the log, then all of the child intents of
> > > > that one before getting to the next intent in the log that presumably
> > > > would have been second at runtime.  Am I following that correctly?
> > >
> > > Correct.  If we could figure out how to rebuild the dfops chain (and
> > > thus the ordering requirements), we would be able to eliminate this
> > > freezer stuff entirely.
> > >
> >
> > Ok, thanks.  I was wondering the same, but it's not clear how to do that
> > with just the intent information in the log and big checkpoint
> > transactions.  Perhaps we could if we had a transactional id that was
> > common across intents in a single transaction, or otherwise unique
> > intent ids that would allow an intent to refer to the next in a chain at
> > the time the intent was logged (such that recovery could learn to
> > construct and process chains instead of just individual intents).  I'm
> > curious if we're going to want/need to solve that issue somehow or
> > another more directly eventually as we grow more intents and perhaps
> > come up with new and more complex intent-based operations (with more
> > complex recovery scenarios).
>
> <nod> Too bad that's a log format change. :(
>

Yeah.  It might be worth considering if we have an opportunity to break
format in the future, but not useful for the current code.
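
If we ever do get that chance, even something as small as stamping a
chain id into each intent might be enough for recovery to regroup them.
Purely hypothetical sketch (the chain id field is invented, this is
obviously not the current on-disk format):

        /* hypothetical: the BUI log format with a chain id bolted on */
        struct xfs_bui_log_format {
                uint16_t                bui_type;       /* bui log item type */
                uint16_t                bui_size;       /* size of this item */
                uint32_t                bui_nextents;   /* # extents to bmap */
                uint64_t                bui_id;         /* bui identifier */
                uint64_t                bui_chain_id;   /* hypothetical: shared
                                                           by every intent logged
                                                           by one dfops chain */
                struct xfs_map_extent   bui_fmt[];      /* array of extents */
        };

Recovery could then bucket recovered intents by bui_chain_id and replay
each chain in its original order instead of guessing.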

> > > > I think the rest makes sense from the perspective of not wanting to
> > > > plumb the conditional locking complexity through the dfops/intent
> > > > processing for every intent type that eventually needs it, as opposed to
> > > > containing it in the log recovery code (via dfops magic).  I need to
> > > > think about all that some more, but I'd like to make sure I understand
> > > > why we have this recovery requirement in the first place.
> > >
> > > <nod> It took me a few hours to figure out why that patch was correct
> > > the first time I saw it.  It took me a few hours again(!) when I was
> > > trying to write this series. :/
> >
> > If I follow correctly, the rmap swapext sequence makes for a good
> > example because the bmap completions generate the rmap intents.  A
> > simplified (i.e. single extent remap) runtime transactional flow for
> > that looks something like the following:
> >
> > t1: unmap intent, map intent
> > t2: unmap done, map intent, runmap intent
> > t3: map done, runmap intent, rmap intent
> > t4: runmap done, rmap intent
>
> Work items that aren't touched in a particular transaction do not get
> relogged, so this sequence is:
>
> t1: unmap(0) intent, map(1) intent
> t2: unmap(0) done, runmap(2) intent
> t3: map(1) done, rmap(3) intent
> t4: runmap(2) done
> t5: rmap(3) done
>
> (Maybe I need to take another look at the autorelogging patchset.)
>

Ah, I see.  Thanks.

Brian

> > ...
> >
> > So at runtime these operations complete in the obvious linear order.
> > Intents drop off the front as done items are logged and new intents are
> > tacked on the end.  If we crash between t2 and t3, however, we see t2 in
> > the log and would end up completing the map intent and the subsequently
> > generated rmap intent (that pops up in t3 at runtime) before we process
> > the runmap intent that's still in the log (and was originally in the
> > dfops queue for t2).  I _think_ that's essentially the scenario described
> > in 509955823cc9c,
>
> Yes...
>
> > but it's not clear to me if it's the same runtime example..
>
> ...and you arrived at it with a different scenario. :)
>
> --D
>
> > Brian
> >
> > > --D
> > >
> > > > Brian
> > > >
> > > > > You'd have to drop the ILOCK (but retain the inode reference) between
> > > > > the two bmap items.  I think this is currently impossible because there
> > > > > aren't any transactions that queue multiple bmap intents per
> > > > > transaction, nobody else holds the ilock, and the existing rmap
> > > > > implementation of the swapext ioctl doesn't allow swapping a file with
> > > > > itself.  However...
> > > > >
> > > > > (Part 2)
> > > > >
> > > > > The second thing I thought about is, what if there's a lot of work after
> > > > > transaction 0?  Say the recovery stream looks like:
> > > > >
> > > > > ----------------
> > > > > Transaction 0
> > > > > BUI1 unmap file1, offset1, startblock1, length == 200000
> > > > > ----------------
> > > > > <10000 more intents applied to other things>
> > > > > ----------------
> > > > > <end>
> > > > >
> > > > > This is a problem, because we must drop the ILOCK after completing the
> > > > > BUI1 work so that we can process the 10000 more intents before we can
> > > > > get to BUI2.  We cannot let the tail of the log get pinned on the locked
> > > > > inode, so we must release the ILOCK after we're done with BUI1 and
> > > > > re-acquire it when we're ready to deal with BUI2.
> > > > >
> > > > > This means I need /two/ flags in struct xfs_bmap_intent -- one to say
> > > > > that we need to irele the inode at the end of processing, and another to
> > > > > say that the bmap item needs to grab the ILOCK.  If there's going to be
> > > > > a BUI2 as a result of recovering BUI1, the first flag must be propagated
> > > > > to BUI2, but the second flag must /not/ be propagated.
> > > > >
> > > > > A potential counterargument is that we crammed all 10000 intents into
> > > > > the log without pinning the log, so maybe we can hold the ILOCK.
> > > > > However....
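
(To make the asymmetry concrete, I picture the flag-based approach
looking something like this -- names invented, not from any posted
patch:

        #define XFS_BMAP_INTENT_IRELE   (1U << 0)       /* irele inode when done */
        #define XFS_BMAP_INTENT_ILOCK   (1U << 1)       /* grab the ILOCK first */

        /* requeueing BUI1's leftover work as BUI2: */
        new->bi_flags |= (old->bi_flags & XFS_BMAP_INTENT_IRELE);
        /* XFS_BMAP_INTENT_ILOCK deliberately not copied over */

...which is exactly the sort of per-item state you conclude below
doesn't belong in the ->finish_item code.)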
> > > > >
> > > > > (Part 3)
> > > > >
> > > > > Next, I considered how the swapext log item in the atomic file update
> > > > > patchset would interact with log recovery.  A couple of notes about the
> > > > > swapext log items:
> > > > >
> > > > > 1. They are careful not to create the kinds of bunmap operations that
> > > > > would cause the creation of /more/ BUIs.
> > > > >
> > > > > 2. Every time we log one extent's worth of unmap/remap operations (using
> > > > > deferred bmap log items, naturally) we log a done item for the original
> > > > > swapext intent item and immediately log another swapext intent for the
> > > > > work remaining.
> > > > >
> > > > > I came up with the following sequence of transactions to recover:
> > > > >
> > > > > ----------------
> > > > > Transaction 0
> > > > > SXI0 file1, offset1, file2, offset2, length == 10
> > > > > ----------------
> > > > > Transaction 1
> > > > > BUI1 unmap file1, offset1, startblock1, length == 2
> > > > > BUI2 unmap file2, offset2, startblock2, length == 2
> > > > > BUI3 map file1, offset1, startblock2, length == 2
> > > > > BUI4 map file2, offset2, startblock1, length == 2
> > > > > SXD done (SXI0)
> > > > > SXI5 file1, offset1 + 2, file2, offset2 + 2, length == 8
> > > > > ---------------
> > > > > <end of log>
> > > > >
> > > > > Recovery in this case will end up with BUI[1-4] and SXI5 that still need
> > > > > processing.  Each of the 5 intent items gets its own recovery
> > > > > transaction and dfops freezer chain.  BUI1, BUI3, and SXI5 will each
> > > > > require ilock'd access to file1, which means that each of them will have
> > > > > to have the ability to drop the ILOCK but retain the reference.  The
> > > > > same argument applies to file2 and BUI2, BUI4, and SXI5.  Note also that
> > > > > in our brave new world, file1 and file2 can be the same.
> > > > >
> > > > > For the swapext log item type this inode management is particularly
> > > > > acute because we're certain to have new SXIs created as a result of
> > > > > recovering an SXI.
> > > > >
> > > > > (End)
> > > > >
> > > > > In conclusion, we need a mechanism to drop both the inode lock and the
> > > > > inode reference.  Each log item type that works with inodes will have to
> > > > > set aside 2 flags somewhere in the incore extent structure, and provide
> > > > > more or less the same code to manage that state.  Right now that's just
> > > > > the bmap items, but then the swapext items and the deferred xattr items
> > > > > will have that same requirement when they get added.
> > > > >
> > > > > If we do that, the inode reference and ilock management diverges even
> > > > > further from regular operations.  Regular callers are expected to iget
> > > > > and ilock the inode and maintain that all the way to the end of
> > > > > xfs_trans_commit.  Log recovery, on the other hand, will add a bunch of
> > > > > state handling to some of the log items so that we can drop the ILOCK
> > > > > after the first item, and then drop the inode reference after finishing
> > > > > the last possible user of that inode in the revived dfops chain.
> > > > >
> > > > > On the other hand, keeping the inode management in the defer freezer
> > > > > code means that I keep all that complexity and state management out of
> > > > > the ->finish_item code.  When we go to finish the new incore intents
> > > > > that get created during recovery, the inode reference and ILOCK rules
> > > > > are the same as anywhere else -- the caller is required to grab the
> > > > > inode, the transaction, and the ilock (in that order).
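
(FWIW that ordering is the standard pattern everywhere else in xfs;
abbreviated sketch, error handling trimmed:

        error = xfs_iget(mp, NULL, ino, 0, 0, &ip);     /* take the reference */
        ...
        error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0, &tp);
        ...
        xfs_ilock(ip, XFS_ILOCK_EXCL);                  /* then the lock */
        xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
        /* ...modify the inode, queue deferred ops... */
        error = xfs_trans_commit(tp);   /* finishes dfops, unlocks ip */

So keeping recovered intents on those rules does seem simpler than
teaching every ->finish_item about conditional locking.)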
> > > > >
> > > > > To the extent that recovering log intent items is /not/ the same as
> > > > > running a normal transaction, all that weirdness is concentrated in
> > > > > xfs_log_recover.c and a single function in xfs_defer.c.  The desired
> > > > > behavior is the same across all the log intent item types, so I
> > > > > decided in favor of keeping the centralised behavior and (trying to)
> > > > > contain the wacky.  We don't need all that right away, but we will.
> > > > >
> > > > > Note that I did make it so that we retain the reference always, so
> > > > > there's no longer a need for irele/igrab cycles, which makes capturing
> > > > > the dfops chain less error prone.
> > > > >
> > > > > --D
> > > > >
> > > > > > Brian
> > > > > >
> > > > > > [1] https://lore.kernel.org/linux-xfs/20200429235818.GX6742@magnolia/
> > > > > >
> > > > > > >  fs/xfs/libxfs/xfs_defer.c |   50 +++++++++++++++++++++++++++++++++++++++++++++
> > > > > > >  fs/xfs/libxfs/xfs_defer.h |   10 +++++++++
> > > > > > >  fs/xfs/xfs_bmap_item.c    |    7 ++++--
> > > > > > >  fs/xfs/xfs_icache.c       |   19 +++++++++++++++++
> > > > > > >  4 files changed, 83 insertions(+), 3 deletions(-)
> > > > > > >
> > > > > > >
> > > > > > > diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
> > > > > > > index ea4d28851bbd..72933fdafcb2 100644
> > > > > > > --- a/fs/xfs/libxfs/xfs_defer.c
> > > > > > > +++ b/fs/xfs/libxfs/xfs_defer.c
> > > > > > > @@ -16,6 +16,7 @@
> > > > > > >  #include "xfs_inode.h"
> > > > > > >  #include "xfs_inode_item.h"
> > > > > > >  #include "xfs_trace.h"
> > > > > > > +#include "xfs_icache.h"
> > > > > > >  
> > > > > > >  /*
> > > > > > >   * Deferred Operations in XFS
> > > > > > > @@ -583,8 +584,19 @@ xfs_defer_thaw(
> > > > > > >  	struct xfs_defer_freezer	*dff,
> > > > > > >  	struct xfs_trans		*tp)
> > > > > > >  {
> > > > > > > +	int				i;
> > > > > > > +
> > > > > > >  	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
> > > > > > >  
> > > > > > > +	/* Re-acquire the inode locks. */
> > > > > > > +	for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) {
> > > > > > > +		if (!dff->dff_inodes[i])
> > > > > > > +			break;
> > > > > > > +
> > > > > > > +		dff->dff_ilocks[i] = XFS_ILOCK_EXCL;
> > > > > > > +		xfs_ilock(dff->dff_inodes[i], dff->dff_ilocks[i]);
> > > > > > > +	}
> > > > > > > +
> > > > > > >  	/* Add the dfops items to the transaction. */
> > > > > > >  	list_splice_init(&dff->dff_dfops, &tp->t_dfops);
> > > > > > >  	tp->t_flags |= dff->dff_tpflags;
> > > > > > > @@ -597,5 +609,43 @@ xfs_defer_freezer_finish(
> > > > > > >  	struct xfs_defer_freezer	*dff)
> > > > > > >  {
> > > > > > >  	xfs_defer_cancel_list(mp, &dff->dff_dfops);
> > > > > > > +	xfs_defer_freezer_irele(dff);
> > > > > > >  	kmem_free(dff);
> > > > > > >  }
> > > > > > > +
> > > > > > > +/*
> > > > > > > + * Attach an inode to this deferred ops freezer.  Callers must hold ILOCK_EXCL,
> > > > > > > + * which will be dropped and reacquired when we're ready to thaw the frozen
> > > > > > > + * deferred ops.
> > > > > > > + */
> > > > > > > +int
> > > > > > > +xfs_defer_freezer_ijoin(
> > > > > > > +	struct xfs_defer_freezer	*dff,
> > > > > > > +	struct xfs_inode		*ip)
> > > > > > > +{
> > > > > > > +	unsigned int			i;
> > > > > > > +
> > > > > > > +	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> > > > > > > +
> > > > > > > +	for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) {
> > > > > > > +		if (dff->dff_inodes[i] == ip)
> > > > > > > +			goto out;
> > > > > > > +		if (dff->dff_inodes[i] == NULL)
> > > > > > > +			break;
> > > > > > > +	}
> > > > > > > +
> > > > > > > +	if (i == XFS_DEFER_FREEZER_INODES) {
> > > > > > > +		ASSERT(0);
> > > > > > > +		return -EFSCORRUPTED;
> > > > > > > +	}
> > > > > > > +
> > > > > > > +	/*
> > > > > > > +	 * Attach this inode to the freezer and drop its ILOCK because we
> > > > > > > +	 * assume the caller will need to allocate a transaction.
> > > > > > > +	 */
> > > > > > > +	dff->dff_inodes[i] = ip;
> > > > > > > +	dff->dff_ilocks[i] = 0;
> > > > > > > +out:
> > > > > > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > > > > +	return 0;
> > > > > > > +}
> > > > > > > diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
> > > > > > > index 7ae05e10d750..0052a0313283 100644
> > > > > > > --- a/fs/xfs/libxfs/xfs_defer.h
> > > > > > > +++ b/fs/xfs/libxfs/xfs_defer.h
> > > > > > > @@ -76,6 +76,11 @@ struct xfs_defer_freezer {
> > > > > > >  	/* Deferred ops state saved from the transaction. */
> > > > > > >  	struct list_head	dff_dfops;
> > > > > > >  	unsigned int		dff_tpflags;
> > > > > > > +
> > > > > > > +	/* Inodes to hold when we want to finish the deferred work items. */
> > > > > > > +#define XFS_DEFER_FREEZER_INODES	2
> > > > > > > +	unsigned int		dff_ilocks[XFS_DEFER_FREEZER_INODES];
> > > > > > > +	struct xfs_inode	*dff_inodes[XFS_DEFER_FREEZER_INODES];
> > > > > > >  };
> > > > > > >  
> > > > > > >  /* Functions to freeze a chain of deferred operations for later. */
> > > > > > > @@ -83,5 +88,10 @@ int xfs_defer_freeze(struct xfs_trans *tp, struct xfs_defer_freezer **dffp);
> > > > > > >  void xfs_defer_thaw(struct xfs_defer_freezer *dff, struct xfs_trans *tp);
> > > > > > >  void xfs_defer_freezer_finish(struct xfs_mount *mp,
> > > > > > >  		struct xfs_defer_freezer *dff);
> > > > > > > +int xfs_defer_freezer_ijoin(struct xfs_defer_freezer *dff,
> > > > > > > +		struct xfs_inode *ip);
> > > > > > > +
> > > > > > > +/* These functions must be provided by the xfs implementation. */
> > > > > > > +void xfs_defer_freezer_irele(struct xfs_defer_freezer *dff);
> > > > > > >  
> > > > > > >  #endif	/* __XFS_DEFER_H__ */
> > > > > > > diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
> > > > > > > index c733bdeeeb9b..bbce191d8fcd 100644
> > > > > > > --- a/fs/xfs/xfs_bmap_item.c
> > > > > > > +++ b/fs/xfs/xfs_bmap_item.c
> > > > > > > @@ -530,12 +530,13 @@ xfs_bui_item_recover(
> > > > > > >  	}
> > > > > > >  
> > > > > > >  	error = xlog_recover_trans_commit(tp, dffp);
> > > > > > > -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > > > > -	xfs_irele(ip);
> > > > > > > -	return error;
> > > > > > > +	if (error)
> > > > > > > +		goto err_rele;
> > > > > > > +	return xfs_defer_freezer_ijoin(*dffp, ip);
> > > > > > >  
> > > > > > >  err_inode:
> > > > > > >  	xfs_trans_cancel(tp);
> > > > > > > +err_rele:
> > > > > > >  	if (ip) {
> > > > > > >  		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > > > >  		xfs_irele(ip);
> > > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > > > > > index 17a0b86fe701..b96ddf5ff334 100644
> > > > > > > --- a/fs/xfs/xfs_icache.c
> > > > > > > +++ b/fs/xfs/xfs_icache.c
> > > > > > > @@ -12,6 +12,7 @@
> > > > > > >  #include "xfs_sb.h"
> > > > > > >  #include "xfs_mount.h"
> > > > > > >  #include "xfs_inode.h"
> > > > > > > +#include "xfs_defer.h"
> > > > > > >  #include "xfs_trans.h"
> > > > > > >  #include "xfs_trans_priv.h"
> > > > > > >  #include "xfs_inode_item.h"
> > > > > > > @@ -1847,3 +1848,21 @@ xfs_start_block_reaping(
> > > > > > >  	xfs_queue_eofblocks(mp);
> > > > > > >  	xfs_queue_cowblocks(mp);
> > > > > > >  }
> > > > > > > +
> > > > > > > +/* Release all the inode resources attached to this freezer. */
> > > > > > > +void
> > > > > > > +xfs_defer_freezer_irele(
> > > > > > > +	struct xfs_defer_freezer	*dff)
> > > > > > > +{
> > > > > > > +	unsigned int			i;
> > > > > > > +
> > > > > > > +	for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) {
> > > > > > > +		if (!dff->dff_inodes[i])
> > > > > > > +			break;
> > > > > > > +
> > > > > > > +		if (dff->dff_ilocks[i])
> > > > > > > +			xfs_iunlock(dff->dff_inodes[i], dff->dff_ilocks[i]);
> > > > > > > +		xfs_irele(dff->dff_inodes[i]);
> > > > > > > +		dff->dff_inodes[i] = NULL;
> > > > > > > +	}
> > > > > > > +}
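
Also, to make sure I'm reading the lifecycle right, the flow for an
inode attached to a freezer is now roughly:

        /* in ->iop_recover, e.g. xfs_bui_item_recover(): */
        error = xlog_recover_trans_commit(tp, dffp);    /* dfops frozen into *dffp */
        ...
        return xfs_defer_freezer_ijoin(*dffp, ip);      /* stash ip, drop only the ILOCK */

        /* later, when recovery processes the frozen chains: */
        xfs_defer_thaw(dff, tp);                /* re-take ILOCK_EXCL on dff_inodes[] */
        ...                                     /* finish the spliced-in dfops */
        xfs_defer_freezer_finish(mp, dff);      /* cancel whatever is left, then
                                                   xfs_defer_freezer_irele():
                                                   iunlock (if still locked) + irele */

If that's accurate, the reference indeed can't disappear underneath the
frozen dfops, at the cost of pinning those inodes in memory until
recovery completes.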