Re: [PATCH 3/5] xfs: track cow/shared record domains explicitly in xfs_refcount_irec

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Oct 24, 2022 at 02:33:25PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@xxxxxxxxxx>
> 
> Just prior to committing the reflink code into upstream, the xfs
> maintainer at the time requested that I find a way to shard the refcount
> records into two domains -- one for records tracking shared extents, and
> a second for tracking CoW staging extents.  The idea here was to
> minimize mount time CoW reclamation by pushing all the CoW records to
> the right edge of the keyspace, and it was accomplished by setting the
> upper bit in rc_startblock.  We don't allow AGs to have more than 2^31
> blocks, so the bit was free.
> 
> Unfortunately, this was a very late addition to the codebase, so most of
> the refcount record processing code still treats rc_startblock as a u32
> and pays no attention to whether or not the upper bit (the cow flag) is
> set.  This is a weakness is theoretically exploitable, since we're not
> fully validating the incoming metadata records.
> 
> Fuzzing demonstrates practical exploits of this weakness.  If the cow
> flag of a node block key record is corrupted, a lookup operation can go
> to the wrong record block and start returning records from the wrong
> cow/shared domain.  This causes the math to go all wrong (since cow
> domain is still implicit in the upper bit of rc_startblock) and we can
> crash the kernel by tricking xfs into jumping into a nonexistent AG and
> tripping over xfs_perag_get(mp, <nonexistent AG>) returning NULL.
> 
> To fix this, start tracking the domain as an explicit part of struct
> xfs_refcount_irec, adjust all refcount functions to check the domain
> of a returned record, and alter the function definitions to accept them
> where necessary.
> 
> Found by fuzzing keys[2].cowflag = add in xfs/464.
> 
> Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> ---
>  fs/xfs/libxfs/xfs_refcount.c       |  221 ++++++++++++++++++++++++------------
>  fs/xfs/libxfs/xfs_refcount.h       |    9 +
>  fs/xfs/libxfs/xfs_refcount_btree.c |   26 ++++
>  fs/xfs/libxfs/xfs_types.h          |   10 ++
>  fs/xfs/scrub/refcount.c            |   22 ++--
>  fs/xfs/xfs_trace.h                 |   48 ++++++--
>  6 files changed, 235 insertions(+), 101 deletions(-)

My first thought was that this is complex enough that it needs to be
split up. Reading the very first hunk:

> 
> 
> diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
> index 64b910caafaa..fa75e785652b 100644
> --- a/fs/xfs/libxfs/xfs_refcount.c
> +++ b/fs/xfs/libxfs/xfs_refcount.c
> @@ -46,13 +46,16 @@ STATIC int __xfs_refcount_cow_free(struct xfs_btree_cur *rcur,
>  int
>  xfs_refcount_lookup_le(
>  	struct xfs_btree_cur	*cur,
> +	enum xfs_rcext_domain	domain,
>  	xfs_agblock_t		bno,
>  	int			*stat)
>  {
> -	trace_xfs_refcount_lookup(cur->bc_mp, cur->bc_ag.pag->pag_agno, bno,
> +	trace_xfs_refcount_lookup(cur->bc_mp, cur->bc_ag.pag->pag_agno,
> +			bno | (domain == XFS_RCDOM_COW ? XFS_REFC_COW_START : 0),
>  			XFS_LOOKUP_LE);
>  	cur->bc_rec.rc.rc_startblock = bno;
>  	cur->bc_rec.rc.rc_blockcount = 0;
> +	cur->bc_rec.rc.rc_domain = domain;
>  	return xfs_btree_lookup(cur, XFS_LOOKUP_LE, stat);

I found that I had to go looking through the patch to find what the
definitions of xfs_rcext_domain, XFS_RCDOM_COW and rc_domain
actually were, because without knowing what the actual new
structures being introduced actually were this didn't make a whole
lot of sense.

Hence I think I'd like this to be broken up into a patch that
introduces the rc_domain and the helpers that build/split the
domain/startblock information in the irec, and a followup patch that
modifies the implementation to use the new domain interfaces to make
it a bit easier to see where the significant changes in the code
actually are.

I also suspect that trace_xfs_refcount_lookup() should be passed the
domain directly rather than encoding it at every call site of the
tracepoint, or it code encoded in a helper such as
xfs_refcount_domain_to_block()...

For consistency, calling the enums XFS_REFC_DOMAIN_{COW/SHARED}
would be more consistent with the naming used in other refcount
btree constants.

>  }
>  
> @@ -63,13 +66,16 @@ xfs_refcount_lookup_le(
>  int
>  xfs_refcount_lookup_ge(
>  	struct xfs_btree_cur	*cur,
> +	enum xfs_rcext_domain	domain,
>  	xfs_agblock_t		bno,
>  	int			*stat)
>  {
> -	trace_xfs_refcount_lookup(cur->bc_mp, cur->bc_ag.pag->pag_agno, bno,
> +	trace_xfs_refcount_lookup(cur->bc_mp, cur->bc_ag.pag->pag_agno,
> +			bno | (domain == XFS_RCDOM_COW ? XFS_REFC_COW_START : 0),
>  			XFS_LOOKUP_GE);
>  	cur->bc_rec.rc.rc_startblock = bno;
>  	cur->bc_rec.rc.rc_blockcount = 0;
> +	cur->bc_rec.rc.rc_domain = domain;
>  	return xfs_btree_lookup(cur, XFS_LOOKUP_GE, stat);
>  }
>  
> @@ -80,13 +86,16 @@ xfs_refcount_lookup_ge(
>  int
>  xfs_refcount_lookup_eq(
>  	struct xfs_btree_cur	*cur,
> +	enum xfs_rcext_domain	domain,
>  	xfs_agblock_t		bno,
>  	int			*stat)
>  {
> -	trace_xfs_refcount_lookup(cur->bc_mp, cur->bc_ag.pag->pag_agno, bno,
> +	trace_xfs_refcount_lookup(cur->bc_mp, cur->bc_ag.pag->pag_agno,
> +			bno | (domain == XFS_RCDOM_COW ? XFS_REFC_COW_START : 0),
>  			XFS_LOOKUP_LE);
>  	cur->bc_rec.rc.rc_startblock = bno;
>  	cur->bc_rec.rc.rc_blockcount = 0;
> +	cur->bc_rec.rc.rc_domain = domain;
>  	return xfs_btree_lookup(cur, XFS_LOOKUP_EQ, stat);
>  }
>  
> @@ -96,7 +105,17 @@ xfs_refcount_btrec_to_irec(
>  	const union xfs_btree_rec	*rec,
>  	struct xfs_refcount_irec	*irec)
>  {
> -	irec->rc_startblock = be32_to_cpu(rec->refc.rc_startblock);
> +	__u32				start;
> +
> +	start = be32_to_cpu(rec->refc.rc_startblock);
> +	if (start & XFS_REFC_COW_START) {
> +		start &= ~XFS_REFC_COW_START;
> +		irec->rc_domain = XFS_RCDOM_COW;
> +	} else {
> +		irec->rc_domain = XFS_RCDOM_SHARED;
> +	}

xfs_refcount_block_to_domain()?



> +
> +	irec->rc_startblock = start;
>  	irec->rc_blockcount = be32_to_cpu(rec->refc.rc_blockcount);
>  	irec->rc_refcount = be32_to_cpu(rec->refc.rc_refcount);
>  }
> @@ -114,7 +133,6 @@ xfs_refcount_get_rec(
>  	struct xfs_perag		*pag = cur->bc_ag.pag;
>  	union xfs_btree_rec		*rec;
>  	int				error;
> -	xfs_agblock_t			realstart;
>  
>  	error = xfs_btree_get_rec(cur, &rec, stat);
>  	if (error || !*stat)
> @@ -124,22 +142,18 @@ xfs_refcount_get_rec(
>  	if (irec->rc_blockcount == 0 || irec->rc_blockcount > MAXREFCEXTLEN)
>  		goto out_bad_rec;
>  
> -	/* handle special COW-staging state */
> -	realstart = irec->rc_startblock;
> -	if (realstart & XFS_REFC_COW_START) {
> -		if (irec->rc_refcount != 1)
> -			goto out_bad_rec;
> -		realstart &= ~XFS_REFC_COW_START;
> -	} else if (irec->rc_refcount < 2) {
> +	/* handle special COW-staging domain */
> +	if (irec->rc_domain == XFS_RCDOM_COW && irec->rc_refcount != 1)
> +		goto out_bad_rec;
> +	if (irec->rc_domain == XFS_RCDOM_SHARED && irec->rc_refcount < 2)
>  		goto out_bad_rec;
> -	}
>  
>  	/* check for valid extent range, including overflow */
> -	if (!xfs_verify_agbno(pag, realstart))
> +	if (!xfs_verify_agbno(pag, irec->rc_startblock))
>  		goto out_bad_rec;
> -	if (realstart > realstart + irec->rc_blockcount)
> +	if (irec->rc_startblock > irec->rc_startblock + irec->rc_blockcount)
>  		goto out_bad_rec;
> -	if (!xfs_verify_agbno(pag, realstart + irec->rc_blockcount - 1))
> +	if (!xfs_verify_agbno(pag, irec->rc_startblock + irec->rc_blockcount - 1))
>  		goto out_bad_rec;
>  
>  	if (irec->rc_refcount == 0 || irec->rc_refcount > MAXREFCOUNT)
> @@ -169,12 +183,19 @@ xfs_refcount_update(
>  	struct xfs_refcount_irec	*irec)
>  {
>  	union xfs_btree_rec	rec;
> +	__u32			start;
>  	int			error;
>  
>  	trace_xfs_refcount_update(cur->bc_mp, cur->bc_ag.pag->pag_agno, irec);
> -	rec.refc.rc_startblock = cpu_to_be32(irec->rc_startblock);
> +
> +	start = irec->rc_startblock & ~XFS_REFC_COW_START;
> +	if (irec->rc_domain == XFS_RCDOM_COW)
> +		start |= XFS_REFC_COW_START;

Yeah, this definitely looks like we want a
xfs_refcount_domain_to_block() helper...

> +
> +	rec.refc.rc_startblock = cpu_to_be32(start);
>  	rec.refc.rc_blockcount = cpu_to_be32(irec->rc_blockcount);
>  	rec.refc.rc_refcount = cpu_to_be32(irec->rc_refcount);
> +
>  	error = xfs_btree_update(cur, &rec);
>  	if (error)
>  		trace_xfs_refcount_update_error(cur->bc_mp,
> @@ -196,9 +217,12 @@ xfs_refcount_insert(
>  	int				error;
>  
>  	trace_xfs_refcount_insert(cur->bc_mp, cur->bc_ag.pag->pag_agno, irec);
> +
>  	cur->bc_rec.rc.rc_startblock = irec->rc_startblock;
>  	cur->bc_rec.rc.rc_blockcount = irec->rc_blockcount;
>  	cur->bc_rec.rc.rc_refcount = irec->rc_refcount;
> +	cur->bc_rec.rc.rc_domain = irec->rc_domain;

Does a struct copy make more sense here now?

[.....]

> @@ -600,8 +636,6 @@ xfs_refcount_merge_right_extent(
>  	return error;
>  }
>  
> -#define XFS_FIND_RCEXT_SHARED	1
> -#define XFS_FIND_RCEXT_COW	2
>  /*
>   * Find the left extent and the one after it (cleft).  This function assumes
>   * that we've already split any extent crossing agbno.
> @@ -611,16 +645,16 @@ xfs_refcount_find_left_extents(
>  	struct xfs_btree_cur		*cur,
>  	struct xfs_refcount_irec	*left,
>  	struct xfs_refcount_irec	*cleft,
> +	enum xfs_rcext_domain		domain,
>  	xfs_agblock_t			agbno,
> -	xfs_extlen_t			aglen,
> -	int				flags)
> +	xfs_extlen_t			aglen)
>  {
>  	struct xfs_refcount_irec	tmp;
>  	int				error;
>  	int				found_rec;
>  
>  	left->rc_startblock = cleft->rc_startblock = NULLAGBLOCK;
> -	error = xfs_refcount_lookup_le(cur, agbno - 1, &found_rec);
> +	error = xfs_refcount_lookup_le(cur, domain, agbno - 1, &found_rec);
>  	if (error)
>  		goto out_error;
>  	if (!found_rec)
> @@ -634,11 +668,13 @@ xfs_refcount_find_left_extents(
>  		goto out_error;
>  	}
>  
> +	if (tmp.rc_domain != domain)
> +		return 0;
>  	if (xfs_refc_next(&tmp) != agbno)
>  		return 0;
> -	if ((flags & XFS_FIND_RCEXT_SHARED) && tmp.rc_refcount < 2)
> +	if (domain == XFS_RCDOM_SHARED && tmp.rc_refcount < 2)
>  		return 0;
> -	if ((flags & XFS_FIND_RCEXT_COW) && tmp.rc_refcount > 1)
> +	if (domain == XFS_RCDOM_COW && tmp.rc_refcount > 1)
>  		return 0;

Hmmm - this pattern is repeated in a couple of places. Perhaps a
helper is in order?

[....]

> diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
> index 316c1ec0c3c2..b0818063aa20 100644
> --- a/fs/xfs/libxfs/xfs_refcount_btree.c
> +++ b/fs/xfs/libxfs/xfs_refcount_btree.c
> @@ -155,12 +155,31 @@ xfs_refcountbt_init_high_key_from_rec(
>  	key->refc.rc_startblock = cpu_to_be32(x);
>  }
>  
> +static inline __u32
> +xfs_refcountbt_encode_startblock(
> +	struct xfs_btree_cur	*cur)
> +{
> +	__u32			start;
> +
> +	/*
> +	 * low level btree operations need to handle the generic btree range
> +	 * query functions (which set rc_domain == -1U), so we check that the
> +	 * domain is /not/ shared.
> +	 */
> +	start = cur->bc_rec.rc.rc_startblock & ~XFS_REFC_COW_START;
> +	if (cur->bc_rec.rc.rc_domain != XFS_RCDOM_SHARED)
> +		start |= XFS_REFC_COW_START;
> +	return start;
> +}

Oh, there is a xfs_refcount_domain_to_block() helper already - it
just needs to be passed the agbno + domain, not a btree cursor :)

[....]

> diff --git a/fs/xfs/scrub/refcount.c b/fs/xfs/scrub/refcount.c
> index 8ab55e791ea8..9cf4be9cbb89 100644
> --- a/fs/xfs/scrub/refcount.c
> +++ b/fs/xfs/scrub/refcount.c
> @@ -334,20 +334,19 @@ xchk_refcountbt_rec(
>  	struct xfs_refcount_irec irec;
>  	xfs_agblock_t		*cow_blocks = bs->private;
>  	struct xfs_perag	*pag = bs->cur->bc_ag.pag;
> -	bool			has_cowflag;
>  
>  	xfs_refcount_btrec_to_irec(rec, &irec);
>  
>  	/* Only CoW records can have refcount == 1. */
> -	has_cowflag = (irec.rc_startblock & XFS_REFC_COW_START);
> -	if ((irec.rc_refcount == 1 && !has_cowflag) ||
> -	    (irec.rc_refcount != 1 && has_cowflag))
> +	if (irec.rc_domain == XFS_RCDOM_SHARED && irec.rc_refcount == 1)
>  		xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
> -	if (has_cowflag)
> +	if (irec.rc_domain == XFS_RCDOM_COW) {
> +		if (irec.rc_refcount != 1)
> +			xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
>  		(*cow_blocks) += irec.rc_blockcount;
> +	}

Hmmm, that looks like another of those shared/cow domain + refcount
checks like I pointed out above...

>  
>  	/* Check the extent. */
> -	irec.rc_startblock &= ~XFS_REFC_COW_START;
>  	if (irec.rc_startblock + irec.rc_blockcount <= irec.rc_startblock ||
>  	    !xfs_verify_agbno(pag, irec.rc_startblock) ||
>  	    !xfs_verify_agbno(pag, irec.rc_startblock + irec.rc_blockcount - 1))
> @@ -420,7 +419,6 @@ xchk_xref_is_cow_staging(
>  	xfs_extlen_t			len)
>  {
>  	struct xfs_refcount_irec	rc;
> -	bool				has_cowflag;
>  	int				has_refcount;
>  	int				error;
>  
> @@ -428,8 +426,8 @@ xchk_xref_is_cow_staging(
>  		return;
>  
>  	/* Find the CoW staging extent. */
> -	error = xfs_refcount_lookup_le(sc->sa.refc_cur,
> -			agbno + XFS_REFC_COW_START, &has_refcount);
> +	error = xfs_refcount_lookup_le(sc->sa.refc_cur, XFS_RCDOM_COW, agbno,
> +			&has_refcount);
>  	if (!xchk_should_check_xref(sc, &error, &sc->sa.refc_cur))
>  		return;
>  	if (!has_refcount) {
> @@ -446,8 +444,7 @@ xchk_xref_is_cow_staging(
>  	}
>  
>  	/* CoW flag must be set, refcount must be 1. */
> -	has_cowflag = (rc.rc_startblock & XFS_REFC_COW_START);
> -	if (!has_cowflag || rc.rc_refcount != 1)
> +	if (rc.rc_domain != XFS_RCDOM_COW || rc.rc_refcount != 1)
>  		xchk_btree_xref_set_corrupt(sc, sc->sa.refc_cur, 0);

That looks like the same validation logic check as the above hunk
in xchk_refcountbt_rec()...

That's just a first pass - I haven't really concentrated on
correctness that much yet, the patch is really doing too much for me
to get a good picture of everything...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx



[Index of Archives]     [XFS Filesystem Development (older mail)]     [Linux Filesystem Development]     [Linux Audio Users]     [Yosemite Trails]     [Linux Kernel]     [Linux RAID]     [Linux SCSI]


  Powered by Linux