On Wed, Oct 26, 2022 at 11:40:29AM +1100, Dave Chinner wrote: > On Mon, Oct 24, 2022 at 02:33:25PM -0700, Darrick J. Wong wrote: > > From: Darrick J. Wong <djwong@xxxxxxxxxx> > > > > Just prior to committing the reflink code into upstream, the xfs > > maintainer at the time requested that I find a way to shard the refcount > > records into two domains -- one for records tracking shared extents, and > > a second for tracking CoW staging extents. The idea here was to > > minimize mount time CoW reclamation by pushing all the CoW records to > > the right edge of the keyspace, and it was accomplished by setting the > > upper bit in rc_startblock. We don't allow AGs to have more than 2^31 > > blocks, so the bit was free. > > > > Unfortunately, this was a very late addition to the codebase, so most of > > the refcount record processing code still treats rc_startblock as a u32 > > and pays no attention to whether or not the upper bit (the cow flag) is > > set. This is a weakness is theoretically exploitable, since we're not > > fully validating the incoming metadata records. > > > > Fuzzing demonstrates practical exploits of this weakness. If the cow > > flag of a node block key record is corrupted, a lookup operation can go > > to the wrong record block and start returning records from the wrong > > cow/shared domain. This causes the math to go all wrong (since cow > > domain is still implicit in the upper bit of rc_startblock) and we can > > crash the kernel by tricking xfs into jumping into a nonexistent AG and > > tripping over xfs_perag_get(mp, <nonexistent AG>) returning NULL. > > > > To fix this, start tracking the domain as an explicit part of struct > > xfs_refcount_irec, adjust all refcount functions to check the domain > > of a returned record, and alter the function definitions to accept them > > where necessary. > > > > Found by fuzzing keys[2].cowflag = add in xfs/464. > > > > Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx> > > --- > > fs/xfs/libxfs/xfs_refcount.c | 221 ++++++++++++++++++++++++------------ > > fs/xfs/libxfs/xfs_refcount.h | 9 + > > fs/xfs/libxfs/xfs_refcount_btree.c | 26 ++++ > > fs/xfs/libxfs/xfs_types.h | 10 ++ > > fs/xfs/scrub/refcount.c | 22 ++-- > > fs/xfs/xfs_trace.h | 48 ++++++-- > > 6 files changed, 235 insertions(+), 101 deletions(-) > > My first thought was that this is complex enough that it needs to be > split up. Reading the very first hunk: > > > > > > > diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c > > index 64b910caafaa..fa75e785652b 100644 > > --- a/fs/xfs/libxfs/xfs_refcount.c > > +++ b/fs/xfs/libxfs/xfs_refcount.c > > @@ -46,13 +46,16 @@ STATIC int __xfs_refcount_cow_free(struct xfs_btree_cur *rcur, > > int > > xfs_refcount_lookup_le( > > struct xfs_btree_cur *cur, > > + enum xfs_rcext_domain domain, > > xfs_agblock_t bno, > > int *stat) > > { > > - trace_xfs_refcount_lookup(cur->bc_mp, cur->bc_ag.pag->pag_agno, bno, > > + trace_xfs_refcount_lookup(cur->bc_mp, cur->bc_ag.pag->pag_agno, > > + bno | (domain == XFS_RCDOM_COW ? XFS_REFC_COW_START : 0), > > XFS_LOOKUP_LE); > > cur->bc_rec.rc.rc_startblock = bno; > > cur->bc_rec.rc.rc_blockcount = 0; > > + cur->bc_rec.rc.rc_domain = domain; > > return xfs_btree_lookup(cur, XFS_LOOKUP_LE, stat); > > I found that I had to go looking through the patch to find what the > definitions of xfs_rcext_domain, XFS_RCDOM_COW and rc_domain > actually were, because without knowing what the actual new > structures being introduced actually were this didn't make a whole > lot of sense. <nod> Sorry about that. > Hence I think I'd like this to be broken up into a patch that > introduces the rc_domain and the helpers that build/split the > domain/startblock information in the irec, and a followup patch that > modifies the implementation to use the new domain interfaces to make > it a bit easier to see where the significant changes in the code > actually are. Ok, I've done that. The mechanical change itself is still pretty large, but I think it's a bit more straightforward. The tracepoint updtes and the change to _get_rec callers to compare the desired domain against the one returned are each separate patches. > I also suspect that trace_xfs_refcount_lookup() should be passed the > domain directly rather than encoding it at every call site of the > tracepoint, or it code encoded in a helper such as > xfs_refcount_domain_to_block()... Done. > For consistency, calling the enums XFS_REFC_DOMAIN_{COW/SHARED} > would be more consistent with the naming used in other refcount > btree constants. Done. > > } > > > > @@ -63,13 +66,16 @@ xfs_refcount_lookup_le( > > int > > xfs_refcount_lookup_ge( > > struct xfs_btree_cur *cur, > > + enum xfs_rcext_domain domain, > > xfs_agblock_t bno, > > int *stat) > > { > > - trace_xfs_refcount_lookup(cur->bc_mp, cur->bc_ag.pag->pag_agno, bno, > > + trace_xfs_refcount_lookup(cur->bc_mp, cur->bc_ag.pag->pag_agno, > > + bno | (domain == XFS_RCDOM_COW ? XFS_REFC_COW_START : 0), > > XFS_LOOKUP_GE); > > cur->bc_rec.rc.rc_startblock = bno; > > cur->bc_rec.rc.rc_blockcount = 0; > > + cur->bc_rec.rc.rc_domain = domain; > > return xfs_btree_lookup(cur, XFS_LOOKUP_GE, stat); > > } > > > > @@ -80,13 +86,16 @@ xfs_refcount_lookup_ge( > > int > > xfs_refcount_lookup_eq( > > struct xfs_btree_cur *cur, > > + enum xfs_rcext_domain domain, > > xfs_agblock_t bno, > > int *stat) > > { > > - trace_xfs_refcount_lookup(cur->bc_mp, cur->bc_ag.pag->pag_agno, bno, > > + trace_xfs_refcount_lookup(cur->bc_mp, cur->bc_ag.pag->pag_agno, > > + bno | (domain == XFS_RCDOM_COW ? XFS_REFC_COW_START : 0), > > XFS_LOOKUP_LE); > > cur->bc_rec.rc.rc_startblock = bno; > > cur->bc_rec.rc.rc_blockcount = 0; > > + cur->bc_rec.rc.rc_domain = domain; > > return xfs_btree_lookup(cur, XFS_LOOKUP_EQ, stat); > > } > > > > @@ -96,7 +105,17 @@ xfs_refcount_btrec_to_irec( > > const union xfs_btree_rec *rec, > > struct xfs_refcount_irec *irec) > > { > > - irec->rc_startblock = be32_to_cpu(rec->refc.rc_startblock); > > + __u32 start; > > + > > + start = be32_to_cpu(rec->refc.rc_startblock); > > + if (start & XFS_REFC_COW_START) { > > + start &= ~XFS_REFC_COW_START; > > + irec->rc_domain = XFS_RCDOM_COW; > > + } else { > > + irec->rc_domain = XFS_RCDOM_SHARED; > > + } > > xfs_refcount_block_to_domain()? This is the only place where this happens, so I've left it this way. > > > > + > > + irec->rc_startblock = start; > > irec->rc_blockcount = be32_to_cpu(rec->refc.rc_blockcount); > > irec->rc_refcount = be32_to_cpu(rec->refc.rc_refcount); > > } > > @@ -114,7 +133,6 @@ xfs_refcount_get_rec( > > struct xfs_perag *pag = cur->bc_ag.pag; > > union xfs_btree_rec *rec; > > int error; > > - xfs_agblock_t realstart; > > > > error = xfs_btree_get_rec(cur, &rec, stat); > > if (error || !*stat) > > @@ -124,22 +142,18 @@ xfs_refcount_get_rec( > > if (irec->rc_blockcount == 0 || irec->rc_blockcount > MAXREFCEXTLEN) > > goto out_bad_rec; > > > > - /* handle special COW-staging state */ > > - realstart = irec->rc_startblock; > > - if (realstart & XFS_REFC_COW_START) { > > - if (irec->rc_refcount != 1) > > - goto out_bad_rec; > > - realstart &= ~XFS_REFC_COW_START; > > - } else if (irec->rc_refcount < 2) { > > + /* handle special COW-staging domain */ > > + if (irec->rc_domain == XFS_RCDOM_COW && irec->rc_refcount != 1) > > + goto out_bad_rec; > > + if (irec->rc_domain == XFS_RCDOM_SHARED && irec->rc_refcount < 2) > > goto out_bad_rec; > > - } > > > > /* check for valid extent range, including overflow */ > > - if (!xfs_verify_agbno(pag, realstart)) > > + if (!xfs_verify_agbno(pag, irec->rc_startblock)) > > goto out_bad_rec; > > - if (realstart > realstart + irec->rc_blockcount) > > + if (irec->rc_startblock > irec->rc_startblock + irec->rc_blockcount) > > goto out_bad_rec; > > - if (!xfs_verify_agbno(pag, realstart + irec->rc_blockcount - 1)) > > + if (!xfs_verify_agbno(pag, irec->rc_startblock + irec->rc_blockcount - 1)) > > goto out_bad_rec; > > > > if (irec->rc_refcount == 0 || irec->rc_refcount > MAXREFCOUNT) > > @@ -169,12 +183,19 @@ xfs_refcount_update( > > struct xfs_refcount_irec *irec) > > { > > union xfs_btree_rec rec; > > + __u32 start; > > int error; > > > > trace_xfs_refcount_update(cur->bc_mp, cur->bc_ag.pag->pag_agno, irec); > > - rec.refc.rc_startblock = cpu_to_be32(irec->rc_startblock); > > + > > + start = irec->rc_startblock & ~XFS_REFC_COW_START; > > + if (irec->rc_domain == XFS_RCDOM_COW) > > + start |= XFS_REFC_COW_START; > > Yeah, this definitely looks like we want a > xfs_refcount_domain_to_block() helper... Yep. > > + > > + rec.refc.rc_startblock = cpu_to_be32(start); > > rec.refc.rc_blockcount = cpu_to_be32(irec->rc_blockcount); > > rec.refc.rc_refcount = cpu_to_be32(irec->rc_refcount); > > + > > error = xfs_btree_update(cur, &rec); > > if (error) > > trace_xfs_refcount_update_error(cur->bc_mp, > > @@ -196,9 +217,12 @@ xfs_refcount_insert( > > int error; > > > > trace_xfs_refcount_insert(cur->bc_mp, cur->bc_ag.pag->pag_agno, irec); > > + > > cur->bc_rec.rc.rc_startblock = irec->rc_startblock; > > cur->bc_rec.rc.rc_blockcount = irec->rc_blockcount; > > cur->bc_rec.rc.rc_refcount = irec->rc_refcount; > > + cur->bc_rec.rc.rc_domain = irec->rc_domain; > > Does a struct copy make more sense here now? Perhaps, but not in a bugfix patchset. > [.....] > > > @@ -600,8 +636,6 @@ xfs_refcount_merge_right_extent( > > return error; > > } > > > > -#define XFS_FIND_RCEXT_SHARED 1 > > -#define XFS_FIND_RCEXT_COW 2 > > /* > > * Find the left extent and the one after it (cleft). This function assumes > > * that we've already split any extent crossing agbno. > > @@ -611,16 +645,16 @@ xfs_refcount_find_left_extents( > > struct xfs_btree_cur *cur, > > struct xfs_refcount_irec *left, > > struct xfs_refcount_irec *cleft, > > + enum xfs_rcext_domain domain, > > xfs_agblock_t agbno, > > - xfs_extlen_t aglen, > > - int flags) > > + xfs_extlen_t aglen) > > { > > struct xfs_refcount_irec tmp; > > int error; > > int found_rec; > > > > left->rc_startblock = cleft->rc_startblock = NULLAGBLOCK; > > - error = xfs_refcount_lookup_le(cur, agbno - 1, &found_rec); > > + error = xfs_refcount_lookup_le(cur, domain, agbno - 1, &found_rec); > > if (error) > > goto out_error; > > if (!found_rec) > > @@ -634,11 +668,13 @@ xfs_refcount_find_left_extents( > > goto out_error; > > } > > > > + if (tmp.rc_domain != domain) > > + return 0; > > if (xfs_refc_next(&tmp) != agbno) > > return 0; > > - if ((flags & XFS_FIND_RCEXT_SHARED) && tmp.rc_refcount < 2) > > + if (domain == XFS_RCDOM_SHARED && tmp.rc_refcount < 2) > > return 0; > > - if ((flags & XFS_FIND_RCEXT_COW) && tmp.rc_refcount > 1) > > + if (domain == XFS_RCDOM_COW && tmp.rc_refcount > 1) > > return 0; > > Hmmm - this pattern is repeated in a couple of places. Perhaps a > helper is in order? Ok, I'll clean this up. Actually -- upon second examination, these two if tests aren't needed anymore, because they were a rudimentary means of ignoring records from the wrong domain. We added an explicit domain check above, so this can go away entirely. _get_rec will return -EFSCORRUPTED if the domain/refcount are inconsistent, so we're protected there. > [....] > > > diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c > > index 316c1ec0c3c2..b0818063aa20 100644 > > --- a/fs/xfs/libxfs/xfs_refcount_btree.c > > +++ b/fs/xfs/libxfs/xfs_refcount_btree.c > > @@ -155,12 +155,31 @@ xfs_refcountbt_init_high_key_from_rec( > > key->refc.rc_startblock = cpu_to_be32(x); > > } > > > > +static inline __u32 > > +xfs_refcountbt_encode_startblock( > > + struct xfs_btree_cur *cur) > > +{ > > + __u32 start; > > + > > + /* > > + * low level btree operations need to handle the generic btree range > > + * query functions (which set rc_domain == -1U), so we check that the > > + * domain is /not/ shared. > > + */ > > + start = cur->bc_rec.rc.rc_startblock & ~XFS_REFC_COW_START; > > + if (cur->bc_rec.rc.rc_domain != XFS_RCDOM_SHARED) > > + start |= XFS_REFC_COW_START; > > + return start; > > +} > > Oh, there is a xfs_refcount_domain_to_block() helper already - it > just needs to be passed the agbno + domain, not a btree cursor :) Done. > [....] > > > diff --git a/fs/xfs/scrub/refcount.c b/fs/xfs/scrub/refcount.c > > index 8ab55e791ea8..9cf4be9cbb89 100644 > > --- a/fs/xfs/scrub/refcount.c > > +++ b/fs/xfs/scrub/refcount.c > > @@ -334,20 +334,19 @@ xchk_refcountbt_rec( > > struct xfs_refcount_irec irec; > > xfs_agblock_t *cow_blocks = bs->private; > > struct xfs_perag *pag = bs->cur->bc_ag.pag; > > - bool has_cowflag; > > > > xfs_refcount_btrec_to_irec(rec, &irec); > > > > /* Only CoW records can have refcount == 1. */ > > - has_cowflag = (irec.rc_startblock & XFS_REFC_COW_START); > > - if ((irec.rc_refcount == 1 && !has_cowflag) || > > - (irec.rc_refcount != 1 && has_cowflag)) > > + if (irec.rc_domain == XFS_RCDOM_SHARED && irec.rc_refcount == 1) > > xchk_btree_set_corrupt(bs->sc, bs->cur, 0); > > - if (has_cowflag) > > + if (irec.rc_domain == XFS_RCDOM_COW) { > > + if (irec.rc_refcount != 1) > > + xchk_btree_set_corrupt(bs->sc, bs->cur, 0); > > (*cow_blocks) += irec.rc_blockcount; > > + } > > Hmmm, that looks like another of those shared/cow domain + refcount > checks like I pointed out above... Yep. Separate cleanup patch. > > > > /* Check the extent. */ > > - irec.rc_startblock &= ~XFS_REFC_COW_START; > > if (irec.rc_startblock + irec.rc_blockcount <= irec.rc_startblock || > > !xfs_verify_agbno(pag, irec.rc_startblock) || > > !xfs_verify_agbno(pag, irec.rc_startblock + irec.rc_blockcount - 1)) > > @@ -420,7 +419,6 @@ xchk_xref_is_cow_staging( > > xfs_extlen_t len) > > { > > struct xfs_refcount_irec rc; > > - bool has_cowflag; > > int has_refcount; > > int error; > > > > @@ -428,8 +426,8 @@ xchk_xref_is_cow_staging( > > return; > > > > /* Find the CoW staging extent. */ > > - error = xfs_refcount_lookup_le(sc->sa.refc_cur, > > - agbno + XFS_REFC_COW_START, &has_refcount); > > + error = xfs_refcount_lookup_le(sc->sa.refc_cur, XFS_RCDOM_COW, agbno, > > + &has_refcount); > > if (!xchk_should_check_xref(sc, &error, &sc->sa.refc_cur)) > > return; > > if (!has_refcount) { > > @@ -446,8 +444,7 @@ xchk_xref_is_cow_staging( > > } > > > > /* CoW flag must be set, refcount must be 1. */ > > - has_cowflag = (rc.rc_startblock & XFS_REFC_COW_START); > > - if (!has_cowflag || rc.rc_refcount != 1) > > + if (rc.rc_domain != XFS_RCDOM_COW || rc.rc_refcount != 1) > > xchk_btree_xref_set_corrupt(sc, sc->sa.refc_cur, 0); > > That looks like the same validation logic check as the above hunk > in xchk_refcountbt_rec()... ...but _get_rec will check this for us, so this check collapses to: if (rc.rc_domain != XFS_REFC_DOMAIN_COW) xchk_btree_xref_set_corrupt(...); > That's just a first pass - I haven't really concentrated on > correctness that much yet, the patch is really doing too much for me > to get a good picture of everything... <nod> I'll have a new patchset with changes broken up into smaller pieces tomorrow. --D > -Dave. > -- > Dave Chinner > david@xxxxxxxxxxxxx