On Tue, Oct 12, 2021 at 04:33:34PM -0700, Darrick J. Wong wrote: > From: Darrick J. Wong <djwong@xxxxxxxxxx> > > Instead of assuming that the hardcoded XFS_BTREE_MAXLEVELS value is big > enough to handle the maximally tall rmap btree when all blocks are in > use and maximally shared, let's compute the maximum height assuming the > rmapbt consumes as many blocks as possible. > > Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx> > Reviewed-by: Chandan Babu R <chandan.babu@xxxxxxxxxx> > --- > fs/xfs/libxfs/xfs_btree.c | 34 ++++++++++++++++++++++++ > fs/xfs/libxfs/xfs_btree.h | 2 + > fs/xfs/libxfs/xfs_rmap_btree.c | 55 ++++++++++++++++++++++++--------------- > fs/xfs/libxfs/xfs_trans_resv.c | 13 +++++++++ > fs/xfs/libxfs/xfs_trans_space.h | 7 +++++ > 5 files changed, 90 insertions(+), 21 deletions(-) > > > diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c > index 6ced8f028d47..201b81d54622 100644 > --- a/fs/xfs/libxfs/xfs_btree.c > +++ b/fs/xfs/libxfs/xfs_btree.c > @@ -4531,6 +4531,40 @@ xfs_btree_compute_maxlevels( > return level; > } > > +/* > + * Compute the maximum height of a btree that is allowed to consume up to the > + * given number of blocks. > + */ > +unsigned int > +xfs_btree_compute_maxlevels_size( > + unsigned long long max_btblocks, > + unsigned int leaf_mnr) So "leaf_mnr" is supposed to be the minimum number of records in a leaf? But this gets passed mp->m_rmap_mnr[1], which is the minimum number of keys/ptrs in a node, not a leaf. I'm confused. > +{ > + unsigned long long leaf_blocks = leaf_mnr; > + unsigned long long blocks_left; > + unsigned int maxlevels; > + > + if (max_btblocks < 1) > + return 0; > + > + /* > + * The loop increments maxlevels as long as there would be enough > + * blocks left in the reservation to handle each node block at the > + * current level pointing to the minimum possible number of leaf blocks > + * at the next level down. We start the loop assuming a single-level > + * btree consuming one block. > + */ > + maxlevels = 1; > + blocks_left = max_btblocks - 1; > + while (leaf_blocks < blocks_left) { > + maxlevels++; > + blocks_left -= leaf_blocks; > + leaf_blocks *= leaf_mnr; > + } > + > + return maxlevels; Yup, I'm definitely confused. We also have: xfs_btree_calc_size(limits, len) xfs_btree_compute_maxlevels(limits, len) And they do something similar but subtly different. They aren't clearly documented, either, so from reading the code: xfs_btree_calc_size is calculating the btree block usage for a discrete count of items based on the leaf and node population values from mp->m_rmap_mnr, etc. It uses a division based algorithm recs = limits[0] // min recs per block for (level = 0; len > 1; level++) { do_div(len, recs) recs = limits[1] // min ptrs per node rval += len; } return rval (why does this even calculate level?) So it returns the number of blocks the btree will consume to index a given number of discrete blocks. xfs_btree_compute_maxlevels() is basically: len = len / limits[0] // record blocks in level 0 for (level = 1; len > 1; level++) len = len / limits[1] // node blocks in level n return level So it returns how many levels are required to index a specific number of discrete blocks given a specific leaf/node population. But what does xfs_btree_compute_maxlevels_size do? I'm really not sure from the desription, the calculation or the parameters passed to it. Even a table doesn't tell me: say 10000 records, leaf_mnr = 10 loop blocks_left leaf_blocks max_levels 0 (at init) 9999 10 1 1 9989 100 2 2 9889 1000 3 3 8889 10000 4 Breaks out on (leaf_blocks > blocks_left) So, after much head scratching, I *think* what this function is trying to do is take into account the case where we have a single block shared by reflink N times, such that the entire AG is made up of rmap records pointing to all the owners. We're trying to determine the size is the height of the tree if we index enough leaf records to consume all the free space in the AG? Which then means we don't care what the number of records are in the leaf nodes, we only need to know how many leaf blocks there are and how many interior nodes we consume to index them? IOWs, we're counting the number of leaf blocks we can index at each level based on the _minimum number of pointers_ we can hold in a _node_? If so, then the naming leaves a lot to be desired here. The variables all being named "leaf" even though they are being passed node limits and are calculating node level indexing limits and not leaf space consumption completely threw me in the wrong direction. I just spent the best part of 90 minutes working all this out from first principles because nothing is obvious about why this code is correct. Everything screamed "wrong wrong wrong" at me until I finally understood what was being calculated. And now I know, it still screams "wrong wrong wrong" at me. So: /* * Given a number of available blocks for the btree to consume with * records and pointers, calculate the height of the tree needed to * index all the records that space can hold based on the number of * pointers each interior node holds. * * We start by assuming a single level tree consumes a single block, * then track the number of blocks each node level consumes until we * no longer have space to store the next node level. At this point, * we are indexing all the leaf blocks in the space, and there's no * more free space to split the tree any further. That's our maximum * btree height. */ unsigned int xfs_btree_space_to_height( unsigned int *limits, unsigned long long leaf_blocks) { unsigned long long node_blocks = limits[1]; unsigned long long blocks_left = leaf_blocks - 1; unsigned int height = 1; if (leaf_blocks < 1) return 0; while (node_blocks < blocks_left) { height++; blocks_left -= node_blocks; node_blocks *= limits[1]; } return height; } Oh, yeah, I made the parameters the same as the other btree height/size functions, too, because.... > + unsigned int val; > + > + if (!xfs_has_rmapbt(mp)) { > + mp->m_rmap_maxlevels = 0; > + return; > + } > + > + if (xfs_has_reflink(mp)) { > + /* > + * Compute the asymptotic maxlevels for an rmap btree on a > + * filesystem that supports reflink. > + * > + * On a reflink filesystem, each AG block can have up to 2^32 > + * (per the refcount record format) owners, which means that > + * theoretically we could face up to 2^64 rmap records. > + * However, we're likely to run out of blocks in the AG long > + * before that happens, which means that we must compute the > + * max height based on what the btree will look like if it > + * consumes almost all the blocks in the AG due to maximal > + * sharing factor. > + */ > + val = xfs_btree_compute_maxlevels_size(mp->m_sb.sb_agblocks, > + mp->m_rmap_mnr[1]); > + } else { > + /* > + * If there's no block sharing, compute the maximum rmapbt > + * height assuming one rmap record per AG block. > + */ > + val = xfs_btree_compute_maxlevels(mp->m_rmap_mnr, > + mp->m_sb.sb_agblocks); This just looks weird with the same parameters in reverse order to these two functions... > + } > + > + mp->m_rmap_maxlevels = val; > } Also, this function becomes simpler if it just returns the maxlevels value and the caller writes it into mp->m_rmap_maxlevels. > > /* Calculate the refcount btree size for some records. */ > diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c > index 5e300daa2559..97bd17d84a23 100644 > --- a/fs/xfs/libxfs/xfs_trans_resv.c > +++ b/fs/xfs/libxfs/xfs_trans_resv.c > @@ -814,6 +814,16 @@ xfs_trans_resv_calc( > struct xfs_mount *mp, > struct xfs_trans_resv *resp) > { > + unsigned int rmap_maxlevels = mp->m_rmap_maxlevels; > + > + /* > + * In the early days of rmap+reflink, we always set the rmap maxlevels > + * to 9 even if the AG was small enough that it would never grow to > + * that height. > + */ > + if (xfs_has_rmapbt(mp) && xfs_has_reflink(mp)) > + mp->m_rmap_maxlevels = XFS_OLD_REFLINK_RMAP_MAXLEVELS; > + > /* > * The following transactions are logged in physical format and > * require a permanent reservation on space. > @@ -916,4 +926,7 @@ xfs_trans_resv_calc( > resp->tr_clearagi.tr_logres = xfs_calc_clear_agi_bucket_reservation(mp); > resp->tr_growrtzero.tr_logres = xfs_calc_growrtzero_reservation(mp); > resp->tr_growrtfree.tr_logres = xfs_calc_growrtfree_reservation(mp); > + > + /* Put everything back the way it was. This goes at the end. */ > + mp->m_rmap_maxlevels = rmap_maxlevels; > } Why play games like this? We want the reservations to go down in size if the btrees don't reach "XFS_OLD_REFLINK_RMAP_MAXLEVELS" size. The reason isn't mentioned in the commit message... > diff --git a/fs/xfs/libxfs/xfs_trans_space.h b/fs/xfs/libxfs/xfs_trans_space.h > index 50332be34388..440c9c390b86 100644 > --- a/fs/xfs/libxfs/xfs_trans_space.h > +++ b/fs/xfs/libxfs/xfs_trans_space.h > @@ -17,6 +17,13 @@ > /* Adding one rmap could split every level up to the top of the tree. */ > #define XFS_RMAPADD_SPACE_RES(mp) ((mp)->m_rmap_maxlevels) > > +/* > + * Note that we historically set m_rmap_maxlevels to 9 when reflink was > + * enabled, so we must preserve this behavior to avoid changing the transaction > + * space reservations. > + */ > +#define XFS_OLD_REFLINK_RMAP_MAXLEVELS (9) 9. Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx