On Wed, Oct 13, 2021 at 06:25:21PM +1100, Dave Chinner wrote: > On Tue, Oct 12, 2021 at 04:33:34PM -0700, Darrick J. Wong wrote: > > From: Darrick J. Wong <djwong@xxxxxxxxxx> > > > > Instead of assuming that the hardcoded XFS_BTREE_MAXLEVELS value is big > > enough to handle the maximally tall rmap btree when all blocks are in > > use and maximally shared, let's compute the maximum height assuming the > > rmapbt consumes as many blocks as possible. > > > > Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx> > > Reviewed-by: Chandan Babu R <chandan.babu@xxxxxxxxxx> > > --- > > fs/xfs/libxfs/xfs_btree.c | 34 ++++++++++++++++++++++++ > > fs/xfs/libxfs/xfs_btree.h | 2 + > > fs/xfs/libxfs/xfs_rmap_btree.c | 55 ++++++++++++++++++++++++--------------- > > fs/xfs/libxfs/xfs_trans_resv.c | 13 +++++++++ > > fs/xfs/libxfs/xfs_trans_space.h | 7 +++++ > > 5 files changed, 90 insertions(+), 21 deletions(-) > > > > > > diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c > > index 6ced8f028d47..201b81d54622 100644 > > --- a/fs/xfs/libxfs/xfs_btree.c > > +++ b/fs/xfs/libxfs/xfs_btree.c > > @@ -4531,6 +4531,40 @@ xfs_btree_compute_maxlevels( > > return level; > > } > > > > +/* > > + * Compute the maximum height of a btree that is allowed to consume up to the > > + * given number of blocks. > > + */ > > +unsigned int > > +xfs_btree_compute_maxlevels_size( > > + unsigned long long max_btblocks, > > + unsigned int leaf_mnr) > > So "leaf_mnr" is supposed to be the minimum number of records in > a leaf? > > But this gets passed mp->m_rmap_mnr[1], which is the minimum number > of keys/ptrs in a node, not a leaf. I'm confused. That should have been "node_mnr". Sorry. :( > > +{ > > + unsigned long long leaf_blocks = leaf_mnr; > > + unsigned long long blocks_left; > > + unsigned int maxlevels; > > + > > + if (max_btblocks < 1) > > + return 0; > > + > > + /* > > + * The loop increments maxlevels as long as there would be enough > > + * blocks left in the reservation to handle each node block at the > > + * current level pointing to the minimum possible number of leaf blocks > > + * at the next level down. We start the loop assuming a single-level > > + * btree consuming one block. > > + */ > > + maxlevels = 1; > > + blocks_left = max_btblocks - 1; > > + while (leaf_blocks < blocks_left) { > > + maxlevels++; > > + blocks_left -= leaf_blocks; > > + leaf_blocks *= leaf_mnr; > > + } > > + > > + return maxlevels; > > Yup, I'm definitely confused. We also have: > > xfs_btree_calc_size(limits, len) > xfs_btree_compute_maxlevels(limits, len) > > And they do something similar but subtly different. They aren't > clearly documented, either, so from reading the code: > > xfs_btree_calc_size is calculating the btree block usage for a > discrete count of items based on the leaf and node population values > from mp->m_rmap_mnr, etc. It uses a division based algorithm > > recs = limits[0] // min recs per block > for (level = 0; len > 1; level++) { > do_div(len, recs) > recs = limits[1] // min ptrs per node > rval += len; > } > return rval > > (why does this even calculate level?) > > So it returns the number of blocks the btree will consume to > index a given number of discrete blocks. > > xfs_btree_compute_maxlevels() is basically: > > len = len / limits[0] // record blocks in level 0 > for (level = 1; len > 1; level++) > len = len / limits[1] // node blocks in level n > return level > > So it returns how many levels are required to index a specific > number of discrete blocks given a specific leaf/node population. > > But what does xfs_btree_compute_maxlevels_size do? I'm really not > sure from the desription, the calculation or the parameters passed > to it. Even a table doesn't tell me: > > say 10000 records, leaf_mnr = 10 > > loop blocks_left leaf_blocks max_levels > 0 (at init) 9999 10 1 > 1 9989 100 2 > 2 9889 1000 3 > 3 8889 10000 4 > Breaks out on (leaf_blocks > blocks_left) > > So, after much head scratching, I *think* what this function is > trying to do is take into account the case where we have a single > block shared by reflink N times, such that the entire AG is made up > of rmap records pointing to all the owners. We're trying to > determine the size is the height of the tree if we index enough leaf > records to consume all the free space in the AG? > > Which then means we don't care what the number of records are in the > leaf nodes, we only need to know how many leaf blocks there are and > how many interior nodes we consume to index them? > > IOWs, we're counting the number of leaf blocks we can index at each > level based on the _minimum number of pointers_ we can hold in a > _node_? Yes. > If so, then the naming leaves a lot to be desired here. The > variables all being named "leaf" even though they are being passed > node limits and are calculating node level indexing limits and not > leaf space consumption completely threw me in the wrong direction. > I just spent the best part of 90 minutes working all this out > from first principles because nothing is obvious about why this code > is correct. Everything screamed "wrong wrong wrong" at me until > I finally understood what was being calculated. And now I know, it > still screams "wrong wrong wrong" at me. > > So: > > /* > * Given a number of available blocks for the btree to consume with > * records and pointers, calculate the height of the tree needed to > * index all the records that space can hold based on the number of > * pointers each interior node holds. > * > * We start by assuming a single level tree consumes a single block, > * then track the number of blocks each node level consumes until we > * no longer have space to store the next node level. At this point, > * we are indexing all the leaf blocks in the space, and there's no > * more free space to split the tree any further. That's our maximum > * btree height. Ah, yes, that's a much better description and name than the ones I put on the function. > */ > unsigned int > xfs_btree_space_to_height( > unsigned int *limits, > unsigned long long leaf_blocks) > { > unsigned long long node_blocks = limits[1]; > unsigned long long blocks_left = leaf_blocks - 1; > unsigned int height = 1; > > if (leaf_blocks < 1) > return 0; > > while (node_blocks < blocks_left) { > height++; > blocks_left -= node_blocks; > node_blocks *= limits[1]; > } > > return height; > } > > Oh, yeah, I made the parameters the same as the other btree > height/size functions, too, because.... > > > + unsigned int val; > > + > > + if (!xfs_has_rmapbt(mp)) { > > + mp->m_rmap_maxlevels = 0; > > + return; > > + } > > + > > + if (xfs_has_reflink(mp)) { > > + /* > > + * Compute the asymptotic maxlevels for an rmap btree on a > > + * filesystem that supports reflink. > > + * > > + * On a reflink filesystem, each AG block can have up to 2^32 > > + * (per the refcount record format) owners, which means that > > + * theoretically we could face up to 2^64 rmap records. > > + * However, we're likely to run out of blocks in the AG long > > + * before that happens, which means that we must compute the > > + * max height based on what the btree will look like if it > > + * consumes almost all the blocks in the AG due to maximal > > + * sharing factor. > > + */ > > + val = xfs_btree_compute_maxlevels_size(mp->m_sb.sb_agblocks, > > + mp->m_rmap_mnr[1]); > > + } else { > > + /* > > + * If there's no block sharing, compute the maximum rmapbt > > + * height assuming one rmap record per AG block. > > + */ > > + val = xfs_btree_compute_maxlevels(mp->m_rmap_mnr, > > + mp->m_sb.sb_agblocks); > > This just looks weird with the same parameters in reverse order to > these two functions... TBH I intentionally reversed the order to make it obvious which was which, so we wouldn't end up with... uint xfs_btree_compute_maxlevels(uint *limits, unsigned int len); uint xfs_btree_space_to_height(uint *limits, unsigned int blocks); uint xfs_btree_calc_size(uint *limits, unsigned int len); ...three functions with the same type signatures. Three years have flown by since I wrote this patch, and now the signatures have diverged enough to make it at least somewhat distinct. IOWs, I'll adopt your version. :) > > + } > > + > > + mp->m_rmap_maxlevels = val; > > } > > Also, this function becomes simpler if it just returns the maxlevels > value and the caller writes it into mp->m_rmap_maxlevels. Done. > > > > /* Calculate the refcount btree size for some records. */ > > diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c > > index 5e300daa2559..97bd17d84a23 100644 > > --- a/fs/xfs/libxfs/xfs_trans_resv.c > > +++ b/fs/xfs/libxfs/xfs_trans_resv.c > > @@ -814,6 +814,16 @@ xfs_trans_resv_calc( > > struct xfs_mount *mp, > > struct xfs_trans_resv *resp) > > { > > + unsigned int rmap_maxlevels = mp->m_rmap_maxlevels; > > + > > + /* > > + * In the early days of rmap+reflink, we always set the rmap maxlevels > > + * to 9 even if the AG was small enough that it would never grow to > > + * that height. > > + */ > > + if (xfs_has_rmapbt(mp) && xfs_has_reflink(mp)) > > + mp->m_rmap_maxlevels = XFS_OLD_REFLINK_RMAP_MAXLEVELS; > > + > > /* > > * The following transactions are logged in physical format and > > * require a permanent reservation on space. > > @@ -916,4 +926,7 @@ xfs_trans_resv_calc( > > resp->tr_clearagi.tr_logres = xfs_calc_clear_agi_bucket_reservation(mp); > > resp->tr_growrtzero.tr_logres = xfs_calc_growrtzero_reservation(mp); > > resp->tr_growrtfree.tr_logres = xfs_calc_growrtfree_reservation(mp); > > + > > + /* Put everything back the way it was. This goes at the end. */ > > + mp->m_rmap_maxlevels = rmap_maxlevels; > > } > > Why play games like this? We want the reservations to go down in > size if the btrees don't reach "XFS_OLD_REFLINK_RMAP_MAXLEVELS" > size. The reason isn't mentioned in the commit message... I think I'll record the reason why in the code itself. /* * In the early days of rmap+reflink, we always set the rmap * maxlevels to 9 even if the AG was small enough that it would * never grow to that height. Transaction reservation sizes * influence the minimum log size calculation, which influences * the size of the log that mkfs creates. Use the old value * here to ensure that newly formatted small filesystems will * mount on older kernels. */ if (xfs_has_rmapbt(mp) && xfs_has_reflink(mp)) mp->m_rmap_maxlevels = XFS_OLD_REFLINK_RMAP_MAXLEVELS; > > > diff --git a/fs/xfs/libxfs/xfs_trans_space.h b/fs/xfs/libxfs/xfs_trans_space.h > > index 50332be34388..440c9c390b86 100644 > > --- a/fs/xfs/libxfs/xfs_trans_space.h > > +++ b/fs/xfs/libxfs/xfs_trans_space.h > > @@ -17,6 +17,13 @@ > > /* Adding one rmap could split every level up to the top of the tree. */ > > #define XFS_RMAPADD_SPACE_RES(mp) ((mp)->m_rmap_maxlevels) > > > > +/* > > + * Note that we historically set m_rmap_maxlevels to 9 when reflink was > > + * enabled, so we must preserve this behavior to avoid changing the transaction > > + * space reservations. > > + */ > > +#define XFS_OLD_REFLINK_RMAP_MAXLEVELS (9) > > 9. Assuming you meant '9 without the parentheses' here, fixed. Thanks for slogging through all that blocks_to_height stuff. :) --D > > Cheers, > > Dave. > -- > Dave Chinner > david@xxxxxxxxxxxxx