On Fri, Sep 17, 2021 at 06:30:10PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@xxxxxxxxxx>
> 
> Replace the statically-sized btree cursor zone with dynamically sized
> allocations so that we can reduce the memory overhead for per-AG bt
> cursors while handling very tall btrees for rt metadata.

Hmmmmm. We do a *lot* of btree cursor allocation and freeing under
load. Keeping that in a single slab rather than using heap memory is a
good idea for stuff like this for many reasons...

I mean, if we are creating a million inodes a second, a rough
back-of-the-envelope calculation says we are doing 3-4 million btree
cursor instantiations a second. That's a lot of short term churn on the
heap that we don't really need to subject it to. And even a few extra
instructions in a path called millions of times a second adds up to a
lot of extra runtime overhead.

So....

> Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> ---
>  fs/xfs/libxfs/xfs_btree.c |   40 ++++++++++++++++++++++++++++++++--------
>  fs/xfs/libxfs/xfs_btree.h |    2 --
>  fs/xfs/xfs_super.c        |   11 +----------
>  3 files changed, 33 insertions(+), 20 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> index 2486ba22c01d..f9516828a847 100644
> --- a/fs/xfs/libxfs/xfs_btree.c
> +++ b/fs/xfs/libxfs/xfs_btree.c
> @@ -23,11 +23,6 @@
>  #include "xfs_btree_staging.h"
>  #include "xfs_ag.h"
>  
> -/*
> - * Cursor allocation zone.
> - */
> -kmem_zone_t	*xfs_btree_cur_zone;
> -
>  /*
>   * Btree magic numbers.
>   */
> @@ -379,7 +374,7 @@ xfs_btree_del_cursor(
>  		kmem_free(cur->bc_ops);
>  	if (!(cur->bc_flags & XFS_BTREE_LONG_PTRS) && cur->bc_ag.pag)
>  		xfs_perag_put(cur->bc_ag.pag);
> -	kmem_cache_free(xfs_btree_cur_zone, cur);
> +	kmem_free(cur);
>  }
>  
>  /*
> @@ -4927,6 +4922,32 @@ xfs_btree_has_more_records(
>  	return block->bb_u.s.bb_rightsib != cpu_to_be32(NULLAGBLOCK);
>  }
>  
> +/* Compute the maximum allowed height for a given btree type. */
> +static unsigned int
> +xfs_btree_maxlevels(
> +	struct xfs_mount	*mp,
> +	xfs_btnum_t		btnum)
> +{
> +	switch (btnum) {
> +	case XFS_BTNUM_BNO:
> +	case XFS_BTNUM_CNT:
> +		return mp->m_ag_maxlevels;
> +	case XFS_BTNUM_BMAP:
> +		return max(mp->m_bm_maxlevels[XFS_DATA_FORK],
> +			   mp->m_bm_maxlevels[XFS_ATTR_FORK]);
> +	case XFS_BTNUM_INO:
> +	case XFS_BTNUM_FINO:
> +		return M_IGEO(mp)->inobt_maxlevels;
> +	case XFS_BTNUM_RMAP:
> +		return mp->m_rmap_maxlevels;
> +	case XFS_BTNUM_REFC:
> +		return mp->m_refc_maxlevels;
> +	default:
> +		ASSERT(0);
> +		return XFS_BTREE_MAXLEVELS;
> +	}
> +}
> +
>  /* Allocate a new btree cursor of the appropriate size. */
>  struct xfs_btree_cur *
>  xfs_btree_alloc_cursor(
> @@ -4935,13 +4956,16 @@ xfs_btree_alloc_cursor(
>  	xfs_btnum_t	btnum)
>  {
>  	struct xfs_btree_cur	*cur;
> +	unsigned int		maxlevels = xfs_btree_maxlevels(mp, btnum);
>  
> -	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
> +	ASSERT(maxlevels <= XFS_BTREE_MAXLEVELS);
> +
> +	cur = kmem_zalloc(xfs_btree_cur_sizeof(maxlevels), KM_NOFS);

Instead of multiple dynamic runtime calculations to determine the size
to allocate from the heap, which then has to select a slab based on
size, why don't we just pre-calculate the max size of the cursor at XFS
module init and use that for the btree cursor slab size? The memory
overhead of the cursor isn't an issue because we've been maximally
sizing it since forever, and the whole point of a slab cache is to
minimise allocation overhead of frequently allocated objects.
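Something like this completely untested sketch is the sort of thing I
mean - the init helper name is made up, but xfs_btree_cur_sizeof() is
the helper this patch already adds and XFS_BTREE_MAXLEVELS already
bounds every value the per-type maxlevels calculation can return:

	/*
	 * Untested sketch: size the cursor slab once at module init from
	 * the worst case tree height, instead of computing an allocation
	 * size on every cursor instantiation.
	 */
	kmem_zone_t	*xfs_btree_cur_zone;

	/* hypothetical init helper, called from module init */
	int __init
	xfs_btree_init_cur_zone(void)
	{
		xfs_btree_cur_zone = kmem_cache_create("xfs_btree_cur",
				xfs_btree_cur_sizeof(XFS_BTREE_MAXLEVELS),
				0, 0, NULL);
		if (!xfs_btree_cur_zone)
			return -ENOMEM;
		return 0;
	}

xfs_btree_alloc_cursor() can then keep the existing
kmem_cache_zalloc(xfs_btree_cur_zone, ...) call, and the per-type
maxlevels calculation only needs to feed the bounds checking, not the
allocation size.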
It seems to me that we really want to retain these properties of the
cursor allocator, not give them up just as we're in the process of
making other modifications that will hit the path more frequently than
it's ever been hit before... I like all the dynamically sized guards
that this series places in the cursor, but I don't think we want to
change the way we allocate the cursors just to support that.

FWIW, an example of avoidable runtime calculation overhead of constants
is xlog_calc_unit_res(). These values are actually constant for a given
transaction reservation, but at 1.6 million transactions a second it
shows up at #20 on the flat profile of functions using the most CPU:

   0.71%  [kernel]  [k] xlog_calc_unit_res

0.71% of 32 CPUs for 1.6 million calculations a second of the same
constants is a non-trivial amount of CPU time to spend doing
unnecessary repeated calculations. Even though the btree cursor
constant calculations are simpler than the log res calculations, they
are more frequent. Hence on general principles of efficiency, I don't
think we want to be replacing high frequency, low overhead slab/zone
based allocations with heap allocations that require repeated constant
calculations and size->slab redirection....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx