Re: [PATCH 11/14] xfs: dynamically allocate cursors based on maxlevels

On Wed, Sep 22, 2021 at 10:38:21AM -0700, Darrick J. Wong wrote:
> On Tue, Sep 21, 2021 at 09:06:35AM +1000, Dave Chinner wrote:
> > On Fri, Sep 17, 2021 at 06:30:10PM -0700, Darrick J. Wong wrote:
> > >  /* Allocate a new btree cursor of the appropriate size. */
> > >  struct xfs_btree_cur *
> > >  xfs_btree_alloc_cursor(
> > > @@ -4935,13 +4956,16 @@ xfs_btree_alloc_cursor(
> > >  	xfs_btnum_t		btnum)
> > >  {
> > >  	struct xfs_btree_cur	*cur;
> > > +	unsigned int		maxlevels = xfs_btree_maxlevels(mp, btnum);
> > >  
> > > -	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
> > > +	ASSERT(maxlevels <= XFS_BTREE_MAXLEVELS);
> > > +
> > > +	cur = kmem_zalloc(xfs_btree_cur_sizeof(maxlevels), KM_NOFS);
> > 
> > Instead of multiple dynamic runtime calculations to determine the
> > size to allocate from the heap, which then has to select a slab
> > based on size, why don't we just pre-calculate the max size of
> > the cursor at XFS module init and use that for the btree cursor slab
> > size?
> 
> As part of developing the realtime rmapbt and reflink btrees, I computed
> the maximum theoretical btree height for a maximally sized realtime
> volume.  For a realtime volume with 2^52 blocks and a 1k block size, I
> estimate that you'd need a 11-level rtrefcount btree cursor.  The rtrmap
> btree cursor would have to be 28 levels high.  Using 4k blocks instead
> of 1k blocks, it's not so bad -- 8 for rtrefcount and 17 for rtrmap.

I'm going to state straight out that 1k block sizes for the rt
device are insane. That's not what that device was intended to
support, ever. It was intended for workloads with -large-,
consistent extent sizes in large contiguous runs, not tiny, small
random allocations of individual blocks.

So if we are going to be talking about the overhead of RT block
management for new functionality, we need to start by putting
reasonable limits on the block sizes that the RT device will support
such features for. Because while a btree might scale to 2^52 x 1kB
blocks, the RT allocation bitmap sure as hell doesn't. It probably
doesn't even scale at all well above a few million blocks for
general usage.

Hence I don't think it's worth optimising for these cases when we
think about maximum btree sizes for the cursors - those btrees can
provide their own cursor slab to allocate from if it comes to it.

Really, if we want to scale RT devices to insane sizes, we need to
move to an AG based structure for it which breaks up the bitmaps and
summary files into regions to keep the overhead and max sizes under
control.

> I don't recall exactly what Chandan said the maximum bmbt height would
> need to be to support really large data fork mapping structures, but
> based on my worst case estimate of 2^54 single-block mappings and a 1k
> blocksize, you'd need a 12-level bmbt cursor.  For 4k blocks, you'd need
> only 8 levels.

Yup, it's not significantly different to what we have now.

> The current XFS_BTREE_MAXLEVELS is 9, which just so happens to fit in
> 248 bytes.  I will rework this patch to make xfs_btree_cur_zone supply
> 256-byte cursors, and the btree code will continue using the zone if 256
> bytes is enough space for the cursor.
>
> If we decide later on that we need a zone for larger cursors, I think
> the next logical size up (512 bytes) will fit 25 levels, but let's wait
> to get there first.

I suspect you may misunderstand how SLUB caches work. SLUB packs
non-power-of-two sized objects tightly to natural alignment (8 bytes).
e.g.:

$ sudo grep xfs_btree_cur /proc/slabinfo
xfs_btree_cur       1152   1152    224   36    2 : tunables    0 0    0 : slabdata     32     32      0

SLUB is using an order-1 base page (2 pages), with 36 cursor objects
in it. 36 * 224 = 8064 bytes, which means it is packed as tightly as
possible. It is not using 256 byte objects for these btree cursors.

If we allocate these 224 byte objects _from the heap_, however, then
the 256 byte heap slab will be selected, which means the object is
then padded to 256 bytes -by the heap-. The SLUB allocator does not
pad the objects, it's the heap granularity that adds padding to the
objects.

This implicit padding of heap objects is another reason we don't
want to use the heap for anything we frequently allocate or allocate
in large amounts. It can result in substantial amounts of wasted
memory.

IOWs, we don't actually care about object size granularity for slab
cache allocated objects.

However, if we really want to look at memory usage of struct
xfs_btree_cur, pahole tells me:

	/* size: 224, cachelines: 4, members: 13 */

Where are the extra 24 bytes coming from on your kernel?

It also tells me that a bunch of space can be taken out of it:

- 4 byte hole that bc_btnum can be moved into.
- bc_blocklog is set but not used, so it can go, too.
- bc_ag.refc.nr_ops doesn't need to be an unsigned long
- optimising bc_ra state. That just tracks whether
  the current cursor has already done sibling readahead - it's two
  bits per level, held in an int8_t per level. Could be a pair of
  int16_t bitmasks if maxlevel is 12, which would save another 8
  bytes. If maxlevel == 28 as per the rt case above, then a pair of
  int32_t bitmasks saves 4 bytes for 12 levels and 20 bytes
  for 28 levels...

Hence if we're concerned about space usage of the btree cursor,
these seem like low hanging fruit.

Maybe the best thing here, as Christoph mentioned, is to have a set
of btree cursor zones for the different size limits. All the per-ag
btrees have the same (small) size limits, while the BMBT is bigger.
And the RT btrees when they arrive will be bigger again. Given that
we already allocate the cursors based on the type of btree they are
going to walk, this seems like it would be pretty easy to do,
something like the patch below, perhaps?

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

xfs: per-btree cursor slab caches
---
 fs/xfs/libxfs/xfs_alloc_btree.c    |  3 ++-
 fs/xfs/libxfs/xfs_bmap_btree.c     |  4 +++-
 fs/xfs/libxfs/xfs_btree.c          | 28 +++++++++++++++++++++++-----
 fs/xfs/libxfs/xfs_btree.h          |  6 +++++-
 fs/xfs/libxfs/xfs_ialloc_btree.c   |  4 +++-
 fs/xfs/libxfs/xfs_refcount_btree.c |  4 +++-
 fs/xfs/libxfs/xfs_rmap_btree.c     |  4 +++-
 fs/xfs/xfs_super.c                 | 30 ++++++++++++++++++++++++++----
 8 files changed, 68 insertions(+), 15 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
index 6746fd735550..53ead7b98238 100644
--- a/fs/xfs/libxfs/xfs_alloc_btree.c
+++ b/fs/xfs/libxfs/xfs_alloc_btree.c
@@ -20,6 +20,7 @@
 #include "xfs_trans.h"
 #include "xfs_ag.h"
 
+struct kmem_cache	*xfs_allocbt_cur_zone;
 
 STATIC struct xfs_btree_cur *
 xfs_allocbt_dup_cursor(
@@ -477,7 +478,7 @@ xfs_allocbt_init_common(
 
 	ASSERT(btnum == XFS_BTNUM_BNO || btnum == XFS_BTNUM_CNT);
 
-	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
+	cur = kmem_cache_zalloc(xfs_allocbt_cur_zone, GFP_NOFS | __GFP_NOFAIL);
 
 	cur->bc_tp = tp;
 	cur->bc_mp = mp;
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
index 72444b8b38a6..e3f7107ce2e2 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.c
+++ b/fs/xfs/libxfs/xfs_bmap_btree.c
@@ -22,6 +22,8 @@
 #include "xfs_trace.h"
 #include "xfs_rmap.h"
 
+struct kmem_cache	*xfs_bmbt_cur_zone;
+
 /*
  * Convert on-disk form of btree root to in-memory form.
  */
@@ -552,7 +554,7 @@ xfs_bmbt_init_cursor(
 	struct xfs_btree_cur	*cur;
 	ASSERT(whichfork != XFS_COW_FORK);
 
-	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
+	cur = kmem_cache_zalloc(xfs_bmbt_cur_zone, GFP_NOFS | __GFP_NOFAIL);
 
 	cur->bc_tp = tp;
 	cur->bc_mp = mp;
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 298395481713..7ef19f365e33 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -23,10 +23,6 @@
 #include "xfs_btree_staging.h"
 #include "xfs_ag.h"
 
-/*
- * Cursor allocation zone.
- */
-kmem_zone_t	*xfs_btree_cur_zone;
 
 /*
  * Btree magic numbers.
@@ -379,7 +375,29 @@ xfs_btree_del_cursor(
 		kmem_free(cur->bc_ops);
 	if (!(cur->bc_flags & XFS_BTREE_LONG_PTRS) && cur->bc_ag.pag)
 		xfs_perag_put(cur->bc_ag.pag);
-	kmem_cache_free(xfs_btree_cur_zone, cur);
+
+	switch (cur->bc_btnum) {
+	case XFS_BTNUM_BMAP:
+		kmem_cache_free(xfs_bmbt_cur_zone, cur);
+		break;
+	case XFS_BTNUM_BNO:
+	case XFS_BTNUM_CNT:
+		kmem_cache_free(xfs_allocbt_cur_zone, cur);
+		break;
+	case XFS_BTNUM_INOBT:
+	case XFS_BTNUM_FINOBT:
+		kmem_cache_free(xfs_inobt_cur_zone, cur);
+		break;
+	case XFS_BTNUM_RMAP:
+		kmem_cache_free(xfs_rmapbt_cur_zone, cur);
+		break;
+	case XFS_BTNUM_REFCNT:
+		kmem_cache_free(xfs_refcntbt_cur_zone, cur);
+		break;
+	default:
+		ASSERT(0);
+		break;
+	}
 }
 
 /*
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 4eaf8517f850..acdf087c853a 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -13,7 +13,11 @@ struct xfs_trans;
 struct xfs_ifork;
 struct xfs_perag;
 
-extern kmem_zone_t	*xfs_btree_cur_zone;
+extern struct kmem_cache	*xfs_allocbt_cur_zone;
+extern struct kmem_cache	*xfs_inobt_cur_zone;
+extern struct kmem_cache	*xfs_bmbt_cur_zone;
+extern struct kmem_cache	*xfs_rmapbt_cur_zone;
+extern struct kmem_cache	*xfs_refcntbt_cur_zone;
 
 /*
  * Generic key, ptr and record wrapper structures.
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
index 27190840c5d8..5258696f153e 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.c
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
@@ -22,6 +22,8 @@
 #include "xfs_rmap.h"
 #include "xfs_ag.h"
 
+struct kmem_cache	*xfs_inobt_cur_zone;
+
 STATIC int
 xfs_inobt_get_minrecs(
 	struct xfs_btree_cur	*cur,
@@ -432,7 +434,7 @@ xfs_inobt_init_common(
 {
 	struct xfs_btree_cur	*cur;
 
-	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
+	cur = kmem_cache_zalloc(xfs_inobt_cur_zone, GFP_NOFS | __GFP_NOFAIL);
 	cur->bc_tp = tp;
 	cur->bc_mp = mp;
 	cur->bc_btnum = btnum;
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
index 1ef9b99962ab..20667f173040 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.c
+++ b/fs/xfs/libxfs/xfs_refcount_btree.c
@@ -21,6 +21,8 @@
 #include "xfs_rmap.h"
 #include "xfs_ag.h"
 
+struct kmem_cache	*xfs_refcntbt_cur_zone;
+
 static struct xfs_btree_cur *
 xfs_refcountbt_dup_cursor(
 	struct xfs_btree_cur	*cur)
@@ -322,7 +324,7 @@ xfs_refcountbt_init_common(
 
 	ASSERT(pag->pag_agno < mp->m_sb.sb_agcount);
 
-	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
+	cur = kmem_cache_zalloc(xfs_refcntbt_cur_zone, GFP_NOFS | __GFP_NOFAIL);
 	cur->bc_tp = tp;
 	cur->bc_mp = mp;
 	cur->bc_btnum = XFS_BTNUM_REFC;
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index b7dbbfb3aeed..cb6e64f6d8f9 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -22,6 +22,8 @@
 #include "xfs_ag.h"
 #include "xfs_ag_resv.h"
 
+struct kmem_cache	*xfs_rmapbt_cur_zone;
+
 /*
  * Reverse map btree.
  *
@@ -451,7 +453,7 @@ xfs_rmapbt_init_common(
 {
 	struct xfs_btree_cur	*cur;
 
-	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
+	cur = kmem_cache_zalloc(xfs_rmapbt_cur_zone, GFP_NOFS | __GFP_NOFAIL);
 	cur->bc_tp = tp;
 	cur->bc_mp = mp;
 	/* Overlapping btree; 2 keys per pointer. */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 90716b9d6e5f..3f97dc1b41e0 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1965,10 +1965,24 @@ xfs_init_zones(void)
 	if (!xfs_bmap_free_item_zone)
 		goto out_destroy_log_ticket_zone;
 
-	xfs_btree_cur_zone = kmem_cache_create("xfs_btree_cur",
+	xfs_allocbt_cur_zone = kmem_cache_create("xfs_allocbt_cur",
 					       sizeof(struct xfs_btree_cur),
 					       0, 0, NULL);
-	if (!xfs_btree_cur_zone)
+	xfs_inobt_cur_zone = kmem_cache_create("xfs_inobt_cur",
+					       sizeof(struct xfs_btree_cur),
+					       0, 0, NULL);
+	xfs_bmbt_cur_zone = kmem_cache_create("xfs_bmbt_cur",
+					       sizeof(struct xfs_btree_cur),
+					       0, 0, NULL);
+	xfs_rmapbt_cur_zone = kmem_cache_create("xfs_rmapbt_cur",
+					       sizeof(struct xfs_btree_cur),
+					       0, 0, NULL);
+	xfs_refcntbt_cur_zone = kmem_cache_create("xfs_refcnt_cur",
+					       sizeof(struct xfs_btree_cur),
+					       0, 0, NULL);
+	if (!xfs_allocbt_cur_zone || !xfs_inobt_cur_zone ||
+	    !xfs_bmbt_cur_zone || !xfs_rmapbt_cur_zone ||
+	    !xfs_refcntbt_cur_zone)
 		goto out_destroy_bmap_free_item_zone;
 
 	xfs_da_state_zone = kmem_cache_create("xfs_da_state",
@@ -2106,7 +2120,11 @@ xfs_init_zones(void)
  out_destroy_da_state_zone:
 	kmem_cache_destroy(xfs_da_state_zone);
  out_destroy_btree_cur_zone:
-	kmem_cache_destroy(xfs_btree_cur_zone);
+	kmem_cache_destroy(xfs_allocbt_cur_zone);
+	kmem_cache_destroy(xfs_inobt_cur_zone);
+	kmem_cache_destroy(xfs_bmbt_cur_zone);
+	kmem_cache_destroy(xfs_rmapbt_cur_zone);
+	kmem_cache_destroy(xfs_refcntbt_cur_zone);
  out_destroy_bmap_free_item_zone:
 	kmem_cache_destroy(xfs_bmap_free_item_zone);
  out_destroy_log_ticket_zone:
@@ -2138,7 +2156,11 @@ xfs_destroy_zones(void)
 	kmem_cache_destroy(xfs_trans_zone);
 	kmem_cache_destroy(xfs_ifork_zone);
 	kmem_cache_destroy(xfs_da_state_zone);
-	kmem_cache_destroy(xfs_btree_cur_zone);
+	kmem_cache_destroy(xfs_allocbt_cur_zone);
+	kmem_cache_destroy(xfs_inobt_cur_zone);
+	kmem_cache_destroy(xfs_bmbt_cur_zone);
+	kmem_cache_destroy(xfs_rmapbt_cur_zone);
+	kmem_cache_destroy(xfs_refcntbt_cur_zone);
 	kmem_cache_destroy(xfs_bmap_free_item_zone);
 	kmem_cache_destroy(xfs_log_ticket_zone);
 }


