Re: Maximum height of rmapbt when reflink feature is enabled

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Nov 30, 2020 at 02:35:21PM +0530, Chandan Babu R wrote:
> The comment in xfs_rmapbt_compute_maxlevels() mentions that with
> reflink enabled, XFS will run out of AG blocks before reaching maximum
> levels of XFS_BTREE_MAXLEVELS (i.e. 9).  This is easy to prove for 4k
> block size case:
> 
> Considering theoretical limits, maximum height of rmapbt can be,
> max btree height = Log_(min_recs)(total recs)
> max_rmapbt_height = Log_45(2^64) = 12.

I think the use of mirecs here is wrong. We can continue to fill
both leaves and nodes above minrecs once we get to maximal height.
When a leaf/node is full, it will attempt to shift records left and
right to sibling nodes before trying to split. Hence a split only
occurs when all three nodes are completely full.

Hence we'll end up with a much higher average population of leaves
and nodes than the minimum when we are at maximum height, especially
at the upper levels of the btree near the root.

i.e. we won't try to split the root ever until the root is at
maximum capacity, and we won't try to split the next level down
(before we get to a root split) until at least 3 adjacent nodes
are at max capacity, and so on. Hence at higher levels, the tree is
always going to tend towards max capacity nodes, not min capacity.
That changes these calculations quite a lot...

> Detailed calculation:
> nr-levels = 1; nr-leaf-blks = 2^64 / 84 = 2e17;
> nr-levels = 2; nr-blks = 2e17 / 45 = 5e15;
> nr-levels = 3; nr-blks = 5e15 / 45 = 1e14;
> nr-levels = 4; nr-blks = 1e14 / 45 = 2e12;
> nr-levels = 5; nr-blks = 2e12 / 45 = 5e10;
> nr-levels = 6; nr-blks = 5e10 / 45 = 1e9;
> nr-levels = 7; nr-blks = 1e9 / 45 = 3e7;
> nr-levels = 8; nr-blks = 3e7 / 45 = 6e5;
> nr-levels = 9; nr-blks = 6e5 / 45 = 1e4;
> nr-levels = 10; nr-blks = 1e4 / 45 = 3e2;
> nr-levels = 11; nr-blks = 3e2 / 45 = 6;
> nr-levels = 12; nr-blks = 1;
> 
> Total number of blocks = 2e17
>
> Here, 84 is the minimum number of leaf records and 45 is the minimum
> number of node records in the rmapbt when using 4k block size. 2^64 is
> the maximum possible rmapbt records
> (i.e. max_rmap_entries_per_disk_block (2^32) * max_nr_agblocks
> (2^32)).
>
> i.e. theoretically rmapbt height can go upto 12.

Yes, but if the rmapbt contains 2^64 records, how many physical disk
blocks does it consume itself? Worst case that's 2^64 / 45, so
somewhere between 2^58 and 2^59 blocks of storage required for
indexing all those reocrds with a 4kB block size?

Yet we only have 2^28 4kB blocks in an AG, and the limit of a
rmapbt is actually what can physically fit in an AG, yes?

So, regardless of what the theoretical record capacity of a worse
case rmapbt record fill might be, it hits physical record storage
limits long before that. i.e. the limit on the btree height is
physical storage, not theoretical record indexing capability.

In which case, let us [incorrectly, see later] assume we have a
handful of maximally shared single data blocks and the rest of the
1TB AG is rmapbt. i.e.  the maximum depth of the rmapbt is
determined by how much physical space it can consume, which is very
close to 2^32 blocks for 1kB filesystems.

Let's work from max capacity, so 2^32 * 21 is the max record holding
capacity of the per-ag rmapbt.  Hence the worst case index height
requirement for the rmapbt is indexing ~2^37 records.

nr-levels = 1; nr-leaf-blks = 2^32 * 21 / 21 = 4e9
nr-levels = 2; nr-blks = 4e9 / 21 = 2e8;
nr-levels = 3; nr-blks = 2e8 / 21 = 9e6;
nr-levels = 4; nr-blks = 9e6/ 21 = 4e5;
nr-levels = 5; nr-blks = 4e5 / 21 = 2e4;
nr-levels = 6; nr-blks = 2e4 / 21 = 1e3;
nr-levels = 7; nr-blks = 1e3 / 21 = 50
nr-levels = 8; nr-blks = 50 / 21 = 3;
nr-levels = 9; nr-blks = 3 / 21 = 1;
nr-levels = 10; nr-blks = 1;

So we *just* tick over into a 10 level rmapbt here in this worse
case where the rmapbt takes *all* of the AG space.  In comparison,
min capacity lower nodes, max capacity higher nodes:

nr-levels = 1; nr-leaf-blks = 2^32 * 11 / 11 = 4e9
nr-levels = 2; nr-blks = 4e9 / 11 = 4e8;
nr-levels = 3; nr-blks = 4e8 / 11 = 4e7;
nr-levels = 4; nr-blks = 4e7/ 11 = 3e6;
nr-levels = 5; nr-blks = 3e6 / 21 = 2e5;
nr-levels = 6; nr-blks = 2e5 / 21 = 7e3;
nr-levels = 7; nr-blks = 7e3 / 21 = 348;
nr-levels = 8; nr-blks = 348 / 21 = 16;
nr-levels = 9; nr-blks = 16 / 21 = 1;
nr-levels = 10; nr-blks = 1;

Same, it's a 10 level tree.

> But as the comment in xfs_rmapbt_compute_maxlevels() suggests, we will
> run out of per-ag blocks trying to build an rmapbt of height
> XFS_BTREE_MAXLEVELS (i.e. 9).
> 
> Since number of nodes grows as a geometric series,
> nr_nodes (roughly) = (45^9 - 1) / (45 - 1) = 10e12
> 
> i.e. 10e12 blocks > max ag blocks (2^32 == 4e9)
> 
> However, with 1k block size we are not close to consuming all of 2^32
> AG blocks as shown by the below calculations,
> 
>  - rmapbt with maximum of 9 levels will have roughly (11^9 - 1) / (11 -
>    1) = 2e8 blocks.

2.35e8, which is a quarter of the max AG space, assuming minimum
capacity nodes. Assuming max capacity nodes, we're at
3.9e10 blocks, which is almost 40x the number of blocks in the AG...

>    - 11 is the minimum number of recs in a non-leaf node with 1k block size.
>    - Also, Total number of records (roughly) = (nr_leaves * 11) = 11^8 * 11
>      = 2e9 (this value will be used later).
>  
>  - refcountbt
>    - Maximum number of records theoretically = maximum number of blocks
>      in an AG = 2^32

The logic here is flawed - you're assuming exclusive use of the AG
to hold *both* rmapbt records and shared data extents. Every block
used for an rmap record cannot have a refcount record pointing at
it because per-ag metadata cannot be shared.

>    - Total (leaves and non-leaf nodes) blocks required to hold 2^32 records
>      Leaf min recs = 20;  Node min recs = 60 (with 1k as the block size).
>      - Detailed calculation:
>  	    nr-levels = 1; nr-leaf-blks = 2^32 / 20 = 2e8;
>  	    nr-levels = 2; nr-blks = 2e8 / 60 = 4e6
>  	    nr-levels = 3; nr-blks = 4e6 / 60 = 6e4
>  	    nr-levels = 4; nr-blks = 6e4 / 60 = 1.0e3
>  	    nr-levels = 5; nr-blks = 1.0e3 / 60 = 2e1
>  	    nr-levels = 6; nr-blks = 1
>  
>      - Total block count = 2e8

So, if we assume that 20% of the AG space is refcount btree records
like this, then the rmapbt only consumes 80% of the AG, then the
rmap btree is much closer to the 9 level limit, especially if
we consider we'll tend towards max capacity nodes as we fill the
tree, not min capacity nodes.

But the reality is that we could have a single refcount record with
2^32 references, and that requires 2^32 rmapbt records. So if we are
talking about really high extent refcount situations, the worst case
is a handful of single blocks with billions of references. IOWs, the
refcountbt is a single block with ~2^5 entries in it, each which
track 2^32 references. Now we require 2^37 rmapbt records, and we've
overflowed the maximum physical storage capacity of the AG for
rmapbt blocks.

>  - Bmbt (assuming all the rmapbt records have the same inode as owner)
>    - Total (leaves and non-leaf nodes) blocks required to hold 2e9 records
>      Leaf min recs = 29;  Node min recs = 29 (with 1k as the block size).
>      (2e9 is the maximum rmapbt records with rmapbt height 9 and 1k block size).
>        nr-levels = 1; nr-leaf-blks = 2e9 / 29 = 7e7
>        nr-levels = 2; nr-blks = 7e7 / 29 = 2e6
>        nr-levels = 3; nr-blks = 2e6 / 29 = 8e4
>        nr-levels = 4; nr-blks = 8e4 / 29 = 3e3
>        nr-levels = 5; nr-blks = 3e3 / 29 = 1e2
>        nr-levels = 6; nr-blks = 1e2 / 29 = 3
>        nr-levels = 7; nr-blks = 1
>  
>    - Total block count = 7e7
>  
>  Total blocks used across rmapbt, refcountbt and bmbt = 2e8 + 2e8 + 7e7 = 5e8.

BMBT blocks are not bound to an AG the way refcount and rmapbt
records are. They can be anywhere in the filesystem, so they play no
part in calculations like these.

> Since 5e8 < 4e9(i.e. 2^32), we have not run out of blocks trying
> to build a rmapbt with XFS_BTREE_MAXLEVELS (i.e 9) levels.
> 
> Please let me know if my understanding is incorrect.

The worst case theory is largely sound, but theory != reality.

The actual reality of someone with billions of refcount entries to a
a handful of single extents in a single AG is using a 1kB block size
is likely to be so rare that we shouldn't consider it a critical
design point.

That is, if the values cover the default 4kB block size just fine,
then we should not penalise every common configuration just to
handle an extreme case in an extremely rare corner case for a
non-default configuration.

> I have come across a "log reservation" calculation issue when
> increasing XFS_BTREE_MAXLEVELS to 10 which is in turn required for
> extending data fork extent count to 48 bits.

What is that issue?

The BMBT modifications should not care about XFS_BTREE_MAXLEVELS as
a limit, we use the calculated mp->m_bm_maxlevels[] variables for
BMBT height. These are bound by the maximum number of extents the
tree can index, not the depth of btrees allowed within an AG.
Changing the size of the BMBT tree should not affect the per-ag
btrees in any way, nor the log reservations required to manipulate
the per-ag btrees...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx



[Index of Archives]     [XFS Filesystem Development (older mail)]     [Linux Filesystem Development]     [Linux Audio Users]     [Yosemite Trails]     [Linux Kernel]     [Linux RAID]     [Linux SCSI]


  Powered by Linux