On Tue, 01 Dec 2020 08:51:14 +1100, Dave Chinner wrote: > On Mon, Nov 30, 2020 at 02:35:21PM +0530, Chandan Babu R wrote: > > The comment in xfs_rmapbt_compute_maxlevels() mentions that with > > reflink enabled, XFS will run out of AG blocks before reaching maximum > > levels of XFS_BTREE_MAXLEVELS (i.e. 9). This is easy to prove for 4k > > block size case: > > > > Considering theoretical limits, maximum height of rmapbt can be, > > max btree height = Log_(min_recs)(total recs) > > max_rmapbt_height = Log_45(2^64) = 12. > > I think the use of mirecs here is wrong. We can continue to fill > both leaves and nodes above minrecs once we get to maximal height. > When a leaf/node is full, it will attempt to shift records left and > right to sibling nodes before trying to split. Hence a split only > occurs when all three nodes are completely full. > > Hence we'll end up with a much higher average population of leaves > and nodes than the minimum when we are at maximum height, especially > at the upper levels of the btree near the root. > > i.e. we won't try to split the root ever until the root is at > maximum capacity, and we won't try to split the next level down > (before we get to a root split) until at least 3 adjacent nodes > are at max capacity, and so on. Hence at higher levels, the tree is > always going to tend towards max capacity nodes, not min capacity. > That changes these calculations quite a lot... > > > Detailed calculation: > > nr-levels = 1; nr-leaf-blks = 2^64 / 84 = 2e17; > > nr-levels = 2; nr-blks = 2e17 / 45 = 5e15; > > nr-levels = 3; nr-blks = 5e15 / 45 = 1e14; > > nr-levels = 4; nr-blks = 1e14 / 45 = 2e12; > > nr-levels = 5; nr-blks = 2e12 / 45 = 5e10; > > nr-levels = 6; nr-blks = 5e10 / 45 = 1e9; > > nr-levels = 7; nr-blks = 1e9 / 45 = 3e7; > > nr-levels = 8; nr-blks = 3e7 / 45 = 6e5; > > nr-levels = 9; nr-blks = 6e5 / 45 = 1e4; > > nr-levels = 10; nr-blks = 1e4 / 45 = 3e2; > > nr-levels = 11; nr-blks = 3e2 / 45 = 6; > > nr-levels = 12; nr-blks = 1; > > > > Total number of blocks = 2e17 > > > > Here, 84 is the minimum number of leaf records and 45 is the minimum > > number of node records in the rmapbt when using 4k block size. 2^64 is > > the maximum possible rmapbt records > > (i.e. max_rmap_entries_per_disk_block (2^32) * max_nr_agblocks > > (2^32)). > > > > i.e. theoretically rmapbt height can go upto 12. > > Yes, but if the rmapbt contains 2^64 records, how many physical disk > blocks does it consume itself? Worst case that's 2^64 / 45, so > somewhere between 2^58 and 2^59 blocks of storage required for > indexing all those reocrds with a 4kB block size? > > Yet we only have 2^28 4kB blocks in an AG, and the limit of a > rmapbt is actually what can physically fit in an AG, yes? > > So, regardless of what the theoretical record capacity of a worse > case rmapbt record fill might be, it hits physical record storage > limits long before that. i.e. the limit on the btree height is > physical storage, not theoretical record indexing capability. > > In which case, let us [incorrectly, see later] assume we have a > handful of maximally shared single data blocks and the rest of the > 1TB AG is rmapbt. i.e. the maximum depth of the rmapbt is > determined by how much physical space it can consume, which is very > close to 2^32 blocks for 1kB filesystems. > > Let's work from max capacity, so 2^32 * 21 is the max record holding > capacity of the per-ag rmapbt. Hence the worst case index height > requirement for the rmapbt is indexing ~2^37 records. > > nr-levels = 1; nr-leaf-blks = 2^32 * 21 / 21 = 4e9 > nr-levels = 2; nr-blks = 4e9 / 21 = 2e8; > nr-levels = 3; nr-blks = 2e8 / 21 = 9e6; > nr-levels = 4; nr-blks = 9e6/ 21 = 4e5; > nr-levels = 5; nr-blks = 4e5 / 21 = 2e4; > nr-levels = 6; nr-blks = 2e4 / 21 = 1e3; > nr-levels = 7; nr-blks = 1e3 / 21 = 50 > nr-levels = 8; nr-blks = 50 / 21 = 3; > nr-levels = 9; nr-blks = 3 / 21 = 1; > nr-levels = 10; nr-blks = 1; > > So we *just* tick over into a 10 level rmapbt here in this worse > case where the rmapbt takes *all* of the AG space. In comparison, > min capacity lower nodes, max capacity higher nodes: > > nr-levels = 1; nr-leaf-blks = 2^32 * 11 / 11 = 4e9 > nr-levels = 2; nr-blks = 4e9 / 11 = 4e8; > nr-levels = 3; nr-blks = 4e8 / 11 = 4e7; > nr-levels = 4; nr-blks = 4e7/ 11 = 3e6; > nr-levels = 5; nr-blks = 3e6 / 21 = 2e5; > nr-levels = 6; nr-blks = 2e5 / 21 = 7e3; > nr-levels = 7; nr-blks = 7e3 / 21 = 348; > nr-levels = 8; nr-blks = 348 / 21 = 16; > nr-levels = 9; nr-blks = 16 / 21 = 1; > nr-levels = 10; nr-blks = 1; > > Same, it's a 10 level tree. > > > But as the comment in xfs_rmapbt_compute_maxlevels() suggests, we will > > run out of per-ag blocks trying to build an rmapbt of height > > XFS_BTREE_MAXLEVELS (i.e. 9). > > > > Since number of nodes grows as a geometric series, > > nr_nodes (roughly) = (45^9 - 1) / (45 - 1) = 10e12 > > > > i.e. 10e12 blocks > max ag blocks (2^32 == 4e9) > > > > However, with 1k block size we are not close to consuming all of 2^32 > > AG blocks as shown by the below calculations, > > > > - rmapbt with maximum of 9 levels will have roughly (11^9 - 1) / (11 - > > 1) = 2e8 blocks. > > 2.35e8, which is a quarter of the max AG space, assuming minimum > capacity nodes. Assuming max capacity nodes, we're at > 3.9e10 blocks, which is almost 40x the number of blocks in the AG... > > > - 11 is the minimum number of recs in a non-leaf node with 1k block size. > > - Also, Total number of records (roughly) = (nr_leaves * 11) = 11^8 * 11 > > = 2e9 (this value will be used later). > > > > - refcountbt > > - Maximum number of records theoretically = maximum number of blocks > > in an AG = 2^32 > > The logic here is flawed - you're assuming exclusive use of the AG > to hold *both* rmapbt records and shared data extents. Every block > used for an rmap record cannot have a refcount record pointing at > it because per-ag metadata cannot be shared. > > > - Total (leaves and non-leaf nodes) blocks required to hold 2^32 records > > Leaf min recs = 20; Node min recs = 60 (with 1k as the block size). > > - Detailed calculation: > > nr-levels = 1; nr-leaf-blks = 2^32 / 20 = 2e8; > > nr-levels = 2; nr-blks = 2e8 / 60 = 4e6 > > nr-levels = 3; nr-blks = 4e6 / 60 = 6e4 > > nr-levels = 4; nr-blks = 6e4 / 60 = 1.0e3 > > nr-levels = 5; nr-blks = 1.0e3 / 60 = 2e1 > > nr-levels = 6; nr-blks = 1 > > > > - Total block count = 2e8 > > So, if we assume that 20% of the AG space is refcount btree records > like this, then the rmapbt only consumes 80% of the AG, then the > rmap btree is much closer to the 9 level limit, especially if > we consider we'll tend towards max capacity nodes as we fill the > tree, not min capacity nodes. > > But the reality is that we could have a single refcount record with > 2^32 references, and that requires 2^32 rmapbt records. So if we are > talking about really high extent refcount situations, the worst case > is a handful of single blocks with billions of references. IOWs, the > refcountbt is a single block with ~2^5 entries in it, each which > track 2^32 references. Now we require 2^37 rmapbt records, and we've > overflowed the maximum physical storage capacity of the AG for > rmapbt blocks. Thanks for the detailed explanation. > > > - Bmbt (assuming all the rmapbt records have the same inode as owner) > > - Total (leaves and non-leaf nodes) blocks required to hold 2e9 records > > Leaf min recs = 29; Node min recs = 29 (with 1k as the block size). > > (2e9 is the maximum rmapbt records with rmapbt height 9 and 1k block size). > > nr-levels = 1; nr-leaf-blks = 2e9 / 29 = 7e7 > > nr-levels = 2; nr-blks = 7e7 / 29 = 2e6 > > nr-levels = 3; nr-blks = 2e6 / 29 = 8e4 > > nr-levels = 4; nr-blks = 8e4 / 29 = 3e3 > > nr-levels = 5; nr-blks = 3e3 / 29 = 1e2 > > nr-levels = 6; nr-blks = 1e2 / 29 = 3 > > nr-levels = 7; nr-blks = 1 > > > > - Total block count = 7e7 > > > > Total blocks used across rmapbt, refcountbt and bmbt = 2e8 + 2e8 + 7e7 = 5e8. > > BMBT blocks are not bound to an AG the way refcount and rmapbt > records are. They can be anywhere in the filesystem, so they play no > part in calculations like these. > > > Since 5e8 < 4e9(i.e. 2^32), we have not run out of blocks trying > > to build a rmapbt with XFS_BTREE_MAXLEVELS (i.e 9) levels. > > > > Please let me know if my understanding is incorrect. > > The worst case theory is largely sound, but theory != reality. > > The actual reality of someone with billions of refcount entries to a > a handful of single extents in a single AG is using a 1kB block size > is likely to be so rare that we shouldn't consider it a critical > design point. > > That is, if the values cover the default 4kB block size just fine, > then we should not penalise every common configuration just to > handle an extreme case in an extremely rare corner case for a > non-default configuration. Ok. > > > I have come across a "log reservation" calculation issue when > > increasing XFS_BTREE_MAXLEVELS to 10 which is in turn required for > > extending data fork extent count to 48 bits. > > What is that issue? Sorry, I should have described it when sending the original mail. Increasing the value of XFS_BTREE_MAXLEVELS to 10 caused the value of mp->m_rmap_maxlevels to be set to 10 when reflink feature is enabled. This in turn increased "min log size" calculations and hence a modified kernel (one with XFS_BTREE_MAXLEVELS set to 10) will fail to mount a filesystem which was created by mkfs.xfs which used 9 as the value for XFS_BTREE_MAXLEVELS for its "min log size" calculations. Of course, this problem is no longer valid because ... > > The BMBT modifications should not care about XFS_BTREE_MAXLEVELS as > a limit, we use the calculated mp->m_bm_maxlevels[] variables for > BMBT height. These are bound by the maximum number of extents the > tree can index, not the depth of btrees allowed within an AG. > Changing the size of the BMBT tree should not affect the per-ag > btrees in any way, nor the log reservations required to manipulate > the per-ag btrees... Ah, I didn't realize that XFS_BTREE_MAXLEVELS was applicable only to per-AG btrees. As mentioned in your response to Darrick, memory for "struct xfs_btree_cur" should be allocated based on the maximum height of the btree that the code is about to traverse. Thanks once again for providing an insight into this. -- chandan