On Wed, Oct 25, 2023 at 08:50:46AM -0700, Darrick J. Wong wrote: > On Tue, Oct 24, 2023 at 10:14:07PM -0700, Omar Sandoval wrote: > > On Wed, Oct 25, 2023 at 03:35:17PM +1100, Dave Chinner wrote: > > > On Tue, Oct 24, 2023 at 04:37:33PM -0700, Omar Sandoval wrote: > > > > From: Omar Sandoval <osandov@xxxxxx> > > > > Fix it by adding an extra block of slack in the freelist for the > > > > potential leaf split in each of the bnobt and cntbt. > > I see how the cntbt can split because changing the length of a freespace > extent can require the record to move to a totally different part of the > cntbt. The reinsertion causes the split. > > How does the bnobt split during a refresh of the AGFL? Will we ever > allocate a block from the /middle/ of a free space extent? That's the case I was worried about for the bnobt. I see two ways that can happen: 1. alignment, prod, or mod requirements, which the freelist doesn't have. 2. Busy extents. I don't know enough about XFS to say whether or not this is applicable, but at first glance I don't see why not. > > > Hmmm. yeah - example given is 2->3 level split, hence only need to > > > handle a single level leaf split before path intersection occurs. > > > Yup, adding an extra block will make the exact problem being seen go > > > away. > > > > > > Ok, what's the general solution? 4-level, 5-level or larger trees? > > > > > > Worst split after a full split is up to the level below the old root > > > block. the root block was split, so it won't need a split again, so > > > only it's children could split again. OK, so that's (height - 1) > > > needed for a max split to refill the AGFL after a full height split > > > occurred. > > > > > > Hence it looks like the minimum AGFL space for any given > > > allocation/free operation needs to be: > > > > > > (full height split reservation) + (AGFL refill split height) > > > > > > which is: > > > > > > = (new height) + (new height - 2) > > > = 2 * new height - 2 > > > = 2 * (current height + 1) - 2 > > > = 2 * current height > > > > > > Ok, so we essentially have to double the AGFL minimum size to handle > > > the generic case of AGFL refill splitting free space trees after a > > > transaction that has exhausted it's AGFL reservation. > > > > > > Now, did I get that right? > > > > That sounds right, but I'll have to think about it more after some sleep > > :) > > I think that's correct, assuming that (2 * current_height) is the > per-btree calcluation. Agreed, except when the tree is already the maximum level. In that case, the worst case is splitting up to but not including the root node twice, which is 2 * height - 2 (i.e., stopping before Dave substituted new height with current height + 1). So I think we want: min_free = min_t(unsigned int, levels[XFS_BTNUM_CNTi] + 1, mp->m_alloc_maxlevels) * 2 - 2; > > Assuming that is correct, your LE search suggestion sounds kind of nice > > rather than such a drastic change to the AGFL size. Come to think of it, if there is nothing <= the desired size but there is something >, we have no choice but to do the split, so that doesn't work. > The absolute maximum height of a free space btree is 7 blocks, according > to xfs_db: > > # xfs_db /dev/sda -c 'btheight -w absmax all' > bnobt: 7 > cntbt: 7 > inobt: 6 > finobt: 6 > bmapbt: 14 > refcountbt: 6 > rmapbt: 10 > > The smallest AGFL is 62 records long (V4 fs, 512b blocks) so I don't > think it's /that/ drastic. For a filesystem with rmapbt (V5, 1k blocks) > that minimum jumps to 121 blocks. > > > > The rmapbt case will need this change, too, because rmapbt blocks > > > are allocated from the free list and so an rmapbt update can exhaust > > > the free list completely, too. > > > > Ah, I missed where the rmapbt is updated since it was slightly removed > > from the xfs_alloc_fixup_trees() code path I was looking at. > > The rmapbt has its own accounting tricks (XFS_AG_RESV_RMAPBT) to ensure > that there's always enough free space to refill the AGFL. Is that why > the testcase turns off rmapbt? Nope, I turned it off in my test case because with the rmapbt enabled, the freelist is larger. Therefore, for this bug to happen, the bnobt, cntbt, and rmapbt all need a maximum split at the same time. This is "easy" to do for just the bnobt and cntbt since they store the same records, but it's a lot harder to get the rmapbt in on it, too. Nevertheless, it seems possible, right? XFS_AG_RESV_RMAPBT only seems to guarantee that there is space in the AG to refill the AGFL, but that doesn't mean we won't need to split the rmapbt at an inopportune time?