On Thu, Oct 26, 2023 at 10:37:39AM +1100, Dave Chinner wrote:
> On Wed, Oct 25, 2023 at 03:39:09PM -0700, Darrick J. Wong wrote:
> > On Wed, Oct 25, 2023 at 01:03:00PM -0700, Omar Sandoval wrote:
> > > On Wed, Oct 25, 2023 at 08:50:46AM -0700, Darrick J. Wong wrote:
> > > > On Tue, Oct 24, 2023 at 10:14:07PM -0700, Omar Sandoval wrote:
> > > > > On Wed, Oct 25, 2023 at 03:35:17PM +1100, Dave Chinner wrote:
> > > > > > On Tue, Oct 24, 2023 at 04:37:33PM -0700, Omar Sandoval wrote:
> > > > > > > From: Omar Sandoval <osandov@xxxxxx>
> > > > > > >
> > > > > > > Fix it by adding an extra block of slack in the freelist for the
> > > > > > > potential leaf split in each of the bnobt and cntbt.
> > > >
> > > > I see how the cntbt can split because changing the length of a freespace
> > > > extent can require the record to move to a totally different part of the
> > > > cntbt.  The reinsertion causes the split.
> > > >
> > > > How does the bnobt split during a refresh of the AGFL?  Will we ever
> > > > allocate a block from the /middle/ of a free space extent?
> > >
> > > That's the case I was worried about for the bnobt. I see two ways that
> > > can happen:
> > >
> > > 1. alignment, prod, or mod requirements, which the freelist doesn't
> > >    have.
> > > 2. Busy extents. I don't know enough about XFS to say whether or not
> > >    this is applicable, but at first glance I don't see why not.
> >
> > Yes, I think it is possible -- we allocate an extent to fill the AGFL,
> > the beginning of that extent is still busy due to slow discard, so
> > xfs_alloc_compute_aligned gives us a block from the middle of the free
> > space.  Since AGFL filling has no particular alignment/prod/mod, any
> > number of blocks are good enough, so we end up taking the blocks from
> > the middle of the extent.
>
> *nod*
>
> That's exactly the conclusion I came to when I wondered how it was
> possible...
>
> > > > > > Hmmm. yeah - example given is 2->3 level split, hence only need to
> > > > > > handle a single level leaf split before path intersection occurs.
> > > > > > Yup, adding an extra block will make the exact problem being seen go
> > > > > > away.
> > > > > >
> > > > > > Ok, what's the general solution? 4-level, 5-level or larger trees?
> > > > > >
> > > > > > Worst split after a full split is up to the level below the old root
> > > > > > block. The root block was split, so it won't need a split again, so
> > > > > > only its children could split again. OK, so that's (height - 1)
> > > > > > needed for a max split to refill the AGFL after a full height split
> > > > > > occurred.
> > > > > >
> > > > > > Hence it looks like the minimum AGFL space for any given
> > > > > > allocation/free operation needs to be:
> > > > > >
> > > > > > 	(full height split reservation) + (AGFL refill split height)
> > > > > >
> > > > > > which is:
> > > > > >
> > > > > > 	= (new height) + (new height - 2)
> > > > > > 	= 2 * new height - 2
> > > > > > 	= 2 * (current height + 1) - 2
> > > > > > 	= 2 * current height
> > > > > >
> > > > > > Ok, so we essentially have to double the AGFL minimum size to handle
> > > > > > the generic case of AGFL refill splitting free space trees after a
> > > > > > transaction that has exhausted its AGFL reservation.
> > > > > >
> > > > > > Now, did I get that right?
> > > > >
> > > > > That sounds right, but I'll have to think about it more after some sleep
> > > > > :)
> > > >
> > > > I think that's correct, assuming that (2 * current_height) is the
> > > > per-btree calculation.
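
(Sanity-checking that arithmetic against the 2->3 level split example,
i.e. a free space btree currently at 2 levels:

	full height split reservation  = new height      = 3 blocks
	AGFL refill split height       = new height - 2  = 1 block
	total per btree                = 3 + 1           = 4 = 2 * current height

versus the current_height + 1 = 3 blocks per btree reserved today, which
is consistent with the freelist coming up one block short per btree.)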
> > >
> > > Agreed, except when the tree is already at the maximum level. In that case,
> > > the worst case is splitting up to but not including the root node twice,
> > > which is 2 * height - 2 (i.e., stopping before Dave substituted new
> > > height with current height + 1). So I think we want:
> > >
> > > 	min_free = min_t(unsigned int, levels[XFS_BTNUM_CNTi] + 1,
> > > 			 mp->m_alloc_maxlevels) * 2 - 2;
> > >
> > > > > Assuming that is correct, your LE search suggestion sounds kind of nice
> > > > > rather than such a drastic change to the AGFL size.
> > >
> > > Come to think of it, if there is nothing <= the desired size but there
> > > is something >, we have no choice but to do the split, so that doesn't
> > > work.
> >
> > Yeah, I was kind of wondering that myself.
>
> Ok, I'm glad to see there are enough eyes on this to point out the
> things I missed when musing about it as a possible solution...
>
> > > > The absolute maximum height of a free space btree is 7 blocks, according
> > > > to xfs_db:
> > > >
> > > > # xfs_db /dev/sda -c 'btheight -w absmax all'
> > > > bnobt: 7
> > > > cntbt: 7
> > > > inobt: 6
> > > > finobt: 6
> > > > bmapbt: 14
> > > > refcountbt: 6
> > > > rmapbt: 10
> > > >
> > > > The smallest AGFL is 62 records long (V4 fs, 512b blocks) so I don't
> > > > think it's /that/ drastic.  For a filesystem with rmapbt (V5, 1k blocks)
> > > > that minimum jumps to 121 blocks.
>
> Yup. Keep in mind this comment in xfs_alloc_fix_freelist():
>
>  * XXX (dgc): When we have lots of free space, does this buy us
>  * anything other than extra overhead when we need to put more blocks
>  * back on the free list? Maybe we should only do this when space is
>  * getting low or the AGFL is more than half full?
>
> IOWs, I've previously mused about keeping the AGFL much fuller than
> the absolute minimum. There doesn't seem to be any real gain to
> keeping it at just the minimum size when we are nowhere near ENOSPC,
> it just means we are doing lots of management operations like
> reducing it immediately after a transaction that has released blocks
> to the freelist, only to have to extend it again when we do an
> operation that consumes blocks from the free list.
>
> i.e. should we also add hysteresis to limit the number of AGFL
> size manipulations we need to do in mixed workloads? i.e. we don't free
> blocks from the AGFL until it might get overfilled by an operation
> (i.e. max size - max merge free blocks) or we are nearing ENOSPC,
> and we don't fill till we are at the minimum. In either case, the
> target for empty or fill operations is "half full"...

Hmm.  What if we kept the AGFL completely full?

On that 512b-fsblock V4 filesystem, that'd use up 128 blocks instead of
7 blocks.  128 blocks == 64K per AG.

On a 64K-fsblock V5 filesystem, that'd use up 16370 64K blocks per AG,
or ~1GB per AG.  Ok, that might be too much.

On a 4K-fsblock V5 fs, that's 1010 4K blocks per AG, or ~4MB per AG.

A gigabyte is a bit ridiculous, but what if we always did the maximum
bnobt/cntbt/rmapbt height instead of current_height + 1?  For that
4K-fsblock filesystem, that's 24 blocks per AG, or 96K.  Given that we
only support 300M+ filesystems now, that's 75M per AG.

I feel like we could go crazy and try to keep 640K in the AGFL.  640K
ought to be enough for anyone.  (Numbers chosen arbitrarily, of course.)

--D

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
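
To put concrete numbers on the capped doubling above, here's a rough
standalone sketch -- not the actual patch, and not kernel code:
min_uint(), btree_agfl_need(), bno_level, cnt_level, and maxlevels are
made-up stand-ins for min_t(), pag->pagf_levels[], and
mp->m_alloc_maxlevels -- of the per-AG minimum freelist length for the
two free space btrees:

/*
 * Standalone sketch of the reservation being discussed: each free space
 * btree needs 2 * current height blocks, capped at 2 * maxlevels - 2
 * once the tree can no longer grow in height.
 */
#include <stdio.h>

static unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Worst-case AGFL blocks one btree can consume in a single operation. */
static unsigned int btree_agfl_need(unsigned int level,
				    unsigned int maxlevels)
{
	return min_uint(level + 1, maxlevels) * 2 - 2;
}

/* Minimum freelist length for the bnobt and cntbt together. */
static unsigned int min_freelist(unsigned int bno_level,
				 unsigned int cnt_level,
				 unsigned int maxlevels)
{
	return btree_agfl_need(bno_level, maxlevels) +
	       btree_agfl_need(cnt_level, maxlevels);
}

int main(void)
{
	/* 2-level trees (the reported case), 7-level maximum height. */
	printf("%u\n", min_freelist(2, 2, 7));	/* 8, vs. 3 + 3 = 6 today */
	return 0;
}

For a tree already at the 7-level maximum that works out to
2 * 7 - 2 = 12 per btree, i.e. the capped expression rather than an
uncapped 2 * current height.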