On Tue, Nov 28, 2023 at 09:56:59PM -0800, Christoph Hellwig wrote: > On Tue, Nov 28, 2023 at 01:13:58PM -0800, Darrick J. Wong wrote: > > > - xrep_abt_find_freespace accounts the old allocbt blocks as freespace > > > per the comment, although as far as I understand the code in > > > xrep_abt_walk_rmap the allocbt blocks aren't actually added to the > > > free_records xfarray, but recorded into the old_allocbt_blocks > > > bitmap (btw, why are we using different data structures for them?) > > > > The old free space btree blocks are tracked via old_allocbt_blocks so > > that we can reap the space after committing the new btree roots. > > Yes, I got that from the code. But I'm a bit confused about the > comment treating the old allocbt blocks as free space, while the code > doesn't. Or I guess the confusion is that we really have two slightly > different notions of "free space": > > 1) the space we try to build the trees for > 2) the space used as free space to us for the trees > > The old allocbt blocks are part of 1 but not of 2. Yeah. The blocks for #2 only come from the gaps in the rmapbt records that we find during the scan. Part of the confusion here might be the naming -- at the end of the rmapbt scan, old_allocbt_blocks is set to all the blocks that the rmapbt says are OWN_AG. Each rmapbt block encountered during the scan is added to not_allocbt_blocks, and afterwards each block on the AGFL is also added to not_allocbt_blocks. After that is where the definitions of old_allocbt_blocks shifts. This expression will help us to identify possible former bnobt/cntbt blocks: (OWN_AG blocks) & ~(rmapbt blocks | agfl blocks); Substituting from above definitions, that becomes: old_allocbt_blocks & ~not_allocbt_blocks The OWN_AG bitmap itself isn't needed after this point, so what we really do instead is: old_allocbt_blocks &= ~not_allocbt_blocks; IOWs, after this point, "old_allocbt_blocks" is a bitmap of alleged former bnobt/cntbt blocks. The xagb_bitmap_disunion operation modifies its first parameter in place to avoid copying records around. The gaps (aka the new freespace records) are stashed in the free_records xfarray. Next, some of the @free_records are diverted to the newbt reservation and used to format new btree blocks. I hope that makes things clearer? > > > - what happens if the AG is so full and fragmented that we do not > > > have space to create a second set of allocbts without tearing down > > > the old ones first? > > > > xrep_abt_reserve_space returns -ENOSPC, so we throw away all the incore > > records and throw the errno out to userspace. Across all the btree > > rebuilding code, the block reservation step happens as the very last > > thing before we start writing btree blocks, so it's still possible to > > back out cleanly. > > But that means we can't repair the allocation btrees for this case. Yep. At least we revert cleanly here -- when that happens to xfs_repair it just blows up and leaves a dead fs. :/ Practically speaking, the rmapbt reservations are large enough that we can rebuild the trees if we have to, even if that just means stealing from the per-AG space reservations later. Though you could create the fs from hell by using reflink and cow to fragmenting the rmapbt until it swells up and consumes the entire AG. That would require billions of fragments; even with my bmap inflation xfs_db commands that still takes about 2 days to get the rmapbt to the point of overflowing the regular reservation. I tried pushing it to eat an entire AG but then the ILOM exploded and no idea what happened to that machine and its 512G of RAM. The lab people kind of hate me now. > > > > + /* We require the rmapbt to rebuild anything. */ > > > > + if (!xfs_has_rmapbt(mp)) > > > > + return -EOPNOTSUPP; > > > > > > Shoudn't this be checked by the .has callback in struct xchk_meta_ops? > > > > No. Checking doesn't require the rmapbt because all we do is walk the > > bnobt/cntbt records and cross-reference them with whatever metadata we > > can find. > > Oh, and .has applies to both checking and repairing. Gt it. Correct. > > This wrapper simplifes the interface to xrep_newbt_add_blocks so that > > external callers don't have to know which magic values of xfs_alloc_arg > > are actually used by xrep_newbt_add_blocks and therefore need to be set. > > > > For all the other repair functions, they have to allocate space from the > > free space btree, so xrep_newbt_add_blocks is passed the full > > _alloc_args as returned by the allocator to xrep_newbt_alloc_ag_blocks. > > Ok. Maybe throw in a comment that it just is a convenience wrapper? Will do. --D