Re: [PATCH 10/43] xfs: make metabtree reservations global

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Thu, 6 Feb 2025 20:24:54 -0800

On Fri, Feb 07, 2025 at 05:18:48AM +0100, Christoph Hellwig wrote:
> On Thu, Feb 06, 2025 at 12:50:21PM -0800, Darrick J. Wong wrote:
> > On Thu, Feb 06, 2025 at 07:44:26AM +0100, Christoph Hellwig wrote:
> > > Currently each metabtree inode has it's own space reservation to ensure
> > > it can be expanded to the maximum size.  This is not very efficient as it
> > > requires a large number of blocks to be set aside that can't be used at
> > > all by other btrees.
> > > 
> > > Switch to a model that uses a global pool instead in preparation for
> > > reducing the amount of reserved space.
> > 
> > ...because inodes draw from the global pool, so there's no reason to
> > compartmentalize each rt rmap btree's reservation like we do for per-AG
> > rmap btrees, right?
> 
> Yes.
> 
> > > -	to_resv = ip->i_meta_resv_asked - (ip->i_nblocks + ip->i_delayed_blks);
> > > +	to_resv = mp->m_metafile_resv_target -
> > > +		(mp->m_metafile_resv_used + mp->m_metafile_resv_avail);
> > 
> > This separates accounting of the usage of the metadata btree (e.g.
> > rtrmapbt) from i_blocks, which is a convenient way to fix an accounting
> > bug if ever there's a metadir inode with a big enough xattr structure
> > that it blows out into a real ondisk block. So far I don't think that
> > happens in practice, but it's not impossible.
> > 
> 
> Heh.  I didn't think of that, but yes even if very theoretical at the
> moment.
> 
> > Clearly, the mapping of before to after is:
> > 
> > i_meta_resv_asked	-> m_metafile_resv_target
> > i_nblocks		-> m_metafile_resv_used
> > i_delayed_blks		-> m_metafile_resv_avail
> > 
> > Makes sense.  And I guess we still consider the reservations as "delayed
> > allocations" in the sense that we subtract from fdblocks and put the
> > same quantity into m_delalloc_blks.
> 
> Yes.
> 
> > > @@ -932,13 +932,17 @@ xrep_reap_metadir_fsblocks(
> > >  	xfs_rmap_ino_bmbt_owner(&oinfo, sc->ip->i_ino, XFS_DATA_FORK);
> 
> ...
> 
> > > +	/*
> > > +	 * Resize the reservations so that we don't fail to expand the btree.
> > > +	 */
> > > +	return xfs_metafile_resv_init(sc->mp);
> > 
> > I'm not sure this fixes the reservation correctly -- let's say we have a
> > busted btree inode with a 100 fsblock reservation that was using 20 fsb.
> > At mount, we'll have called xfs_dec_fdblocks for 100 - 20 = 80 fsb:
> > 
> > 	hidden_space = target - used;
> > 	xfs_dec_fdblocks(mp, hidden_space, true);
> > 
> > Then we rebuild it with a more compact 18 fsb btree.  Now we run
> > xfs_metafile_resv_init and it thinks it needs to call xfs_dec_fdblocks
> > for 100 - 18 = 82 fsb.  But we already reserved 80 for this file, so the
> > correct thing to do here is to xfs_dec_fdblocks 2fsb from fdblocks, not
> > another 82.
> 
> Hmm, yes.  If only we had a good test for this :)

You could populate an rt filesystem with a lot of small mappings, then
use magic debug switches to xfs_repair to rebuild the rt btrees with
half full leaf and node blocks, and then force online repair to rebuild
it.  Then it'll rebuild them with 75% full blocks.  Come to think of it,
IIRC online repair also has secret debug knobs for that kind of thing.

--D