On Fri, Feb 07, 2025 at 05:18:48AM +0100, Christoph Hellwig wrote: > On Thu, Feb 06, 2025 at 12:50:21PM -0800, Darrick J. Wong wrote: > > On Thu, Feb 06, 2025 at 07:44:26AM +0100, Christoph Hellwig wrote: > > > Currently each metabtree inode has it's own space reservation to ensure > > > it can be expanded to the maximum size. This is not very efficient as it > > > requires a large number of blocks to be set aside that can't be used at > > > all by other btrees. > > > > > > Switch to a model that uses a global pool instead in preparation for > > > reducing the amount of reserved space. > > > > ...because inodes draw from the global pool, so there's no reason to > > compartmentalize each rt rmap btree's reservation like we do for per-AG > > rmap btrees, right? > > Yes. > > > > - to_resv = ip->i_meta_resv_asked - (ip->i_nblocks + ip->i_delayed_blks); > > > + to_resv = mp->m_metafile_resv_target - > > > + (mp->m_metafile_resv_used + mp->m_metafile_resv_avail); > > > > This separates accounting of the usage of the metadata btree (e.g. > > rtrmapbt) from i_blocks, which is a convenient way to fix an accounting > > bug if ever there's a metadir inode with a big enough xattr structure > > that it blows out into a real ondisk block. So far I don't think that > > happens in practice, but it's not impossible. > > > > Heh. I didn't think of that, but yes even if very theoretical at the > moment. > > > Clearly, the mapping of before to after is: > > > > i_meta_resv_asked -> m_metafile_resv_target > > i_nblocks -> m_metafile_resv_used > > i_delayed_blks -> m_metafile_resv_avail > > > > Makes sense. And I guess we still consider the reservations as "delayed > > allocations" in the sense that we subtract from fdblocks and put the > > same quantity into m_delalloc_blks. > > Yes. > > > > @@ -932,13 +932,17 @@ xrep_reap_metadir_fsblocks( > > > xfs_rmap_ino_bmbt_owner(&oinfo, sc->ip->i_ino, XFS_DATA_FORK); > > ... > > > > + /* > > > + * Resize the reservations so that we don't fail to expand the btree. > > > + */ > > > + return xfs_metafile_resv_init(sc->mp); > > > > I'm not sure this fixes the reservation correctly -- let's say we have a > > busted btree inode with a 100 fsblock reservation that was using 20 fsb. > > At mount, we'll have called xfs_dec_fdblocks for 100 - 20 = 80 fsb: > > > > hidden_space = target - used; > > xfs_dec_fdblocks(mp, hidden_space, true); > > > > Then we rebuild it with a more compact 18 fsb btree. Now we run > > xfs_metafile_resv_init and it thinks it needs to call xfs_dec_fdblocks > > for 100 - 18 = 82 fsb. But we already reserved 80 for this file, so the > > correct thing to do here is to xfs_dec_fdblocks 2fsb from fdblocks, not > > another 82. > > Hmm, yes. If only we had a good test for this :) You could populate an rt filesystem with a lot of small mappings, then use magic debug switches to xfs_repair to rebuild the rt btrees with half full leaf and node blocks, and then force online repair to rebuild it. Then it'll rebuild them with 75% full blocks. Come to think of it, IIRC online repair also has secret debug knobs for that kind of thing. --D