On Sun, Mar 20, 2022 at 09:43:38AM -0700, Darrick J. Wong wrote: > From: Darrick J. Wong <djwong@xxxxxxxxxx> > > The comment for xfs_alloc_set_aside indicates that we want to set aside > enough space to handle a bmap btree split. The code, unfortunately, > hardcodes this to 4. > > This is incorrect, since file bmap btrees can be taller than that: > > xfs_db> btheight bmapbt -n 4294967295 -b 512 > bmapbt: worst case per 512-byte block: 13 records (leaf) / 13 keyptrs (node) > level 0: 4294967295 records, 330382100 blocks > level 1: 330382100 records, 25414008 blocks > level 2: 25414008 records, 1954924 blocks > level 3: 1954924 records, 150379 blocks > level 4: 150379 records, 11568 blocks > level 5: 11568 records, 890 blocks > level 6: 890 records, 69 blocks > level 7: 69 records, 6 blocks > level 8: 6 records, 1 block > 9 levels, 357913945 blocks total > > Or, for V5 filesystems: > > xfs_db> btheight bmapbt -n 4294967295 -b 1024 > bmapbt: worst case per 1024-byte block: 29 records (leaf) / 29 keyptrs (node) > level 0: 4294967295 records, 148102321 blocks > level 1: 148102321 records, 5106977 blocks > level 2: 5106977 records, 176103 blocks > level 3: 176103 records, 6073 blocks > level 4: 6073 records, 210 blocks > level 5: 210 records, 8 blocks > level 6: 8 records, 1 block > 7 levels, 153391693 blocks total > > Fix this by using the actual bmap btree maxlevel value for the > set-aside. We subtract one because the root is always in the inode and > hence never splits. > > Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx> > Reviewed-by: Brian Foster <bfoster@xxxxxxxxxx> > --- > fs/xfs/libxfs/xfs_alloc.c | 7 +++++-- > fs/xfs/libxfs/xfs_sb.c | 2 -- > fs/xfs/xfs_mount.c | 7 +++++++ > 3 files changed, 12 insertions(+), 4 deletions(-) > > > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c > index b0678e96ce61..747b3e45303f 100644 > --- a/fs/xfs/libxfs/xfs_alloc.c > +++ b/fs/xfs/libxfs/xfs_alloc.c > @@ -107,13 +107,16 @@ xfs_prealloc_blocks( > * aside a few blocks which will not be reserved in delayed allocation. > * > * For each AG, we need to reserve enough blocks to replenish a totally empty > - * AGFL and 4 more to handle a potential split of the file's bmap btree. > + * AGFL and enough to handle a potential split of a file's bmap btree. > */ > unsigned int > xfs_alloc_set_aside( > struct xfs_mount *mp) > { > - return mp->m_sb.sb_agcount * (XFS_ALLOCBT_AGFL_RESERVE + 4); > + unsigned int bmbt_splits; > + > + bmbt_splits = max(mp->m_bm_maxlevels[0], mp->m_bm_maxlevels[1]) - 1; > + return mp->m_sb.sb_agcount * (XFS_ALLOCBT_AGFL_RESERVE + bmbt_splits); > } So right now I'm trying to understand why this global space set aside ever needed to take into account the space used by a single BMBT split. ISTR it was done back in 2006 because the ag selection code, alloc args and/or xfs_alloc_space_available() didn't take into account the BMBT space via args->minleft correctly to ensure that the AGF we select had enough space in it for both the data extent and the followup BMBT split. Hence the original SET ASIDE (which wasn't per AG) was just 8 blocks - 4 for the AGFL, 4 for the BMBT. The transaction reservation takes into account the space needed by BMBT splits so we don't over-commit global space on user allocation anymore, args->minleft takes it into account so we don't overcommit AG space on extent allocation, and we have the reserved block pool to handle situations were delalloc extents are fragmented more than the initial indirect block reservation that is taken with the delalloc extent reservation. So where/why is this needed anymore? Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx