On Wed, Mar 23, 2022 at 10:26:14PM -0700, Darrick J. Wong wrote: > On Thu, Mar 24, 2022 at 07:48:56AM +1100, Dave Chinner wrote: > > On Sun, Mar 20, 2022 at 09:43:38AM -0700, Darrick J. Wong wrote: > > > From: Darrick J. Wong <djwong@xxxxxxxxxx> > > > > > > The comment for xfs_alloc_set_aside indicates that we want to set aside > > > enough space to handle a bmap btree split. The code, unfortunately, > > > hardcodes this to 4. > > > > > > This is incorrect, since file bmap btrees can be taller than that: > > > > > > xfs_db> btheight bmapbt -n 4294967295 -b 512 > > > bmapbt: worst case per 512-byte block: 13 records (leaf) / 13 keyptrs (node) > > > level 0: 4294967295 records, 330382100 blocks > > > level 1: 330382100 records, 25414008 blocks > > > level 2: 25414008 records, 1954924 blocks > > > level 3: 1954924 records, 150379 blocks > > > level 4: 150379 records, 11568 blocks > > > level 5: 11568 records, 890 blocks > > > level 6: 890 records, 69 blocks > > > level 7: 69 records, 6 blocks > > > level 8: 6 records, 1 block > > > 9 levels, 357913945 blocks total > > > > > > Or, for V5 filesystems: > > > > > > xfs_db> btheight bmapbt -n 4294967295 -b 1024 > > > bmapbt: worst case per 1024-byte block: 29 records (leaf) / 29 keyptrs (node) > > > level 0: 4294967295 records, 148102321 blocks > > > level 1: 148102321 records, 5106977 blocks > > > level 2: 5106977 records, 176103 blocks > > > level 3: 176103 records, 6073 blocks > > > level 4: 6073 records, 210 blocks > > > level 5: 210 records, 8 blocks > > > level 6: 8 records, 1 block > > > 7 levels, 153391693 blocks total > > > > > > Fix this by using the actual bmap btree maxlevel value for the > > > set-aside. We subtract one because the root is always in the inode and > > > hence never splits. > > > > > > Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx> > > > Reviewed-by: Brian Foster <bfoster@xxxxxxxxxx> > > > --- > > > fs/xfs/libxfs/xfs_alloc.c | 7 +++++-- > > > fs/xfs/libxfs/xfs_sb.c | 2 -- > > > fs/xfs/xfs_mount.c | 7 +++++++ > > > 3 files changed, 12 insertions(+), 4 deletions(-) > > > > > > > > > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c > > > index b0678e96ce61..747b3e45303f 100644 > > > --- a/fs/xfs/libxfs/xfs_alloc.c > > > +++ b/fs/xfs/libxfs/xfs_alloc.c > > > @@ -107,13 +107,16 @@ xfs_prealloc_blocks( > > > * aside a few blocks which will not be reserved in delayed allocation. > > > * > > > * For each AG, we need to reserve enough blocks to replenish a totally empty > > > - * AGFL and 4 more to handle a potential split of the file's bmap btree. > > > + * AGFL and enough to handle a potential split of a file's bmap btree. > > > */ > > > unsigned int > > > xfs_alloc_set_aside( > > > struct xfs_mount *mp) > > > { > > > - return mp->m_sb.sb_agcount * (XFS_ALLOCBT_AGFL_RESERVE + 4); > > > + unsigned int bmbt_splits; > > > + > > > + bmbt_splits = max(mp->m_bm_maxlevels[0], mp->m_bm_maxlevels[1]) - 1; > > > + return mp->m_sb.sb_agcount * (XFS_ALLOCBT_AGFL_RESERVE + bmbt_splits); > > > } > > > > So right now I'm trying to understand why this global space set > > aside ever needed to take into account the space used by a single > > BMBT split. ISTR it was done back in 2006 because the ag selection > > code, alloc args and/or xfs_alloc_space_available() didn't take into > > account the BMBT space via args->minleft correctly to ensure that > > the AGF we select had enough space in it for both the data extent > > and the followup BMBT split. Hence the original SET ASIDE (which > > wasn't per AG) was just 8 blocks - 4 for the AGFL, 4 for the BMBT. > > > > The transaction reservation takes into account the space needed by > > BMBT splits so we don't over-commit global space on user allocation > > anymore, args->minleft takes it into account so we don't overcommit > > AG space on extent allocation, and we have the reserved block pool > > to handle situations were delalloc extents are fragmented more than > > the initial indirect block reservation that is taken with the > > delalloc extent reservation. > > > > So where/why is this needed anymore? > > Beats me. I started by reading the comment and thought "Gee, that's not > true of bmbts!" :( Yeah. Time and knowledge are showing through here. I've been running ENOSPC tests with the BMBT reservation part removed now for several hours and I've haven't seen any issues as a result removing it. I'll keep it running though to see if anything falls out, but my initial impression is that we don't obviously need it anymore. Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx