Hi Dave,

On Thu, Jan 06, 2022 at 12:01:23PM +1100, Dave Chinner wrote:
> On Tue, Jan 04, 2022 at 11:10:52PM -0800, Krister Johansen wrote:
> > Hi,
> > I've been running into occasional WARNs related to allocating a block to
> > hold the new btree that XFS is attempting to create when calling this
> > function.  The problem is sporadic -- once every 10-40 days and a
> > different system each time.
> >
> > The warning is:
> >
> > WARNING: CPU: 4 PID: 115756 at fs/xfs/libxfs/xfs_bmap.c:716 xfs_bmap_extents_to_btree+0x3dc/0x610 [xfs]
> > RIP: 0010:xfs_bmap_extents_to_btree+0x3dc/0x610 [xfs]
> > Call Trace:
> >  xfs_bmap_add_extent_hole_real+0x7d9/0x8f0 [xfs]
> >  xfs_bmapi_allocate+0x2a8/0x2d0 [xfs]
> >  xfs_bmapi_write+0x3a9/0x5f0 [xfs]
> >  xfs_iomap_write_direct+0x293/0x3c0 [xfs]
> >  xfs_file_iomap_begin+0x4d2/0x5c0 [xfs]
> >  iomap_apply+0x68/0x160
> >  iomap_dio_rw+0x2c1/0x450
> >  xfs_file_dio_aio_write+0x103/0x2e0 [xfs]
> >  xfs_file_write_iter+0x99/0xe0 [xfs]
> >  new_sync_write+0x125/0x1c0
> >  __vfs_write+0x29/0x40
> >  vfs_write+0xb9/0x1a0
> >  ksys_write+0x67/0xe0
> >  __x64_sys_write+0x1a/0x20
> >  do_syscall_64+0x57/0x190
> >  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> <snip>
>
> So 1,871,665 of 228,849,020 blocks free in the AG. That's 99.2%
> full, so it's extremely likely you are hitting a full AG condition.
>
> /me goes and looks at xfs_iomap_write_direct()....
>
> .... and notices that it passes "0" as the total allocation block
> count, which means it isn't reserving space in the AG for both the
> data extent and the BMBT blocks...
>
> ... and several other xfs_bmapi_write() callers have the same
> issue...
>
> Ok, let me spend a bit more looking into this in more depth, but it
> looks like the problem is at the xfs_bmapi_write() caller level, not
> deep in the allocator itself.

I noodled on this a bit more and have another hypothesis.  Feel free to
tell me that this one is just as nuts (or more).

However, after thinking through your comments about the accounting, and
reviewing some other patches and threads for similar problems:

https://lore.kernel.org/linux-xfs/20171127202434.43125-4-bfoster@xxxxxxxxxx/
https://lore.kernel.org/linux-xfs/20171207185810.48757-1-bfoster@xxxxxxxxxx/
https://lore.kernel.org/linux-xfs/20190327145000.10756-1-bfoster@xxxxxxxxxx/

I wondered if perhaps the problem was related to other problems in
xfs_alloc_fix_freelist.  Taking inspiration from some of the fixes that
Brian made here, it looks like there's a possibility of the freelist
refill code grabbing blocks that were assumed to be available by
previous checks in that function.

For example, using some values from a successful trace of a directio
allocation:

  dd-102227  [027] .... 4969662.381037: xfs_alloc_near_first: dev 25 3:1 agno 0 agbno 5924 minlen 4 maxlen 4 mod 0 prod 1 minleft 1 total 8 alignment 4 minalignslop 0 len 4 type NEAR_BNO otype START_BNO wasdel 0 wasfromfl 0 resv 0 datatype 0x9 firstblock 0xffffffffffffffff

  dd-102227  [027] .... 4969662.381047: xfs_alloc_near_first: dev 25 3:1 agno 0 agbno 5921 minlen 1 maxlen 1 mod 0 prod 1 minleft 0 total 0 alignment 1 minalignslop 0 len 1 type NEAR_BNO otype NEAR_BNO wasdel 0 wasfromfl 0 resv 0 datatype 0x0 firstblock 0x1724

[first is the bmap alloc, second is the extents_to_btree alloc]

if

  agflcount = min(pagf_flcount, min_free)
  agflcount = min(3, 8)

and

  available = pagf_freeblks + agflcount - reservation - min_free - minleft
  available = 14 + 3 - 0 - 8 - 1
  available = 8

which satisfies the total from the first allocation request; however, if
this code path needs to refill the freelists and the ag btree is full
because a lot of space is allocated and not much is free, then inserts
here may trigger rebalances.
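
For anyone who wants to check the arithmetic above, here is a minimal
user-space sketch of just that formula.  The struct and function names
(ag_state, ag_available, ag_space_ok) are made up for illustration and
do not exist in the kernel; the real check in
xfs_alloc_space_available() does more than this, so treat it as a model
of the numbers in this mail and nothing else.

```c
#include <assert.h>

/* Hypothetical stand-in for the two per-AG counters used in the formula. */
struct ag_state {
	long freeblks;	/* models pagf_freeblks */
	long flcount;	/* models pagf_flcount */
};

static long min_long(long a, long b)
{
	return a < b ? a : b;
}

/*
 * available = freeblks + min(flcount, min_free)
 *             - reservation - min_free - minleft
 */
static long ag_available(const struct ag_state *ag, long min_free,
			 long reservation, long minleft)
{
	long agflcount;

	assert(min_free >= 0 && reservation >= 0 && minleft >= 0);
	agflcount = min_long(ag->flcount, min_free);
	return ag->freeblks + agflcount - reservation - min_free - minleft;
}

/* Nonzero when the modelled AG can still cover the request's total. */
static int ag_space_ok(const struct ag_state *ag, long min_free,
		       long reservation, long minleft, long total)
{
	return ag_available(ag, min_free, reservation, minleft) >= total;
}
```

With freeblks = 14, flcount = 3, min_free = 8, reservation = 0 and
minleft = 1, ag_available() gives 8, which just meets total = 8 from the
first trace line.  One fewer free block and the same request would no
longer fit.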

Free-block usage during such a refill might look something like this:

  pagf_freeblks = 14
  allocate 5 blocks to fill freelist
  pagf_freeblks = 9
  fill of freelist triggers split that requires 4 nodes
  next iteration allocates 4 blocks to refill freelist
  pagf_freeblks = 5
  refill requires rebalance and another node
  next iteration allocates 1 block to refill freelist
  pagf_freeblks = 4
  freelist filled; return to caller
  caller consumes remaining 4 blocks for bmap allocation
  pagf_freeblks = 0
  no blocks available for xfs_bmap_extents_to_btree

I'm not sure if this is possible, but I thought I'd mention it since
Brian's prior work here got me thinking about it.

If this does sound plausible, what do you think about re-validating the
space_available conditions after refilling the freelist?  Something like:

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 353e53b..d235744 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2730,6 +2730,16 @@ xfs_alloc_fix_freelist(
 		}
 	}
 	xfs_trans_brelse(tp, agflbp);
+
+	/*
+	 * Freelist refill may have consumed blocks from pagf_freeblks.  Ensure
+	 * that this allocation still meets its requested constraints by
+	 * revalidating the min_freelist and space_available checks.
+	 */
+	need = xfs_alloc_min_freelist(mp, pag);
+	if (!xfs_alloc_space_available(args, need, flags))
+		goto out_agbp_relse;
+
 	args->agbp = agbp;
 	return 0;

perhaps?

Thanks again,

-K
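
A small user-space model of the refill sequence above may help when
reasoning about the proposed recheck.  Everything here is hypothetical:
struct ag_model, model_available() and model_refill_step() are invented
names, and the number of freelist blocks each btree split or rebalance
consumes is an assumption read off the walkthrough, not something
measured.  It also assumes min_free stays at 8 throughout.

```c
#include <assert.h>

/* Hypothetical model of the per-AG counters; not struct xfs_perag. */
struct ag_model {
	long freeblks;	/* stands in for pagf_freeblks */
	long flcount;	/* stands in for pagf_flcount */
};

/* Same formula as in the mail: clamp the freelist share to min_free. */
static long model_available(const struct ag_model *ag, long min_free,
			    long reservation, long minleft)
{
	long agflcount = ag->flcount < min_free ? ag->flcount : min_free;

	return ag->freeblks + agflcount - reservation - min_free - minleft;
}

/*
 * One refill iteration: move `take` blocks from free space onto the
 * freelist, then let the resulting free-space btree update consume
 * `used` blocks from the freelist (assumed split/rebalance cost).
 */
static void model_refill_step(struct ag_model *ag, long take, long used)
{
	assert(take >= 0 && take <= ag->freeblks);
	ag->freeblks -= take;
	ag->flcount += take;

	assert(used >= 0 && used <= ag->flcount);
	ag->flcount -= used;
}
```

Starting from freeblks = 14 and flcount = 3, the steps (5, 4), (4, 1)
and (1, 0) leave freeblks = 4 with a full freelist of 8.  Re-running the
check at that point gives available = 4 + 8 - 0 - 8 - 1 = 3, which is
below total = 8, so in this model a recheck after the refill would
reject the AG instead of letting the caller take the last 4 blocks.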