Hi David,

Thanks for the response.

On Thu, Jan 06, 2022 at 12:01:23PM +1100, Dave Chinner wrote:
> On Tue, Jan 04, 2022 at 11:10:52PM -0800, Krister Johansen wrote:
> > WARNING: CPU: 4 PID: 115756 at fs/xfs/libxfs/xfs_bmap.c:716 xfs_bmap_extents_to_btree+0x3dc/0x610 [xfs]
> > RIP: 0010:xfs_bmap_extents_to_btree+0x3dc/0x610 [xfs]
> > Call Trace:
> >  xfs_bmap_add_extent_hole_real+0x7d9/0x8f0 [xfs]
> >  xfs_bmapi_allocate+0x2a8/0x2d0 [xfs]
> >  xfs_bmapi_write+0x3a9/0x5f0 [xfs]
> >  xfs_iomap_write_direct+0x293/0x3c0 [xfs]
> >  xfs_file_iomap_begin+0x4d2/0x5c0 [xfs]
> >  iomap_apply+0x68/0x160
> >  iomap_dio_rw+0x2c1/0x450
> >  xfs_file_dio_aio_write+0x103/0x2e0 [xfs]
> >  xfs_file_write_iter+0x99/0xe0 [xfs]
> >  new_sync_write+0x125/0x1c0
> >  __vfs_write+0x29/0x40
> >  vfs_write+0xb9/0x1a0
> >  ksys_write+0x67/0xe0
> >  __x64_sys_write+0x1a/0x20
> >  do_syscall_64+0x57/0x190
> >  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Which is indicative of a multi-extent allocation (i.e. data extent +
> BMBT indexing btree block allocations) not selecting an AG that has
> enough space for both data and BMBT blocks to be allocated.
>
> That's likely because args.total was not set correctly before AG
> selection was made. This exact sort of bug can be seen in the fix
> for recent error injection debug code in commit 6e8bd39d7227 ("xfs:
> Initialize xfs_alloc_arg->total correctly when allocating minlen
> extents"). You're not running a CONFIG_XFS_DEBUG=y kernel are you?

I am not running with CONFIG_XFS_DEBUG, no.

> > while the rest of the I/O on this filesystem goes through
> > the page cache and uses the delayed allocation mechanism by default.
> > (IOW, t_firstblock is NULLFSBLOCK most of the time.)
>
> Has nothing to do with delayed allocation - the physical allocation
> that occurs in writeback will do exactly the same thing as direct IO
> allocation....

It must be similar, but not identical.  Though I think you're right
that I've confused myself about what's happening here.  The dd process
that is triggering this warning accounts for an incredibly small
fraction of the I/O on this filesystem, yet it accounts for every
instance of the warning that I can find.

I looked at xfs_bmapi_convert_delalloc and saw that xfs_trans_alloc
allocated a new transaction but didn't touch t_firstblock.  What I
neglected to notice, though, is that xfs_bmap_btalloc and
xfs_bmbt_alloc_block can both modify t_firstblock, and that this code
path reaches xfs_bmbt_alloc_block via the xfs_btree_insert calls in
xfs_bmap_add_extent_delay_real, and xfs_bmap_btalloc via the
xfs_bmap_alloc path.

I missed the difference in the alloc_args accounting between the
delayed and real allocation paths, and assumed that t_firstblock was
what made the difference between the buffered and direct I/O cases.
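To check that I'm now reading your args.total point correctly, the
sketch below is how I understand the accounting is supposed to work.
It's deliberately simplified pseudo-kernel C; the struct and function
names are made up for illustration and none of it is the actual
allocator code:

	#include <stdbool.h>

	/*
	 * Illustrative only -- made-up names, not the real XFS allocator.
	 * The point, as I understand it, is that AG selection should be
	 * judged against args.total (data extent plus worst-case BMBT
	 * blocks), not just against the length of the data extent being
	 * allocated right now.
	 */
	struct fake_alloc_args {
		unsigned int	minlen;		/* data extent blocks wanted now */
		unsigned int	total;		/* data + worst-case BMBT blocks */
	};

	static bool fake_ag_has_room(unsigned int ag_free_blocks,
				     const struct fake_alloc_args *args)
	{
		unsigned int need = args->total ? args->total : args->minlen;

		/*
		 * If a caller leaves total at 0, a nearly full AG can pass
		 * this check on minlen alone; the follow-up BMBT block
		 * allocation then has nowhere to go and trips the WARN in
		 * xfs_bmap_extents_to_btree().
		 */
		return ag_free_blocks >= need;
	}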
> > However, linux is using 64-bit block
> > pointers in the inode now and the XFS_ALLOCTYPE_START_BNO case in
> > xfs_alloc_vextent() seems to try to ensure that it never considers an AG
> > that's less than the agno for the fsbno passed in via args.
>
> Because otherwise allocations ABBA deadlock on AG locking.

Sorry, what I'm really trying to ask is: are there still cases in XFS
where using XFS_ALLOCTYPE_START_BNO can give you this kind of deadlock?
There's some deadlock avoidance in the START_BNO implementation, but
there are plenty of places where XFS_ALLOCTYPE_NEAR_BNO is still used
if t_firstblock is not NULLFSBLOCK.  I'm trying to work out whether
that code still exists because of concerns about deadlocks, or whether
it's an attempt to limit the number of AGs searched.  (Or some other
policy choice, even.)

> > be a reasonable way to address the WARN?  Or does this open a box of
> > problems that obvious to the experienced, but just subtle enough to
> > elude the unfamiliar?
>
> No, yes, and yes.

Thank you for humoring my questions nonetheless.

> > The xfs_db freesp report after the problem occurred.  (N.B. it was a few
> > hours before I was able to get to this machine to investigate)
> >
> > xfs_db -r -c 'freesp -a 47 -s' /dev/mapper/db-vol
> >    from      to extents  blocks    pct
> >       1       1      48      48   0.00
> >       2       3     119     303   0.02
> >       4       7      46     250   0.01
> >       8      15      22     255   0.01
> >      16      31      17     374   0.02
> >      32      63      16     728   0.04
> >      64     127       9     997   0.05
> >     128     255     149   34271   1.83
> >     256     511       7    2241   0.12
> >     512    1023       4    2284   0.12
> >    1024    2047       1    1280   0.07
> >    2048    4095       1    3452   0.18
> > 1048576 2097151       1 1825182  97.52
> > total free extents 440
> > total free blocks 1871665
> > average free extent size 4253.78
>
> So 1,871,665 of 228,849,020 blocks free in the AG. That's 99.2%
> full, so it's extremely likely you are hitting a full AG condition.
>
> /me goes and looks at xfs_iomap_write_direct()....
>
> .... and notices that it passes "0" as the total allocation block
> count, which means it isn't reserving space in the AG for both the
> data extent and the BMBT blocks...
>
> ... and several other xfs_bmapi_write() callers have the same
> issue...
>
> Ok, let me spend a bit more looking into this in more depth, but it
> looks like the problem is at the xfs_bmapi_write() caller level, not
> deep in the allocator itself.

At least on 5.4, xfs_bmapi_write is still being passed resblks rather
than zero; resblks is computed in xfs_iomap_write_direct.

Related to your comment about alloc_args.total, I did a bit of tracing
of the xfs_alloc tracepoints on my system and found that total does
seem to be getting set in both cases, but that a) it's actually a
larger value for direct I/O; and b) in the buffered write case the code
is requesting more blocks at a time, which causes a larger allocation
to occur.  I'm not certain, but I wondered whether this could be
causing us to select an AG with more space by luck.
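For reference, the 5.4 reservation looks roughly like the sketch below
to me.  This is a paraphrase with made-up helper names, not the kernel
source; the real pieces are xfs_aligned_fsb_count(),
XFS_DIOSTRAT_SPACE_RES(), and the total argument to xfs_bmapi_write():

	/*
	 * Rough paraphrase of the 5.4 direct I/O reservation, with made-up
	 * names -- not the real code.  In the kernel, resblks is
	 * XFS_DIOSTRAT_SPACE_RES(mp, resaligned), i.e. the aligned data
	 * blocks plus a worst-case data fork BMBT expansion, and that
	 * value is what gets handed to xfs_bmapi_write() as 'total'.
	 */
	static unsigned int fake_bmbt_expand_res(void)
	{
		/* stand-in for the per-extent BMBT expansion reservation */
		return 8;
	}

	static unsigned int fake_dio_total(unsigned int aligned_data_blocks)
	{
		return aligned_data_blocks + fake_bmbt_expand_res();
	}

With that in mind, here are the two traces I captured: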
directio:

 dd-102229 [005] .... 4969662.383215: xfs_alloc_exact_done: dev 253:1 agno 0 agbno 14240 minlen 4 maxlen 4 mod 0 prod 1 minleft 2 total 8 alignment 1 minalignslop 3 len 4 type THIS_BNO otype THIS_BNO wasdel 0 wasfromfl 0 resv 0 datatype 0x9 firstblock 0xffffffffffffffff
 dd-102229 [005] .... 4969662.383217: <stack trace>
 => xfs_alloc_ag_vextent_exact+0x2b4/0x2d0 [xfs]
 => xfs_alloc_ag_vextent+0x13b/0x150 [xfs]
 => xfs_alloc_vextent+0x2bc/0x550 [xfs]
 => xfs_bmap_btalloc+0x461/0x940 [xfs]
 => xfs_bmap_alloc+0x34/0x40 [xfs]
 => xfs_bmapi_allocate+0xdc/0x2d0 [xfs]
 => xfs_bmapi_write+0x3a9/0x5f0 [xfs]
 => xfs_iomap_write_direct+0x293/0x3c0 [xfs]
 => xfs_file_iomap_begin+0x4d2/0x5c0 [xfs]
 => iomap_apply+0x68/0x160
 => iomap_dio_rw+0x2c1/0x450
 => xfs_file_dio_aio_write+0x103/0x2e0 [xfs]
 => xfs_file_write_iter+0x99/0xe0 [xfs]
 => new_sync_write+0x125/0x1c0
 => __vfs_write+0x29/0x40
 => vfs_write+0xb9/0x1a0
 => ksys_write+0x67/0xe0
 => __x64_sys_write+0x1a/0x20
 => do_syscall_64+0x57/0x190
 => entry_SYSCALL_64_after_hwframe+0x44/0xa9

buffered write + fsync:

 dd-109921 [010] .... 4972814.844429: xfs_alloc_exact_done: dev 253:1 agno 0 agbno 21280 minlen 16 maxlen 16 mod 0 prod 1 minleft 2 total 4 alignment 1 minalignslop 3 len 16 type THIS_BNO otype THIS_BNO wasdel 1 wasfromfl 0 resv 0 datatype 0x9 firstblock 0xffffffffffffffff
 dd-109921 [010] .... 4972814.844433: <stack trace>
 => xfs_alloc_ag_vextent_exact+0x2b4/0x2d0 [xfs]
 => xfs_alloc_ag_vextent+0x13b/0x150 [xfs]
 => xfs_alloc_vextent+0x2bc/0x550 [xfs]
 => xfs_bmap_btalloc+0x461/0x940 [xfs]
 => xfs_bmap_alloc+0x34/0x40 [xfs]
 => xfs_bmapi_allocate+0xdc/0x2d0 [xfs]
 => xfs_bmapi_convert_delalloc+0x26f/0x4b0 [xfs]
 => xfs_map_blocks+0x15a/0x3f0 [xfs]
 => xfs_do_writepage+0x118/0x420 [xfs]
 => write_cache_pages+0x1ae/0x4b0
 => xfs_vm_writepages+0x6a/0xa0 [xfs]
 => do_writepages+0x43/0xd0
 => __filemap_fdatawrite_range+0xd5/0x110
 => file_write_and_wait_range+0x74/0xc0
 => xfs_file_fsync+0x5d/0x230 [xfs]
 => vfs_fsync_range+0x49/0x80
 => do_fsync+0x3d/0x70
 => __x64_sys_fsync+0x14/0x20
 => do_syscall_64+0x57/0x190
 => entry_SYSCALL_64_after_hwframe+0x44/0xa9

I'm not sure whether this is useful, but if there's something else
that you'd like me to instrument that would be, let me know and I'll
see what I can pull together.

Thanks again,

-K