Hi,

I've been running into occasional WARNs related to allocating a block to hold the new btree that XFS is attempting to create in xfs_bmap_extents_to_btree(). The problem is sporadic -- once every 10-40 days, and on a different system each time. The process that triggers it is dd punching a hole into a file via direct I/O. It does this as part of a watchdog that verifies the system remains able to issue read and write requests; the direct I/O is an attempt to keep this process from reading or writing cached data.

I'm hardly an expert, but after some digging it appears that the direct I/O path for this particular workload is more susceptible to the problem because its tp->t_firstblock is always set to a block in an existing AG, while the rest of the I/O on this filesystem goes through the page cache and uses the delayed allocation mechanism by default. (IOW, t_firstblock is NULLFSBLOCK most of the time. The relevant allocation-target selection is reproduced below, after the diffs, for reference.)

The version history and mailing list archives for this bit of code seem to indicate that this function hasn't had a ton of churn; the XFS_ALLOCTYPE_START_BNO vs XFS_ALLOCTYPE_NEAR_BNO logic appears largely unchanged since it was written in the 90s. I haven't yet been able to get a kdump to look into this WARN in more detail, but I was curious whether the rationale for using XFS_ALLOCTYPE_NEAR_BNO still holds for modern Linux-based XFS. One reason for keeping the bmap btree and the inode in the same AG might have been that, with 32-bit block pointers in an inode, there wouldn't be room to store both the AG and the block if the btree block were allocated in a different AG. There also seem to have been lock ordering concerns when iterating over multiple AGs. However, Linux uses 64-bit block pointers in the inode now, and the XFS_ALLOCTYPE_START_BNO case in xfs_alloc_vextent() seems to ensure that it never considers an AG lower than the agno of the fsbno passed in via args.

Would something like this:

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 4dccd4d90622..5d949ac1ecae 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -664,6 +664,13 @@ xfs_bmap_extents_to_btree(
         if (error)
                 goto out_root_realloc;
 
+        if (args.fsbno == NULLFSBLOCK && args.type == XFS_ALLOCTYPE_NEAR_BNO) {
+                args.type = XFS_ALLOCTYPE_START_BNO;
+                error = xfs_alloc_vextent(&args);
+                if (error)
+                        goto out_root_realloc;
+        }
+
         if (WARN_ON_ONCE(args.fsbno == NULLFSBLOCK)) {
                 error = -ENOSPC;
                 goto out_root_realloc;

Or this:

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 4dccd4d90622..94e4ecb75561 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -647,14 +647,10 @@ xfs_bmap_extents_to_btree(
         args.tp = tp;
         args.mp = mp;
         xfs_rmap_ino_bmbt_owner(&args.oinfo, ip->i_ino, whichfork);
+        args.type = XFS_ALLOCTYPE_START_BNO;
         if (tp->t_firstblock == NULLFSBLOCK) {
-                args.type = XFS_ALLOCTYPE_START_BNO;
                 args.fsbno = XFS_INO_TO_FSB(mp, ip->i_ino);
-        } else if (tp->t_flags & XFS_TRANS_LOWMODE) {
-                args.type = XFS_ALLOCTYPE_START_BNO;
-                args.fsbno = tp->t_firstblock;
         } else {
-                args.type = XFS_ALLOCTYPE_NEAR_BNO;
                 args.fsbno = tp->t_firstblock;
         }
         args.minlen = args.maxlen = args.prod = 1;

be a reasonable way to address the WARN? Or does this open a box of problems that are obvious to the experienced but just subtle enough to elude the unfamiliar? I ask because these filesystems are pretty busy on a day-to-day basis, and the path where t_firstblock is NULLFSBLOCK never hits this problem.
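In case it saves anyone a code lookup, here is the allocation-target selection as it stands today, reconstructed from the context lines of the second diff above. The comments are only my reading of which branch each workload on this box ends up in, so please correct me if I've got it wrong:

        /* xfs_bmap_extents_to_btree(), unpatched (reconstructed from the diff context) */
        if (tp->t_firstblock == NULLFSBLOCK) {
                /* Buffered/delalloc writes here land in this branch and may
                 * fall forward to other AGs. */
                args.type = XFS_ALLOCTYPE_START_BNO;
                args.fsbno = XFS_INO_TO_FSB(mp, ip->i_ino);
        } else if (tp->t_flags & XFS_TRANS_LOWMODE) {
                /* Low-space mode: also allowed to search beyond the first AG. */
                args.type = XFS_ALLOCTYPE_START_BNO;
                args.fsbno = tp->t_firstblock;
        } else {
                /*
                 * The direct I/O watchdog lands here: t_firstblock already
                 * points into an AG, and NEAR_BNO restricts the bmbt block
                 * to that single AG -- the case that trips the WARN for me.
                 */
                args.type = XFS_ALLOCTYPE_NEAR_BNO;
                args.fsbno = tp->t_firstblock;
        }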
The overall workload is a btree-based database: lots of random reads and writes to many files that all live in the same directory. While I don't have a full root cause analysis, the circumstantial evidence suggests that letting the allocation be satisfied by more than one AG would result in many fewer failures in xfs_bmap_extents_to_btree -- assuming, of course, that it's actually a safe and sane thing to do.

Many thanks,
-K

Including the additional diagnostic output requested by the FAQ, in case it's helpful:

Kernel version: various 5.4 LTS versions
xfsprogs version: 5.3.0
Number of CPUs: 48
Storage layout: 4x 7TB NVMe drives in a 28T LVM stripe with 16k width

xfs_info:
meta-data=/dev/mapper/db-vol     isize=512    agcount=32, agsize=228849020 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=7323168640, imaxpct=5
         =                       sunit=4      swidth=16 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=4 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
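(To save anyone the arithmetic, and assuming I have the units right: with bsize=4096, sunit=4 blks is a 16 KiB stripe unit and swidth=16 blks is a 64 KiB stripe width, i.e. 4 drives x 16 KiB, which matches the LVM stripe described above.)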
An example warning:

------------[ cut here ]------------
WARNING: CPU: 4 PID: 115756 at fs/xfs/libxfs/xfs_bmap.c:716 xfs_bmap_extents_to_btree+0x3dc/0x610 [xfs]
Modules linked in: btrfs xor zstd_compress raid6_pq ufs msdos softdog binfmt_misc udp_diag tcp_diag inet_diag xfs libcrc32c dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua p
CPU: 4 PID: 115756 Comm: dd Not tainted 5.4.139 #2
Hardware name: Amazon EC2 i3en.12xlarge/, BIOS 1.0 10/16/2017
RIP: 0010:xfs_bmap_extents_to_btree+0x3dc/0x610 [xfs]
Code: 00 00 8b 85 00 ff ff ff 48 8d 65 d8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 c7 45 9c 10 00 00 00 48 89 95 60 ff ff ff e9 5a fd ff ff <0f> 0b c7 85 00 ff ff ff e4 ff ff ff 8b 9d
RSP: 0018:ffffa948115fb740 EFLAGS: 00010246
RAX: ffffffffffffffff RBX: ffff8ffec2274048 RCX: 00000000001858ab
RDX: 000000000017e467 RSI: 0000000000000000 RDI: ffff904ea1726000
RBP: ffffa948115fb870 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000000 R11: 000000000017e467 R12: ffff8ff22bcdd6a8
R13: ffff904ea9f85000 R14: ffff8ffec2274000 R15: ffff904e476a4380
FS:  00007f5625e89580(0000) GS:ffff904ebb300000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055bd5952c000 CR3: 000000011d6f0004 CR4: 00000000007606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 ? xfs_alloc_update_counters.isra.0+0x3d/0x50 [xfs]
 ? xfs_trans_log_buf+0x30/0x80 [xfs]
 ? xfs_alloc_log_agf+0x73/0x100 [xfs]
 xfs_bmap_add_extent_hole_real+0x7d9/0x8f0 [xfs]
 xfs_bmapi_allocate+0x2a8/0x2d0 [xfs]
 ? kmem_zone_alloc+0x85/0x140 [xfs]
 xfs_bmapi_write+0x3a9/0x5f0 [xfs]
 xfs_iomap_write_direct+0x293/0x3c0 [xfs]
 xfs_file_iomap_begin+0x4d2/0x5c0 [xfs]
 iomap_apply+0x68/0x160
 ? iomap_dio_bio_actor+0x3d0/0x3d0
 iomap_dio_rw+0x2c1/0x450
 ? iomap_dio_bio_actor+0x3d0/0x3d0
 xfs_file_dio_aio_write+0x103/0x2e0 [xfs]
 ? xfs_file_dio_aio_write+0x103/0x2e0 [xfs]
 xfs_file_write_iter+0x99/0xe0 [xfs]
 new_sync_write+0x125/0x1c0
 __vfs_write+0x29/0x40
 vfs_write+0xb9/0x1a0
 ksys_write+0x67/0xe0
 __x64_sys_write+0x1a/0x20
 do_syscall_64+0x57/0x190
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f5625da71e7
Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48
RSP: 002b:00007ffd3916c2a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000004000 RCX: 00007f5625da71e7
RDX: 0000000000004000 RSI: 000055bd59529000 RDI: 0000000000000001
RBP: 000055bd59529000 R08: 000055bd595280f0 R09: 000000000000007c
R10: 000055bd59529000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000004000 R15: 000055bd59529000
---[ end trace b26426a6b66a298e ]---
XFS (dm-1): Internal error xfs_trans_cancel at line 1053 of file fs/xfs/xfs_trans.c.  Caller xfs_iomap_write_direct+0x1fb/0x3c0 [xfs]

The xfs_db freesp report after the problem occurred (N.B. it was a few hours before I was able to get to this machine to investigate):

xfs_db -r -c 'freesp -a 47 -s' /dev/mapper/db-vol
   from      to extents  blocks    pct
      1       1      48      48   0.00
      2       3     119     303   0.02
      4       7      46     250   0.01
      8      15      22     255   0.01
     16      31      17     374   0.02
     32      63      16     728   0.04
     64     127       9     997   0.05
    128     255     149   34271   1.83
    256     511       7    2241   0.12
    512    1023       4    2284   0.12
   1024    2047       1    1280   0.07
   2048    4095       1    3452   0.18
1048576 2097151       1 1825182  97.52
total free extents 440
total free blocks 1871665
average free extent size 4253.78