On 2023/8/20 07:59, Qu Wenruo wrote:
Hi Jens I tried more on my side to debug the situation, and found a very weird write behavior: Some unexpected direct IO happened, without corresponding fsstress workload. The workload is: $fsstress -p 7 -n 50 -s 1691396493 -w -d $mnt -v > /tmp/fsstress Which I can reliably reproduce the problem locally, around 1/50 possibility. In my particular case, it results data corruption at root 5 inode 283 offset 8192. Then I added some trace points for the following functions: - btrfs_do_write_iter() Two trace points, one before btrfs_direct_write(), and one before btrfs_buffered_write(), outputting the aligned and unaligned write range, root/inode number, type of the write (buffered or direct). - btrfs_finish_one_ordered() This is where btrfs inserts its ordered extent into the subvolume tree. This happens when a range of pages finishes its writeback. Then here comes the fsstress log for inode 283 (no btrfs root number): 0/22: clonerange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0] [307200,0] 0/23: copyrange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0] [1058819,0] 0/25: write d0/f2[283 2 0 0 0 0] [393644,88327] 0 0/29: fallocate(INSERT_RANGE) d0/f3 [283 2 0 0 176 481971]t 884736 585728 95 0/30: uring_write d0/f3[283 2 0 0 176 481971] [1400622, 56456(res=56456)] 0 0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0 0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[283 2 308134 1763236 320 1457078] return 25, fallback to stat() 0/34: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 320 1457078] return 25, fallback to stat() 0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0 0/38: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 496 1457078] return 25, fallback to stat() 0/38: dwrite d0/f3[283 2 308134 1763236 496 1457078] [2084864,36864] 0 0/39: write d0/d4/f6[283 2 308134 1763236 496 2121728] [2749000,60139] 0 0/40: fallocate(ZERO_RANGE) d0/f3 [283 2 308134 1763236 688 2809139]t 3512660 81075 0 0/43: splice d0/f5[293 1 0 0 1872 2678784] [552619,59420] -> d0/f3[283 2 308134 1763236 856 3593735] [5603798,59420] 0 0/48: fallocate(KEEP_SIZE|PUNCH_HOLE) d0/f3 [283 1 308134 1763236 976 5663218]t 1361821 480392 0 0/49: clonerange d0/f3[283 1 308134 1763236 856 5663218] [2461696,53248] -> d0/f5[293 1 0 0 1872 2678784] [942080,53248] Note one thing, there is no direct/buffered write into inode 283 offset 8192. But from the trace events for root 5 inode 283: btrfs_do_write_iter: r/i=5/283 buffered fileoff=393216(393644) len=90112(88327) btrfs_do_write_iter: r/i=5/283 buffered fileoff=1396736(1400622) len=61440(56456) btrfs_do_write_iter: r/i=5/283 buffered fileoff=708608(709121) len=12288(7712) btrfs_do_write_iter: r/i=5/283 direct fileoff=8192(8192) len=73728(73728) <<<<< btrfs_do_write_iter: r/i=5/283 direct fileoff=589824(589824) len=16384(16384) btrfs_finish_one_ordered: r/i=5/283 fileoff=8192 len=73728 btrfs_finish_one_ordered: r/i=5/283 fileoff=589824 len=16384 btrfs_do_write_iter: r/i=5/283 direct fileoff=2084864(2084864) len=36864(36864) btrfs_finish_one_ordered: r/i=5/283 fileoff=2084864 len=36864 btrfs_do_write_iter: r/i=5/283 buffered fileoff=2748416(2749000) len=61440(60139) btrfs_do_write_iter: r/i=5/283 buffered fileoff=5603328(5603798) len=61440(59420) btrfs_finish_one_ordered: r/i=5/283 fileoff=393216 len=90112 btrfs_finish_one_ordered: r/i=5/283 fileoff=708608 len=12288 btrfs_finish_one_ordered: r/i=5/283 fileoff=1396736 len=61440 btrfs_finish_one_ordered: r/i=5/283 fileoff=3592192 len=4096 btrfs_finish_one_ordered: r/i=5/283 fileoff=2748416 len=61440 btrfs_finish_one_ordered: r/i=5/283 fileoff=5603328 len=61440 Note that phantom direct IO call, which is in the corrupted range. If paired with fsstress, that phantom write happens between the two operations: 0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0 0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0
Just to be more accurate, there is a 0/33 operation, which is: 0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[285 2 308134 1763236 320 1457078] return 25, fallback to stat() 0/33: awrite - io_getevents failed -4 The failed one doesn't have inode number thus it didn't get caught by grep. Return value -4 means -INTR, not sure who sent the interruption. But if this interruption happens before the IO finished, we can call free() on the buffer, and if we're unlucky enough, the freed memory can be re-allocated for some other usage, thus modifying the pages before the writeback finished. I think this is the direct cause of the data corruption, page modification before direct IO finished. But unfortunately I still didn't get why the interruption can happen, nor how can we handle such interruption? (I guess just retry?) Thanks, Qu
I'll keep digging, but the phantom writes which is not properly loggeg from fsstress is already a concern to me. Or maybe I'm missing some fixes in fsstress? Thanks, Qu