Re: Possible io_uring related race leads to btrfs data csum mismatch

On 2023/8/20 07:59, Qu Wenruo wrote:
Hi Jens

I did some more debugging on my side, and found a very weird write
behavior:

     Some unexpected direct IO happened, without a corresponding
     fsstress operation.

The workload is:

     $fsstress -p 7 -n 50 -s 1691396493 -w -d $mnt -v > /tmp/fsstress

With this I can reliably reproduce the problem locally, with around a
1/50 probability.
In my particular case, it results in data corruption at root 5 inode
283 offset 8192.

Then I added trace points for the following functions (a sketch of the
instrumentation follows the list):

- btrfs_do_write_iter()
   Two trace points, one before btrfs_direct_write() and one before
   btrfs_buffered_write(), outputting the aligned and unaligned write
   range, the root/inode number, and the type of write (buffered or
   direct).

- btrfs_finish_one_ordered()
   This is where btrfs inserts its ordered extent into the subvolume
   tree.
   This happens when a range of pages finishes its writeback.
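
For reference, the instrumentation was essentially the following (a
sketch rather than the exact patch; the helper and field names follow
the 6.x btrfs code and may need adjusting, and the real patch also
printed the block-aligned ranges):

	/* In btrfs_do_write_iter(), before dispatching to
	 * btrfs_direct_write() or btrfs_buffered_write(): */
	struct btrfs_inode *inode = BTRFS_I(file_inode(iocb->ki_filp));

	trace_printk("r/i=%llu/%llu %s fileoff=%llu len=%zu\n",
		     inode->root->root_key.objectid, btrfs_ino(inode),
		     (iocb->ki_flags & IOCB_DIRECT) ? "direct" : "buffered",
		     (u64)iocb->ki_pos, iov_iter_count(from));

	/* In btrfs_finish_one_ordered(): */
	trace_printk("r/i=%llu/%llu fileoff=%llu len=%llu\n",
		     BTRFS_I(ordered_extent->inode)->root->root_key.objectid,
		     btrfs_ino(BTRFS_I(ordered_extent->inode)),
		     ordered_extent->file_offset,
		     ordered_extent->num_bytes);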

Here is the fsstress log for inode 283 (fsstress does not log the
btrfs root number):

0/22: clonerange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0] [307200,0]
0/23: copyrange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0] [1058819,0]
0/25: write d0/f2[283 2 0 0 0 0] [393644,88327] 0
0/29: fallocate(INSERT_RANGE) d0/f3 [283 2 0 0 176 481971]t 884736 585728 95
0/30: uring_write d0/f3[283 2 0 0 176 481971] [1400622, 56456(res=56456)] 0
0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0
0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[283 2 308134 1763236 320 1457078] return 25, fallback to stat()
0/34: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 320 1457078] return 25, fallback to stat()
0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0
0/38: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 496 1457078] return 25, fallback to stat()
0/38: dwrite d0/f3[283 2 308134 1763236 496 1457078] [2084864,36864] 0
0/39: write d0/d4/f6[283 2 308134 1763236 496 2121728] [2749000,60139] 0
0/40: fallocate(ZERO_RANGE) d0/f3 [283 2 308134 1763236 688 2809139]t 3512660 81075 0
0/43: splice d0/f5[293 1 0 0 1872 2678784] [552619,59420] -> d0/f3[283 2 308134 1763236 856 3593735] [5603798,59420] 0
0/48: fallocate(KEEP_SIZE|PUNCH_HOLE) d0/f3 [283 1 308134 1763236 976 5663218]t 1361821 480392 0
0/49: clonerange d0/f3[283 1 308134 1763236 856 5663218] [2461696,53248] -> d0/f5[293 1 0 0 1872 2678784] [942080,53248]

Note one thing: there is no direct or buffered write into inode 283 at
offset 8192.

But from the trace events for root 5 inode 283:

  btrfs_do_write_iter: r/i=5/283 buffered fileoff=393216(393644) len=90112(88327)
  btrfs_do_write_iter: r/i=5/283 buffered fileoff=1396736(1400622) len=61440(56456)
  btrfs_do_write_iter: r/i=5/283 buffered fileoff=708608(709121) len=12288(7712)

  btrfs_do_write_iter: r/i=5/283 direct fileoff=8192(8192) len=73728(73728) <<<<<

  btrfs_do_write_iter: r/i=5/283 direct fileoff=589824(589824) len=16384(16384)
  btrfs_finish_one_ordered: r/i=5/283 fileoff=8192 len=73728
  btrfs_finish_one_ordered: r/i=5/283 fileoff=589824 len=16384
  btrfs_do_write_iter: r/i=5/283 direct fileoff=2084864(2084864) len=36864(36864)
  btrfs_finish_one_ordered: r/i=5/283 fileoff=2084864 len=36864
  btrfs_do_write_iter: r/i=5/283 buffered fileoff=2748416(2749000) len=61440(60139)
  btrfs_do_write_iter: r/i=5/283 buffered fileoff=5603328(5603798) len=61440(59420)
  btrfs_finish_one_ordered: r/i=5/283 fileoff=393216 len=90112
  btrfs_finish_one_ordered: r/i=5/283 fileoff=708608 len=12288
  btrfs_finish_one_ordered: r/i=5/283 fileoff=1396736 len=61440
  btrfs_finish_one_ordered: r/i=5/283 fileoff=3592192 len=4096
  btrfs_finish_one_ordered: r/i=5/283 fileoff=2748416 len=61440
  btrfs_finish_one_ordered: r/i=5/283 fileoff=5603328 len=61440

Note the phantom direct IO call, which covers the corrupted range.

Matched against the fsstress log, that phantom write happens between
these two operations:

0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0
0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0

To be more accurate, there is a 0/33 operation in between, which is:

0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[285 2 308134 1763236 320 1457078] return 25, fallback to stat()
0/33: awrite - io_getevents failed -4

The failed line doesn't print an inode number, thus it didn't get
caught by my grep.

Return value -4 means -EINTR; I'm not sure who sent the interrupt.
But if this interrupt arrives before the IO has finished, fsstress can
call free() on the buffer, and if we're unlucky enough the freed memory
can be re-allocated for some other use, modifying the pages before the
writeback has finished.

I think this is the direct cause of the data corruption: pages being
modified before the direct IO has finished.
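
To make the suspected sequence concrete, here is a minimal userspace
sketch of the hazard (my reconstruction using plain libaio, not the
actual fsstress code; build with -laio):

	#define _GNU_SOURCE
	#include <errno.h>
	#include <fcntl.h>
	#include <libaio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
		io_context_t ctx = 0;
		void *buf;
		struct iocb cb, *cbs[1] = { &cb };
		struct io_event ev;
		int ret;

		io_setup(8, &ctx);

		/* Direct IO buffers must be block aligned. */
		posix_memalign(&buf, 4096, 16384);
		memset(buf, 0xaa, 16384);

		io_prep_pwrite(&cb, fd, buf, 16384, 8192);
		io_submit(ctx, 1, cbs);

		ret = io_getevents(ctx, 1, 1, &ev, NULL);
		if (ret == -EINTR) {
			/* The write was submitted and may still be in
			 * flight, but its completion was never reaped. */
		}

		/* BUG if the IO is still in flight: the allocator can hand
		 * these pages to another allocation, which then modifies
		 * them before the writeback (and csum verification)
		 * finishes. */
		free(buf);

		io_destroy(ctx);
		close(fd);
		return 0;
	}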

But unfortunately I still haven't figured out why the interrupt can
happen, nor how we should handle such an interrupt.
(I guess just retry?)
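
If retrying is indeed the right answer, the fsstress-side fix would be
something along these lines (a hypothetical helper, assuming -EINTR
only means the wait was interrupted, not that the IO itself failed):

	#include <errno.h>
	#include <libaio.h>

	/* Hypothetical helper: reap one completion, retrying on -EINTR,
	 * so the caller only frees an IO buffer after the IO has really
	 * completed. */
	static int reap_one_event(io_context_t ctx, struct io_event *ev)
	{
		int ret;

		do {
			ret = io_getevents(ctx, 1, 1, ev, NULL);
		} while (ret == -EINTR);

		return ret;
	}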

I'll keep digging, but the phantom writes, which are not properly
logged by fsstress, are already a concern to me.

Or maybe I'm missing some fixes in fsstress?

Thanks,
Qu



