Re: Possible io_uring related race leads to btrfs data csum mismatch

Qu Wenruo <quwenruo.btrfs@xxxxxxx> · Thu, 17 Aug 2023 05:46:18 +0800

On 2023/8/16 22:33, Jens Axboe wrote:
On 8/16/23 12:52 AM, Qu Wenruo wrote:
Hi,

Recently I'm digging into a very rare failure during btrfs/06[234567],
where btrfs scrub detects unrepairable data corruption.

After days of digging, I have a much smaller reproducer:

```
fail()
{
         echo "!!! FAILED !!!"
         exit 1
}

workload()
{
         mkfs.btrfs -f -m single -d single --csum sha256 $dev1
         mount $dev1 $mnt
     # There are around 10 more combinations with different
         # seed and -p/-n parameters, but this is the smallest one
     # I found so far.
     $fsstress -p 7 -n 50 -s 1691396493 -w -d $mnt
     umount $mnt
     btrfs check --check-data-csum $dev1 || fail
}
runtime=1024
for (( i = 0; i < $runtime; i++ )); do
         echo "=== $i / $runtime ==="
         workload
done
```

Tried to reproduce this, both on a vm and on a real host, and no luck so
far. I've got a few followup questions as your report is missing some
important info:

You may want to try much higher -p/-n numbers.

For verification purpose, I normally go with -p 10 -n 10000, which has a
much higher chance to hit, but definitely too noisy for debug.

I just tried a run with "$fsstress -p 10 -n 10000 -w -d $mnt" as the
workload, it failed at 21/1024.

1) What kernel are you running?

David's misc-next branch, aka, lastest upstream tags plus some btrfs
patches for the next merge window.

Although I have some internal reports showing this problem quite some
time ago.

2) What's the .config you are using?

Pretty common config, no heavy debug options (KASAN etc).

At least here, with a VM with 6 cores (host has 8C/16T), fast enough
storage (PCIE4.0 NVME, with unsafe cache mode), it has the chance around
1/100 to hit the error.

What does "unsafe cche mode" mean?

Libvirt cache option "unsafe"

Which is mostly ignoring flush/fua commands and fully rely on host fs
(in my case it's file backed) cache.

Is that write back caching enabled?
Write back caching with volatile write cache? For your device, can you
do:

$ grep . /sys/block/$dev/queue/*

Checking the fsstress verbose log against the failed file, it turns out
to be an io_uring write.

Any more details on what the write looks like?

For the involved file, it shows the following operations for the minimal
reproducible seed/-p/-n combination:

```
0/24: link d0/f2 d0/f3 0
0/29: fallocate(INSERT_RANGE) d0/f3 [276 2 0 0 176 481971]t 884736 585728 95
0/30: uring_write d0/f3[276 2 0 0 176 481971] [1400622, 56456(res=56456)] 0
0/31: writev d0/f3[276 2 0 0 296 1457078] [709121,8,964] 0
0/34: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[276 2 308134 1763236 320
1457078] return 25, fallback to stat()
0/34: dwrite d0/f3[276 2 308134 1763236 320 1457078] [589824,16384] 0
0/38: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[276 2 308134 1763236 496
1457078] return 25, fallback to stat()
0/38: dwrite d0/f3[276 2 308134 1763236 496 1457078] [2084864,36864] 0
0/40: fallocate(ZERO_RANGE) d0/f3 [276 2 308134 1763236 688 2809139]t
3512660 81075 0
0/43: splice d0/f5[289 1 0 0 1872 2678784] [552619,59420] -> d0/f3[276 2
308134 1763236 856 3593735] [5603798,59420] 0
0/48: fallocate(KEEP_SIZE|PUNCH_HOLE) d0/f3 [276 1 308134 1763236 976
5663218]t 1361821 480392 0
0/49: clonerange d0/f3[276 1 308134 1763236 856 5663218] [2461696,53248]
-> d0/f5[289 1 0 0 1872 2678784] [942080,53248]
```

And with uring_write disabled in fsstress, I have no longer reproduced
the csum mismatch, even with much larger -n and -p parameters.

Is it more likely to reproduce with larger -n/-p in general?

Yes, but I use that specific combination as the minimal reproducer for
debug purposes.

However I didn't see any io_uring related callback inside btrfs code,
any advice on the io_uring part would be appreciated.

io_uring doesn't do anything special here, it uses the normal page cache
read/write parts for buffered IO. But you may get extra parallellism
with io_uring here. For example, with the buffered write that this most
likely is, libaio would be exactly the same as a pwrite(2) on the file.
If this would've blocked, io_uring would offload this to a helper
thread. Depending on the workload, you could have multiple of those in
progress at the same time.

My biggest concern is, would io_uring modify the page when it's still
under writeback?
In that case, it's going to cause csum mismatch as btrfs relies on the
page under writeback to be unchanged.

Thanks,
Qu