Re: Possible io_uring related race leads to btrfs data csum mismatch

Qu Wenruo <quwenruo.btrfs@xxxxxxx> · Thu, 17 Aug 2023 09:05:56 +0800

On 2023/8/17 06:28, Jens Axboe wrote:
[...]

2) What's the .config you are using?

Pretty common config, no heavy debug options (KASAN etc).

Please just send the .config, I'd rather not have to guess. Things like
preempt etc may make a difference in reproducing this.

Sure, please see the attached config.gz

At least here, with a VM with 6 cores (host has 8C/16T), fast enough
storage (PCIE4.0 NVME, with unsafe cache mode), it has the chance around
1/100 to hit the error.

What does "unsafe cche mode" mean?

Libvirt cache option "unsafe"

Which is mostly ignoring flush/fua commands and fully rely on host fs
(in my case it's file backed) cache.

Gotcha

Is that write back caching enabled?
Write back caching with volatile write cache? For your device, can you
do:

$ grep . /sys/block/$dev/queue/*

Checking the fsstress verbose log against the failed file, it turns out
to be an io_uring write.

Any more details on what the write looks like?

For the involved file, it shows the following operations for the minimal
reproducible seed/-p/-n combination:

```
0/24: link d0/f2 d0/f3 0
0/29: fallocate(INSERT_RANGE) d0/f3 [276 2 0 0 176 481971]t 884736 585728 95
0/30: uring_write d0/f3[276 2 0 0 176 481971] [1400622, 56456(res=56456)] 0
0/31: writev d0/f3[276 2 0 0 296 1457078] [709121,8,964] 0
0/34: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[276 2 308134 1763236 320
1457078] return 25, fallback to stat()
0/34: dwrite d0/f3[276 2 308134 1763236 320 1457078] [589824,16384] 0
0/38: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[276 2 308134 1763236 496
1457078] return 25, fallback to stat()
0/38: dwrite d0/f3[276 2 308134 1763236 496 1457078] [2084864,36864] 0
0/40: fallocate(ZERO_RANGE) d0/f3 [276 2 308134 1763236 688 2809139]t
3512660 81075 0
0/43: splice d0/f5[289 1 0 0 1872 2678784] [552619,59420] -> d0/f3[276 2
308134 1763236 856 3593735] [5603798,59420] 0
0/48: fallocate(KEEP_SIZE|PUNCH_HOLE) d0/f3 [276 1 308134 1763236 976
5663218]t 1361821 480392 0
0/49: clonerange d0/f3[276 1 308134 1763236 856 5663218] [2461696,53248]
-> d0/f5[289 1 0 0 1872 2678784] [942080,53248]
```

And just to be sure, this is not mixing dio and buffered, right?

I'd say it's mixing, there are dwrite() and writev() for the same file,
but at least not overlapping using this particular seed, nor they are
concurrent (all inside the same process sequentially).

But considering if only uring_write is disabled, then no more reproduce,
thus there must be some untested btrfs path triggered by uring_write.

However I didn't see any io_uring related callback inside btrfs code,
any advice on the io_uring part would be appreciated.

io_uring doesn't do anything special here, it uses the normal page cache
read/write parts for buffered IO. But you may get extra parallellism
with io_uring here. For example, with the buffered write that this most
likely is, libaio would be exactly the same as a pwrite(2) on the file.
If this would've blocked, io_uring would offload this to a helper
thread. Depending on the workload, you could have multiple of those in
progress at the same time.

My biggest concern is, would io_uring modify the page when it's still
under writeback?

No, of course not. Like I mentioned, io_uring doesn't do anything that
the normal read/write path isn't already doing - it's using the same
->read_iter() and ->write_iter() that everything else is, there's no
page cache code in io_uring.

In that case, it's going to cause csum mismatch as btrfs relies on the
page under writeback to be unchanged.

Sure, I'm aware of the stable page requirements.

See my followup email as well on a patch to test as well.

Applied and tested, using "-p 10 -n 1000" as fsstress workload, failed
at 23rd run.

Thanks,
Qu
Attachment:
config.gz

Description: application/gzip