On 8/16/23 3:46 PM, Qu Wenruo wrote:
>
>
> On 2023/8/16 22:33, Jens Axboe wrote:
>> On 8/16/23 12:52 AM, Qu Wenruo wrote:
>>> Hi,
>>>
>>> Recently I'm digging into a very rare failure during btrfs/06[234567],
>>> where btrfs scrub detects unrepairable data corruption.
>>>
>>> After days of digging, I have a much smaller reproducer:
>>>
>>> ```
>>> fail()
>>> {
>>>         echo "!!! FAILED !!!"
>>>         exit 1
>>> }
>>>
>>> workload()
>>> {
>>>         mkfs.btrfs -f -m single -d single --csum sha256 $dev1
>>>         mount $dev1 $mnt
>>>         # There are around 10 more combinations with different
>>>         # seed and -p/-n parameters, but this is the smallest one
>>>         # I found so far.
>>>         $fsstress -p 7 -n 50 -s 1691396493 -w -d $mnt
>>>         umount $mnt
>>>         btrfs check --check-data-csum $dev1 || fail
>>> }
>>> runtime=1024
>>> for (( i = 0; i < $runtime; i++ )); do
>>>         echo "=== $i / $runtime ==="
>>>         workload
>>> done
>>> ```
>>
>> Tried to reproduce this, both on a VM and on a real host, and no luck
>> so far. I've got a few followup questions, as your report is missing
>> some important info:
>
> You may want to try much higher -p/-n numbers.
>
> For verification purposes, I normally go with -p 10 -n 10000, which
> has a much higher chance to hit, but is definitely too noisy for
> debugging.
>
> I just tried a run with "$fsstress -p 10 -n 10000 -w -d $mnt" as the
> workload, and it failed at 21/1024.

OK, I'll try that.

>> 1) What kernel are you running?
>
> David's misc-next branch, aka the latest upstream tags plus some btrfs
> patches for the next merge window.
>
> Although I have some internal reports showing this problem from quite
> some time ago.

That's what I was getting at - whether it was new or not.

>> 2) What's the .config you are using?
>
> Pretty common config, no heavy debug options (KASAN etc).

Please just send the .config, I'd rather not have to guess. Things like
preempt etc may make a difference in reproducing this.

>>> At least here, with a VM with 6 cores (host has 8C/16T) and fast
>>> enough storage (PCIe 4.0 NVMe, with unsafe cache mode), it has
>>> around a 1/100 chance to hit the error.
>>
>> What does "unsafe cache mode" mean?
>
> The libvirt cache option "unsafe".
>
> Which mostly means ignoring flush/fua commands and fully relying on
> the host fs cache (in my case it's file backed).

Gotcha

>> Is that write back caching enabled? Write back caching with volatile
>> write cache? For your device, can you do:
>>
>> $ grep . /sys/block/$dev/queue/*
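
For reference, the one entry from that grep that answers the volatile
write cache question directly is the queue's write cache mode; a
minimal check (the device name is a placeholder):

```
# "write back" means the device advertises a volatile write cache
# (so flush/FUA ordering matters); "write through" means it does not.
cat /sys/block/nvme0n1/queue/write_cache
```
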
>>> Checking the fsstress verbose log against the failed file, it turns
>>> out to be an io_uring write.
>>
>> Any more details on what the write looks like?
>
> For the involved file, it shows the following operations for the
> minimal reproducible seed/-p/-n combination:
>
> ```
> 0/24: link d0/f2 d0/f3 0
> 0/29: fallocate(INSERT_RANGE) d0/f3 [276 2 0 0 176 481971]t 884736 585728 95
> 0/30: uring_write d0/f3[276 2 0 0 176 481971] [1400622, 56456(res=56456)] 0
> 0/31: writev d0/f3[276 2 0 0 296 1457078] [709121,8,964] 0
> 0/34: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[276 2 308134 1763236 320
> 1457078] return 25, fallback to stat()
> 0/34: dwrite d0/f3[276 2 308134 1763236 320 1457078] [589824,16384] 0
> 0/38: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[276 2 308134 1763236 496
> 1457078] return 25, fallback to stat()
> 0/38: dwrite d0/f3[276 2 308134 1763236 496 1457078] [2084864,36864] 0
> 0/40: fallocate(ZERO_RANGE) d0/f3 [276 2 308134 1763236 688 2809139]t
> 3512660 81075 0
> 0/43: splice d0/f5[289 1 0 0 1872 2678784] [552619,59420] -> d0/f3[276 2
> 308134 1763236 856 3593735] [5603798,59420] 0
> 0/48: fallocate(KEEP_SIZE|PUNCH_HOLE) d0/f3 [276 1 308134 1763236 976
> 5663218]t 1361821 480392 0
> 0/49: clonerange d0/f3[276 1 308134 1763236 856 5663218] [2461696,53248]
> -> d0/f5[289 1 0 0 1872 2678784] [942080,53248]
> ```

And just to be sure, this is not mixing dio and buffered, right?

>>> However I didn't see any io_uring related callback inside btrfs code,
>>> any advice on the io_uring part would be appreciated.
>>
>> io_uring doesn't do anything special here, it uses the normal page
>> cache read/write parts for buffered IO. But you may get extra
>> parallelism with io_uring here. For example, with the buffered write
>> that this most likely is, libaio would be exactly the same as a
>> pwrite(2) on the file. If this would block, io_uring would offload it
>> to a helper thread. Depending on the workload, you could have
>> multiple of those in progress at the same time.
>
> My biggest concern is, would io_uring modify the page while it's still
> under writeback?

No, of course not. Like I mentioned, io_uring doesn't do anything that
the normal read/write path isn't already doing - it's using the same
->read_iter() and ->write_iter() that everything else is; there's no
page cache code in io_uring.

> In that case, it's going to cause a csum mismatch, as btrfs relies on
> the page under writeback being unchanged.

Sure, I'm aware of the stable page requirements. See my followup email
on a patch to test as well.

--
Jens Axboe
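
For anyone trying to reproduce this, the higher-concurrency variant Qu
mentions above only changes the fsstress invocation in the original
workload; a sketch, reusing the same $dev1/$mnt/$fsstress placeholders
as the reproducer at the top of the thread:

```
workload()
{
        mkfs.btrfs -f -m single -d single --csum sha256 $dev1
        mount $dev1 $mnt
        # No fixed seed, more workers and far more ops: much noisier,
        # but per the thread it fails much sooner (e.g. at 21/1024).
        $fsstress -p 10 -n 10000 -w -d $mnt
        umount $mnt
        btrfs check --check-data-csum $dev1 || fail
}
```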