On 8/16/23 3:46 PM, Qu Wenruo wrote:
>
>
> On 2023/8/16 22:33, Jens Axboe wrote:
>> On 8/16/23 12:52 AM, Qu Wenruo wrote:
>>> Hi,
>>>
>>> Recently I'm digging into a very rare failure during btrfs/06[234567],
>>> where btrfs scrub detects unrepairable data corruption.
>>>
>>> After days of digging, I have a much smaller reproducer:
>>>
>>> ```
>>> fail()
>>> {
>>>         echo "!!! FAILED !!!"
>>>         exit 1
>>> }
>>>
>>> workload()
>>> {
>>>         mkfs.btrfs -f -m single -d single --csum sha256 $dev1
>>>         mount $dev1 $mnt
>>>         # There are around 10 more combinations with different
>>>         # seed and -p/-n parameters, but this is the smallest one
>>>         # I found so far.
>>>         $fsstress -p 7 -n 50 -s 1691396493 -w -d $mnt
>>>         umount $mnt
>>>         btrfs check --check-data-csum $dev1 || fail
>>> }
>>> runtime=1024
>>> for (( i = 0; i < $runtime; i++ )); do
>>>         echo "=== $i / $runtime ==="
>>>         workload
>>> done
>>> ```
>>
>> Tried to reproduce this, both on a VM and on a real host, and no luck
>> so far. I've got a few followup questions, as your report is missing
>> some important info:
>
> You may want to try much higher -p/-n numbers.
>
> For verification purposes, I normally go with -p 10 -n 10000, which
> has a much higher chance to hit, but is definitely too noisy for
> debugging.
>
> I just tried a run with "$fsstress -p 10 -n 10000 -w -d $mnt" as the
> workload, and it failed at 21/1024.

OK, I'll try that.

>> 1) What kernel are you running?
>
> David's misc-next branch, aka the latest upstream tags plus some btrfs
> patches for the next merge window.
>
> Although I have some internal reports showing this problem from quite
> some time ago.

That's what I was getting at - whether it was new or not.

>> 2) What's the .config you are using?
>
> Pretty common config, no heavy debug options (KASAN etc).

Please just send the .config, I'd rather not have to guess. Things like
preempt etc may make a difference in reproducing this.

>>> At least here, with a VM with 6 cores (host has 8C/16T) and fast
>>> enough storage (PCIe 4.0 NVMe, with unsafe cache mode), it has
>>> around a 1/100 chance to hit the error.
>>
>> What does "unsafe cache mode" mean?
>
> The libvirt cache option "unsafe".
>
> Which mostly means ignoring flush/fua commands and fully relying on
> the host fs cache (in my case it's file backed).

Gotcha

>> Is that write back caching enabled? Write back caching with volatile
>> write cache? For your device, can you do:
>>
>> $ grep . /sys/block/$dev/queue/*
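
For reference, the one entry from that grep that answers the volatile
write cache question directly is the queue's write cache mode; a
minimal check (the device name is a placeholder):

```
# "write back" means the device advertises a volatile write cache
# (so flush/FUA ordering matters); "write through" means it does not.
cat /sys/block/nvme0n1/queue/write_cache
```
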
>>> Checking the fsstress verbose log against the failed file, it turns
>>> out to be an io_uring write.
>>
>> Any more details on what the write looks like?
>
> For the involved file, it shows the following operations for the
> minimal reproducible seed/-p/-n combination:
>
> ```
> 0/24: link d0/f2 d0/f3 0
> 0/29: fallocate(INSERT_RANGE) d0/f3 [276 2 0 0 176 481971]t 884736 585728 95
> 0/30: uring_write d0/f3[276 2 0 0 176 481971] [1400622, 56456(res=56456)] 0
> 0/31: writev d0/f3[276 2 0 0 296 1457078] [709121,8,964] 0
> 0/34: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[276 2 308134 1763236 320
> 1457078] return 25, fallback to stat()
> 0/34: dwrite d0/f3[276 2 308134 1763236 320 1457078] [589824,16384] 0
> 0/38: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[276 2 308134 1763236 496
> 1457078] return 25, fallback to stat()
> 0/38: dwrite d0/f3[276 2 308134 1763236 496 1457078] [2084864,36864] 0
> 0/40: fallocate(ZERO_RANGE) d0/f3 [276 2 308134 1763236 688 2809139]t
> 3512660 81075 0
> 0/43: splice d0/f5[289 1 0 0 1872 2678784] [552619,59420] -> d0/f3[276 2
> 308134 1763236 856 3593735] [5603798,59420] 0
> 0/48: fallocate(KEEP_SIZE|PUNCH_HOLE) d0/f3 [276 1 308134 1763236 976
> 5663218]t 1361821 480392 0
> 0/49: clonerange d0/f3[276 1 308134 1763236 856 5663218] [2461696,53248]
> -> d0/f5[289 1 0 0 1872 2678784] [942080,53248]
> ```

And just to be sure, this is not mixing dio and buffered, right?

>>> However I didn't see any io_uring related callback inside btrfs code,
>>> any advice on the io_uring part would be appreciated.
>>
>> io_uring doesn't do anything special here, it uses the normal page
>> cache read/write parts for buffered IO. But you may get extra
>> parallelism with io_uring here. For example, with the buffered write
>> that this most likely is, libaio would be exactly the same as a
>> pwrite(2) on the file. If this would block, io_uring would offload it
>> to a helper thread. Depending on the workload, you could have
>> multiple of those in progress at the same time.
>
> My biggest concern is, would io_uring modify the page while it's still
> under writeback?

No, of course not. Like I mentioned, io_uring doesn't do anything that
the normal read/write path isn't already doing - it's using the same
->read_iter() and ->write_iter() that everything else is; there's no
page cache code in io_uring.

> In that case, it's going to cause a csum mismatch, as btrfs relies on
> the page under writeback being unchanged.

Sure, I'm aware of the stable page requirements. See my followup email
on a patch to test as well.

--
Jens Axboe
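
For anyone trying to reproduce this, the higher-concurrency variant Qu
mentions above only changes the fsstress invocation in the original
workload; a sketch, reusing the same $dev1/$mnt/$fsstress placeholders
as the reproducer at the top of the thread:

```
workload()
{
        mkfs.btrfs -f -m single -d single --csum sha256 $dev1
        mount $dev1 $mnt
        # No fixed seed, more workers and far more ops: much noisier,
        # but per the thread it fails much sooner (e.g. at 21/1024).
        $fsstress -p 10 -n 10000 -w -d $mnt
        umount $mnt
        btrfs check --check-data-csum $dev1 || fail
}
```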