I don't have a good reproducer so it's a bit difficult to specify; let's start with what I have...

I've got this one VM which has segfaults all over the place when started with aio=io_uring for its disk, as follows:

  qemu-system-x86_64 -drive file=qemu/atde-test,if=none,id=hd0,format=raw,cache=none,aio=io_uring \
      -device virtio-blk-pci,drive=hd0 -m 8G -smp 4 -serial mon:stdio -enable-kvm

It also happens with virtio-scsi:

  -device virtio-scsi-pci,id=scsihw0 \
      -drive file=qemu/atde-test,if=none,id=drive-scsi0,format=raw,cache=none,aio=io_uring \
      -device scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100

It also happened when the disk I was using was a qcow2 overlay backed by a vmdk image (this VM's original disk is for vmware), so while I assume the qemu read path and the qemu-img convert path are similar, I'll pretend image format doesn't matter at this point... It's happened with two such images, but I haven't been able to reproduce it with any other VMs yet.

I can also reproduce this on a second host machine with a completely different ssd (a WD sata drive on one vs. a samsung nvme on the other), so it's probably not a firmware bug. btrfs scrub sees no problem with my filesystems on the host.

I've confirmed it happens with at least debian testing's 5.16.0-4-amd64 and 5.17.0-1-amd64 kernels, as well as 5.19.0-rc4. It also happens with both debian's qemu 7.0.0 and the master branch (v7.0.0-2031-g40d522490714).

These factors aside, anything else I tried changing made the bug no longer reproduce:
- I'm not sure what the rule is, but it sometimes doesn't happen when running the VM twice in a row, and sometimes it happens again. Making a fresh copy of my source image with `cp --reflink=always` seems to make it reproduce reliably.
- it stops happening without io_uring
- it stops happening if I copy the disk image with --reflink=never
- it stops happening if I copy the disk image to another btrfs partition created in the same lv, so something about my partition's history matters?... (I have ssd > GPT partitions > luks > lvm > btrfs, a single disk with metadata DUP and data single)
- I was unable to reproduce on xfs either (with a reflink copy), but I was also only able to try on a freshly created fs...
- I've never been able to reproduce it on other VMs

If you'd like to give it a try, my reproducer source image is
---
curl -O https://download.atmark-techno.com/atde/atde9-amd64-20220624.tar.xz
tar xf atde9-amd64-20220624.tar.xz
qemu-img convert -O raw atde9-amd64-20220624/atde9-amd64.vmdk atde-test
cp --reflink=always atde-test atde-test2
---
using 'atde-test' as the disk. For further attempts I've removed atde-test and copied it back from atde-test2 with cp --reflink=always.

This VM's graphical output is broken, but ssh listens, so something like `-netdev user,id=net0,hostfwd=tcp::2227-:22 -device virtio-net-pci,netdev=net0` and 'ssh -p 2227 -l atmark localhost' should allow login with password 'atmark', or you can change vt on the console (root password 'root'). I also had similar problems with atde9-amd64-20211201.tar.xz.

When reproducing I've had either segfaults in the initrd and complete boot failures, or the boot working but logins failing, while ssh without a login shell (ssh ... -tt localhost sh) still worked and let me dump the contents of a couple of corrupted files. When I looked:
- /usr/lib/x86_64-linux-gnu/libc-2.31.so had zeroes instead of data from offset 0xb6000 to 0xb7fff; the rest of the file was identical.
- /usr/bin/dmesg had garbage from 0x05000 until 0x149d8 (end of file).
I was lucky and could match the garbage quickly: it is identical to the content from 0x1000-0x109d8 within the disk image itself.

I've rebooted a few times and the corruption looks identical every time on this machine, as long as I keep using the same source file; running qemu-img convert again seems to change things a bit, but whatever it is that is specific to these files is stable, even across host reboots.

I'm sorry I haven't been able to make a better reproducer; I'll keep trying a bit more tomorrow, but maybe someone has an idea with what I've got so far :/

Perhaps at this point it might be simpler to take qemu out of the equation and issue many parallel reads to different (overlapping?) offsets of a large file, in a similar way to qemu's io_uring engine, and check their contents? (I've appended a rough sketch of what I mean below.)

Thanks, and I'll probably follow up a bit tomorrow even if no-one has any ideas, but even ideas of where to look would be appreciated.
-- 
Dominique
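
PS: in case it helps to illustrate the standalone check I have in mind, here is a minimal sketch (my own rough code, not qemu's actual io_uring engine; the queue depth, block size and use of O_DIRECT to mimic cache=none are assumptions and can be tweaked). It keeps a batch of io_uring reads in flight and compares each completed buffer against a plain buffered pread() of the same range:
---
/* uring-readcheck.c: issue batches of parallel O_DIRECT reads with io_uring
 * (roughly what cache=none,aio=io_uring would do) and compare each buffer
 * against a plain buffered pread() of the same range.
 * Build: gcc -O2 -o uring-readcheck uring-readcheck.c -luring
 * Run:   ./uring-readcheck atde-test
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define QD  64              /* requests kept in flight per batch (arbitrary) */
#define BLK (128 * 1024)    /* bytes per read (arbitrary) */

struct req {
    off_t offset;
    void *buf;
};

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    /* O_DIRECT fd for io_uring (like cache=none), buffered fd to verify */
    int dfd = open(argv[1], O_RDONLY | O_DIRECT);
    int bfd = open(argv[1], O_RDONLY);
    if (dfd < 0 || bfd < 0) { perror("open"); return 1; }
    off_t size = lseek(dfd, 0, SEEK_END);

    struct io_uring ring;
    if (io_uring_queue_init(QD, &ring, 0) < 0) {
        fprintf(stderr, "io_uring_queue_init failed\n");
        return 1;
    }

    struct req reqs[QD];
    void *check = NULL;
    if (posix_memalign(&check, 4096, BLK))
        return 1;
    for (int i = 0; i < QD; i++)
        if (posix_memalign(&reqs[i].buf, 4096, BLK))
            return 1;

    off_t off = 0;
    long mismatches = 0, errors = 0;

    while (off < size) {
        /* queue up to QD reads at consecutive offsets, then submit them all */
        int queued = 0;
        while (queued < QD && off < size) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            struct req *r = &reqs[queued];
            r->offset = off;
            io_uring_prep_read(sqe, dfd, r->buf, BLK, off);
            io_uring_sqe_set_data(sqe, r);
            off += BLK;
            queued++;
        }
        io_uring_submit(&ring);

        /* reap completions and verify each buffer against pread() */
        for (int i = 0; i < queued; i++) {
            struct io_uring_cqe *cqe;
            if (io_uring_wait_cqe(&ring, &cqe) < 0) {
                errors++;
                continue;
            }
            struct req *r = io_uring_cqe_get_data(cqe);
            int res = cqe->res;
            io_uring_cqe_seen(&ring, cqe);
            if (res < 0) {
                errors++;
                continue;
            }
            if (pread(bfd, check, res, r->offset) != res ||
                memcmp(check, r->buf, res)) {
                fprintf(stderr, "mismatch at offset 0x%llx\n",
                        (unsigned long long)r->offset);
                mismatches++;
            }
        }
    }
    io_uring_queue_exit(&ring);
    printf("done: %ld mismatches, %ld errors\n", mismatches, errors);
    return mismatches || errors ? 1 : 0;
}
---
If a loop like this shows mismatches outside qemu, that would point at the kernel/filesystem side rather than qemu's io_uring code; if it doesn't, I'll look more at how qemu actually submits its requests.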