On Mon, Jan 28, 2019 at 08:25:20PM +0100, Thomas Lindroth wrote:
> I run a qemu/kvm VM with debian and I've started getting segfaults and failing checksums on
> downloaded files. The failures are nondeterministic and similar to the failures you get with
> bad ram. I tried to diagnose the problem with various testing tools and found that
> "stress-ng --verify --cpu 1" always gives an error. Stress-ng gives one of these errors
> usually within 60 sec:
>
> stress-ng-cpu: Newton-Rapshon sqrt not accurate enough
> stress-ng-cpu: prime error detected, number of primes between 0 and 1000000 miscalculated
>
> Nothing relevant has changed recently in the VM but the host kernel was upgraded from
> 4.14.93 to 4.14.96. I can't reproduce the stress-ng error with a 4.14.93 host kernel. There
> is only one kvm related change in that range so I tried to revert that one.
>
> By reverting commit 4124a4cff344abbf8187775eb643d9827830e715
> "x86,kvm: move qemu/guest FPU switching out to vcpu_run" on kernel 4.14.96 I can't reproduce
> the stress-ng error and I have no segfault or other problems with the guest.

This is the second report of this issue:

https://bugzilla.kernel.org/show_bug.cgi?id=202419

Upon inspection, the commit in question is obviously buggy: kvm_arch_vcpu_ioctl_run()
doubles up on kvm_{load,put}_guest_fpu(). The ordering of the mainline commits

  f775b13eedee ("x86,kvm: move qemu/guest FPU switching out to vcpu_run")

and

  5663d8f9bbe4 ("kvm: x86: fix WARN due to uninitialized guest FPU state")

was reversed when they were backported to 4.14. Commit 5663d8f9bbe4 even explicitly
notes that it fixes f775b13eedee.

I'll send a patch.

>
> The commit was originally introduced in v4.15-rc3 (Nov 14 2017) and was only recently
> backported to 4.14. The other stable kernels before 4.14 didn't get any backport so it looks
> like a broken 4.14 backport. That backport also causes problems for other people.
> https://bugzilla.kernel.org/show_bug.cgi?id=202419
>
> I've rebooted between the different kernels and rebooted the VM enough to be reasonably sure
> that commit is the problem. Stress-ng never lasts more than 10 min with that commit but works
> for hours without it.
>
> Steps to reproduce would be to create a qemu/kvm VM with debian stretch, install stress-ng
> version 0.07.16 and run "stress-ng --verify --cpu 1".
>
> Here is the qemu-3.1.0 commandline generated by libvirt:
> /usr/bin/qemu-system-x86_64 -name guest=debian,debug-threads=on -S -object
> secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-debian/master-key.aes
> -machine pc-i440fx-2.4,accel=kvm,usb=off,dump-guest-core=off -cpu Haswell-noTSX -m 2048
> -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -uuid
> 0473ded4-d417-4b0e-a4f5-36ba5a2cd675 -no-user-config -nodefaults -chardev
> socket,id=charmonitor,fd=21,server,nowait -mon chardev=charmonitor,id=monitor,mode=control
> -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown
> -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on
> -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x5.0x7 -device
> ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x5 -device
> ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x5.0x1 -device
> ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x5.0x2 -drive
> if=none,id=drive-ide0-0-1,readonly=on -device
> ide-cd,bus=ide.0,unit=1,drive=drive-ide0-0-1,id=ide0-0-1,bootindex=2 -drive
> file=/mnt/gemini.61rn.3T/Backups/debian.raw,format=raw,if=none,id=drive-virtio-disk0 -device
> virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
> -netdev tap,fd=23,id=hostnet0 -device
> virtio-net-pci,netdev=hostnet0,id=net0,mac=00:11:22:33:44:55,bus=pci.0,addr=0x3 -spice
> port=5900,addr=127.0.0.1,disable-ticketing,seamless-migration=on -device
> VGA,id=video0,vgamem_mb=16,bus=pci.0,addr=0x2 -device AC97,id=sound0,bus=pci.0,addr=0x7
> -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 -object
> rng-random,id=objrng0,filename=/dev/random -device
> virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.0,addr=0x8 -sandbox
> on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on
>
> My host kernel .config is big so I put it in a paste: http://sprunge.us/u7YNBt
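
For anyone who wants to check a host kernel against the reproduction steps
quoted above, here is a minimal sketch of a pass/fail wrapper around the
reported stress-ng invocation, to be run inside the guest. The function name
check_guest_fpu, the STRESS_NG override variable and the 600 second default
(picked from the "never lasts more than 10 min" observation) are illustrative
assumptions, not something from the report. It also assumes that stress-ng
exits non-zero when --verify detects an error.

```shell
#!/bin/sh
# Hypothetical helper, not part of the original report.
# STRESS_NG can be overridden to point at a different binary (assumption).
STRESS_NG="${STRESS_NG:-stress-ng}"

# check_guest_fpu [seconds]
# Runs the reported command for a bounded time and prints PASS or FAIL.
# Assumes stress-ng returns a non-zero exit status on a --verify failure.
check_guest_fpu() {
    timeout_s="${1:-600}"   # default: 10 minutes, an assumption
    if "$STRESS_NG" --verify --cpu 1 --timeout "${timeout_s}s"; then
        echo "PASS"
    else
        echo "FAIL"
        return 1
    fi
}
```

Calling "check_guest_fpu 600" in the guest after booting each host kernel
should give a quick comparison. A PASS after only ten minutes is weak evidence
on its own, since the report says the guest runs for hours on a good kernel.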