On Fri, Dec 15, 2023, bugzilla-daemon@xxxxxxxxxx wrote:
> Platform: Sapphire Rapids Platform
>
> Host OS: CentOS Stream 9
>
> Kernel:6.7.0-rc1 (commit:8ed26ab8d59111c2f7b86d200d1eb97d2a458fd1)

...

> Qemu: QEMU emulator version 8.1.94 (v8.2.0-rc4)
> (commit:039afc5ef7367fbc8fb475580c291c2655e856cb)
>
> Host Kernel cmdline:BOOT_IMAGE=/kvm-vmlinuz root=/dev/mapper/cs_spr--2s2-root
> ro crashkernel=auto console=tty0 console=ttyS0,115200,8n1 3 intel_iommu=on
> disable_mtrr_cleanup
>
> Bug detailed description
> =======
> We boot up 8 Windows VMs (total vCPUs > pCPUs) in host, random run application
> on each VM such as WPS editing etc, and wait for a moment, then Some of the
> Windows Guest hang and console reports "KVM internal error. Suberror: 3".

...

> Code=25 88 61 00 00 b9 70 00 00 40 0f ba 32 00 72 06 33 c0 8b d0 <0f> 30 5a 58
> 59 c3 cc cc cc cc cc cc 0f 1f 84 00 00 00 00 00 48 81 ec 38 01 00 00 48 8d 84
>
> KVM internal error. Suberror: 3
> extra data[0]: 0x000000008000002f <= Vectoring IRQ 47 (decimal)
> extra data[1]: 0x0000000000000020 <= WRMSR VM-Exit
> extra data[2]: 0x0000000000000f82
> extra data[3]: 0x000000000000004b

KVM exits with an internal error because the CPU indicates that IRQ 47 was being
delivered/vectored when the VM-Exit occurred, but the VM-Exit is due to WRMSR.
A WRMSR VM-Exit is supposed to only occur on an instruction boundary, i.e. can't
occur while delivering an IRQ (or any exception/event), and so KVM kicks out to
userspace because something has gone off the rails.

  b9 70 00 00 40       mov    $0x40000070,%ecx
  0f ba 32 00          btrl   $0x0,(%rdx)
  72 06                jb     0x16
  33 c0                xor    %eax,%eax
  8b d0                mov    %eax,%edx
  0f 30                wrmsr

FWIW, the MSR in question is Hyper-V's synthetic EOI, a.k.a. HV_X64_MSR_EOI,
though I doubt the exact MSR matters.

Have you tried an older host kernel? If not, can you try something like v6.1?
Note, if you do, use base v6.1, *not* the stable tree in case a bug was
backported.

There was a recent change to relevant code, commit 50011c2a2457 ("KVM: VMX: Refresh available regs and IDT vectoring info before NMI handling"), though I don't see any obvious bugs. But I'm pretty sure the only alternative explanation is a CPU/ucode bug, so it's definitely worth checking older versions of KVM.
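
For anyone following along, here is a rough sketch of how the two annotated "extra data" values and the MSR index can be decoded by hand. It is only an illustration: the bit layout is my reading of the IDT-vectoring information field in the Intel SDM, the exit-reason table is a two-entry subset, and all helper names are made up for this example (none of them come from KVM or QEMU).

```python
import struct

# Values copied from the "KVM internal error" dump quoted above.
IDT_VECTORING_INFO = 0x8000002F  # extra data[0]
EXIT_REASON = 0x20               # extra data[1]

# Assumed layout of the IDT-vectoring information field (per the Intel SDM):
# bits 7:0 = vector, bits 10:8 = event type, bit 31 = valid.
VECTOR_MASK = 0xFF
TYPE_SHIFT = 8
TYPE_MASK = 0x7
VALID_BIT = 1 << 31

EVENT_TYPES = {
    0: "external interrupt",
    2: "NMI",
    3: "hardware exception",
    4: "software interrupt",
    5: "privileged software exception",
    6: "software exception",
}

# Deliberately tiny subset of the basic exit reasons, enough for this report.
EXIT_REASONS = {31: "RDMSR", 32: "WRMSR"}


def decode_vectoring_info(info):
    """Split an IDT-vectoring info value into valid/vector/type (hypothetical helper)."""
    return {
        "valid": bool(info & VALID_BIT),
        "vector": info & VECTOR_MASK,
        "type": EVENT_TYPES.get((info >> TYPE_SHIFT) & TYPE_MASK, "reserved"),
    }


def decode_exit_reason(reason):
    """Name the basic exit reason, which lives in the low 16 bits (hypothetical helper)."""
    basic = reason & 0xFFFF
    return EXIT_REASONS.get(basic, "exit reason %d" % basic)


def decode_mov_ecx_imm32(code):
    """Return the little-endian imm32 of a 5-byte 'b9 imm32' (mov imm32 into ecx)."""
    if len(code) != 5 or code[0] != 0xB9:
        raise ValueError("expected opcode b9 followed by a 4-byte immediate")
    return struct.unpack("<I", code[1:])[0]


if __name__ == "__main__":
    print(decode_vectoring_info(IDT_VECTORING_INFO))
    print(decode_exit_reason(EXIT_REASON))
    print(hex(decode_mov_ecx_imm32(bytes.fromhex("b970000040"))))
```

The interesting part is the combination: a valid vectoring-info field (bit 31 set, vector 47, external interrupt) together with a WRMSR exit reason is exactly the pairing that should not be possible, and the immediate loaded into ECX is 0x40000070, i.e. HV_X64_MSR_EOI.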