On Wed, Jun 15, 2022, Alexander Mikhalitsyn wrote:
> Dear friends,
>
> I'm sorry for disturbing you but I've getting stuck with debugging KVM
> problem and looking for an advice. I'm working mostly on kernel
> containers/CRIU and am newbie with KVM so, I believe that I'm missing
> something very simple.
>
> My case:
> - AMD EPYC 7443P 24-Core Processor (Milan family processor)
> - OpenVZ kernel (based on RHEL7 3.10.0-1160.53.1) on the Host Node (HN)
> - Qemu/KVM VM (8 vCPU assigned) with many different kernels from 3.10.0-1160 RHEL7 to mainline 5.18
>
> Reproducer (run inside VM):
> echo 0 > /sys/devices/system/cpu/cpu3/online
> echo 1 > /sys/devices/system/cpu/cpu3/online <- got reset here
>
> *Not* reproducible on:
> - any Intel which we tried
> - AMD EPYC 7261 (Rome family)

Hmm, given that Milan is problematic but Rome isn't, that implies the bug is
related to a feature that's new in Milan.  PCID is the one that comes to mind,
and IIRC there were issues with PCID (or INVPCID?) in various kernels when
running on Milan.

Can you try hiding PCID and INVPCID from the guest?

> - without KVM (on Host)

...

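For the PCID/INVPCID experiment, here is a minimal sketch.  It assumes the guest
is started by invoking QEMU directly with "-cpu host"; if libvirt or another
management layer builds the command line, the features would have to be disabled
in its CPU definition instead.  has_cpu_flag is a made-up helper name, not an
existing tool.

```shell
# On the host (assumption: QEMU is launched by hand, other options elided):
#   qemu-system-x86_64 ... -cpu host,-pcid,-invpcid

# In the guest: check whether a CPU flag is still advertised.
# has_cpu_flag is a hypothetical helper; the optional second argument lets
# you point it at a saved copy of /proc/cpuinfo instead of the live file.
has_cpu_flag() {
    flag=$1
    cpuinfo=${2:-/proc/cpuinfo}
    # -w matches whole words only, so "pcid" does not match "invpcid".
    grep -m1 '^flags' "$cpuinfo" | grep -qw -- "$flag"
}

for f in pcid invpcid; do
    if has_cpu_flag "$f"; then
        echo "$f: still visible to the guest"
    else
        echo "$f: hidden"
    fi
done
```

If both flags show up as hidden and the CPU online/offline reproducer still
resets the VM, PCID/INVPCID is probably not the culprit.
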
> ==== trace-cmd record -b 20000 -e kvm:kvm_cr -e kvm:kvm_userspace_exit -e probe:* =====
>
> CPU-1834 [003] 69194.833364: kvm_userspace_exit: reason KVM_EXIT_IO (2)
> CPU-1838 [000] 69194.834177: kvm_multiple_exception_L9: (ffffffff814313c6) vcpu=0xffff93ee9a528000
> CPU-1838 [000] 69194.834180: kvm_multiple_exception_L41: (ffffffff81431493) vcpu=0xffff93ee9a528000 exception=0xd000001 has_error=0x0 nr=0xd error_code=0x0 has_payload=0x0
> CPU-1838 [000] 69194.834195: kvm_multiple_exception_L9: (ffffffff814313c6) vcpu=0xffff93ee9a528000
> CPU-1838 [000] 69194.834196: kvm_multiple_exception_L41: (ffffffff81431493) vcpu=0xffff93ee9a528000 exception=0x8000100 has_error=0x0 nr=0x8 error_code=0x0 has_payload=0x0
> CPU-1838 [000] 69194.834200: shutdown_interception_L8: (ffffffff8146e4a0)

If you can modify the host kernel, throwing a WARN in kvm_multiple_exception()
should pinpoint the source of the #GP.  Though you may get unlucky and find that
KVM is just reflecting an intercepted #GP that was first "injected" by hardware.
Note that this could spam the log if KVM is injecting a large number of #GPs.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9cea051ca62e..19d959bf97cc 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -612,6 +612,8 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
 	u32 prev_nr;
 	int class1, class2;
 
+	WARN_ON(nr == GP_VECTOR);
+
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 
 	if (!vcpu->arch.exception.pending && !vcpu->arch.exception.injected) {