Re: [Question] debugging VM cpu hotplug (#GP -> #DF) which results in reset

Sean Christopherson <seanjc@xxxxxxxxxx> · Wed, 15 Jun 2022 15:00:42 +0000

On Wed, Jun 15, 2022, Alexander Mikhalitsyn wrote:
> Dear friends,
> 
> I'm sorry to disturb you, but I'm stuck debugging a KVM problem and am
> looking for advice. I work mostly on kernel containers/CRIU and am a
> newbie with KVM, so I believe I'm missing something very simple.
> 
> My case:
> - AMD EPYC 7443P 24-Core Processor (Milan family processor)
> - OpenVZ kernel (based on RHEL7 3.10.0-1160.53.1) on the Host Node (HN)
> - Qemu/KVM VM (8 vCPU assigned) with many different kernels from 3.10.0-1160 RHEL7 to mainline 5.18
> 
> Reproducer (run inside VM):
> echo 0 > /sys/devices/system/cpu/cpu3/online
> echo 1 > /sys/devices/system/cpu/cpu3/online <- got reset here
> 
> *Not* reproducible on:
> - any Intel which we tried
> - AMD EPYC 7261 (Rome family)

Hmm, given that Milan is problematic but Rome isn't, that implies the bug is related
to a feature that's new in Milan.  PCID is the one that comes to mind, and IIRC there
were issues with PCID (or INVPCID?) in various kernels when running on Milan.

Can you try hiding PCID and INVPCID from the guest?
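
A rough sketch of one way to do that, assuming the VM is started from a plain
QEMU command line: append pcid=off,invpcid=off to the -cpu model string (e.g.
"-cpu host,pcid=off,invpcid=off"; the "host" model is an assumption, use
whatever model the VM already has).  The helper below is hypothetical and only
checks, from inside the guest, that the flags are really gone from
/proc/cpuinfo before you rerun the reproducer.

```shell
#!/bin/sh
# Hypothetical helper: report whether a CPU feature flag is advertised
# in a cpuinfo-style file.
# Usage: cpu_has_flag FLAG [FILE]   (FILE defaults to /proc/cpuinfo)
cpu_has_flag() {
    flag=$1
    file=${2:-/proc/cpuinfo}
    # Only the first "flags" line is inspected (one vCPU is enough here).
    # -w matches whole words, so "pcid" does not match inside "invpcid".
    grep -m1 '^flags' "$file" 2>/dev/null | grep -qw -- "$flag"
}

# Run inside the guest after restarting it with the features disabled.
for f in pcid invpcid; do
    if cpu_has_flag "$f"; then
        echo "$f: still visible to the guest"
    else
        echo "$f: hidden"
    fi
done
```

If both flags show up as hidden and the cpu3 offline/online cycle no longer
resets the VM, that points fairly strongly at PCID/INVPCID handling.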

> - without KVM (on Host)

...

> ==== trace-cmd record -b 20000 -e kvm:kvm_cr -e kvm:kvm_userspace_exit -e probe:* =====
> 
>              CPU-1834  [003] 69194.833364: kvm_userspace_exit:   reason KVM_EXIT_IO (2)
>              CPU-1838  [000] 69194.834177: kvm_multiple_exception_L9: (ffffffff814313c6) vcpu=0xffff93ee9a528000
>              CPU-1838  [000] 69194.834180: kvm_multiple_exception_L41: (ffffffff81431493) vcpu=0xffff93ee9a528000 exception=0xd000001 has_error=0x0 nr=0xd error_code=0x0 has_payload=0x0
>              CPU-1838  [000] 69194.834195: kvm_multiple_exception_L9: (ffffffff814313c6) vcpu=0xffff93ee9a528000
>              CPU-1838  [000] 69194.834196: kvm_multiple_exception_L41: (ffffffff81431493) vcpu=0xffff93ee9a528000 exception=0x8000100 has_error=0x0 nr=0x8 error_code=0x0 has_payload=0x0
>              CPU-1838  [000] 69194.834200: shutdown_interception_L8: (ffffffff8146e4a0)

If you can modify the host kernel, throwing a WARN in kvm_multiple_exception() should
pinpoint the source of the #GP.  Though you may get unlucky and find that KVM is just
reflecting an intercepted #GP that was first "injected" by hardware.  Note that this
could spam the log if KVM is injecting a large number of #GPs.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9cea051ca62e..19d959bf97cc 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -612,6 +612,8 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
        u32 prev_nr;
        int class1, class2;

+       WARN_ON(nr == GP_VECTOR);
+
        kvm_make_request(KVM_REQ_EVENT, vcpu);

        if (!vcpu->arch.exception.pending && !vcpu->arch.exception.injected) {
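
With that applied on the host, the WARN backtrace should land in the host
kernel log once the reproducer runs in the guest.  A minimal sketch for pulling
it out, assuming the splat mentions kvm_multiple_exception by name (the helper
name and the 30 lines of context are arbitrary choices, not anything KVM
provides):

```shell
#!/bin/sh
# Hypothetical helper: print log lines that mention kvm_multiple_exception,
# plus trailing context so the call trace of whoever queued the #GP is visible.
# Usage: show_gp_warns [LOGFILE]   (reads `dmesg` when no file is given)
show_gp_warns() {
    if [ -n "$1" ]; then
        grep -A 30 'kvm_multiple_exception' "$1"
    else
        # Reading the kernel log may require root on the host.
        dmesg | grep -A 30 'kvm_multiple_exception'
    fi
}
```

Run show_gp_warns on the host right after the guest resets; the frames below
kvm_multiple_exception in the trace are the interesting part.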