Re: A really weird guest crash, that ONLY happens on KVM, and ONLY on 6th gen+ Intel Core CPU's

Brian Cowan <brcowan@xxxxxxxxx> · Fri, 20 May 2022 18:03:10 -0400

Well, the weird thing is that this is hypervisor-specific. KVM=kaboom.
VirtualBox is happy, and we can't make this happen on
roughly-analogous ESX hosts. I can't directly test on my (Ubuntu)
laptop because the driver won't build on the Ubuntu 20.04.2
"Hardware enablement" kernel, which is too new for it. But either all
the other hypervisors are doing this wrong by allowing this access, or
KVM is doing it wrong by blocking it.

Not being a kernel expert makes this interesting. I'm passing the
possibility list over the wall to the kernel folks, but most of the
evidence we're seeing **seems** to point to KVM...

On Fri, May 20, 2022 at 11:22 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Fri, May 20, 2022, Brian Cowan wrote:
> > Disabling smap seems to fix the problem...
>
> Mwhahaha, I should have found someone to bet me real money :-)
>
> > Now for the hard question: WHY?
>
> The most likely scenario is that there's a SMAP violation (#PF due to a kernel
> access to user data without an override to tell the CPU that the access is intentional)
> somewhere in the guest that crashes/panics the guest kernel.  Assuming that's the
> case, there are three-ish possibilities:
>
>   1. There's a bug in your company's custom kernel driver.
>   2. There's a SMAP violation somewhere else in RHEL 7.8, which is an 8+ year old
>      frankenkernel...
>   3. There's a bug in your version of KVM related to SMAP virtualization.
>
> #3 begs the question, does this fail on bare metal that supports SMAP?  If so,
> then that rules out #3.
>
> If the crash occurs only when doing stuff related to your custom driver, #1 is
> most likely the culprit.
>
> One way to try and debug further would be to disable EPT in KVM (load kvm_intel with
> ept=0) and then use KVM tracepoints to see when the guest dies.  If it's a SMAP
> violation, there should be an injected SMAP #PF shortly before the guest dies.
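
For the tracepoint suggestion above, here is a rough sketch of how saved
trace output might be filtered for injected page faults whose error code is
consistent with a SMAP violation. Assumptions, none of which come from the
thread: the trace contains kvm_inj_exception events printed roughly as
"kvm_inj_exception: #PF (0x3)" (the exact format differs between kernel
versions), and the helper names are made up for illustration. A matching
error code only marks a candidate, since other supervisor-mode protection
faults (for example a write to a read-only kernel page) produce the same bits.

```python
import re

# Assumed shape of a kvm_inj_exception trace line, e.g.
#   "qemu-kvm-1234 [002] 100.000001: kvm_inj_exception: #PF (0x3)"
# The real format varies by kernel version; adjust the pattern to match yours.
INJ_PF = re.compile(r"kvm_inj_exception:\s+#PF\s+\((0x[0-9a-fA-F]+)\)")

# x86 page-fault error code bits.
PF_PRESENT = 1 << 0  # fault on a present page (protection violation)
PF_WRITE = 1 << 1    # write access (either value is possible for SMAP)
PF_USER = 1 << 2     # access made in user mode
PF_RSVD = 1 << 3     # reserved bit set in a paging-structure entry
PF_INSTR = 1 << 4    # instruction fetch
PF_PK = 1 << 5       # protection-key violation


def smap_candidate(error_code):
    """Return True if the error code fits a SMAP violation:
    a supervisor-mode data access to a present page."""
    excluded = PF_USER | PF_RSVD | PF_INSTR | PF_PK
    return bool(error_code & PF_PRESENT) and not (error_code & excluded)


def find_smap_candidates(trace_lines):
    """Return the trace lines that inject a #PF with a SMAP-like error code."""
    hits = []
    for line in trace_lines:
        match = INJ_PF.search(line)
        if match and smap_candidate(int(match.group(1), 16)):
            hits.append(line.rstrip("\n"))
    return hits
```

To use it, pass the lines of a saved trace (say, a hypothetical
kvm_trace.txt) to find_smap_candidates() and look at the hits closest to the
point where the guest dies.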