Re: A really weird guest crash, that ONLY happens on KVM, and ONLY on 6th gen+ Intel Core CPU's

On Wed, May 18, 2022, Brian Cowan wrote:
> Hi all, looking for hints on a wild crash.
> 
> The company I work for has a kernel driver used to literally make a db
> query result look like a filesystem… The “database” in question being
> a proprietary SCM repository… (ClearCase, for those who have been
> around forever… Like me…)
> 
> We have a crash on mounting the remote repository ONE way (ClearCase
> “Automatic views”) but not another (ClearCase “Dynamic views”) where
> both use the same kernel driver… The guest OS is RHEL 7.8, not
> registered with RH (since the VM is only supposed to last a couple of
> days.) The host OS is Ubuntu 20.04.2 LTS, though that does not seem to
> matter.
> 
> The wild part is that this only happens when the ClearCase host is a
> KVM guest, and only on 6th-generation or newer Intel Core CPUs. It does NOT happen
> on:
> * VMWare Virtual machines configured identically
> * VirtualBox Virtual machines Configured identically
> * 2nd generation intel core hosts running the same KVM release.
> (because OF COURSE my office "secondary desktop" host is ancient...

Heh, Sandy Bridge isn't ancient, we still get bug reports for Core2 :-)

> * A 4th generation I7 host running Ubuntu 22.04 and that version’s
> default KVM. (Because I am a laptop packrat. That laptop had been
> sitting on a bookshelf for 3+ years and I went "what if...")

What kernel version is the 6th gen (Skylake) 20.04.2 host running?  Same question for
the 4th gen (Haswell) 22.04 host.  And if it's not too much trouble, can you try
running the Skylake host with the 22.04 kernel, or vice versa?  Not super high
priority if it's a pain; the fact that the bug goes away based on what's advertised
to the guest suggests this might be a guest bug.  But, it could also be a KVM bug
that's specific to a feature that's only supported in Skylake+.

> If I edit the KVM configuration and change the “mirror host CPU”
> option to use the 2nd or 4th generation CPU options, the crash stops
> happening… If this was happening on physical machines, the VM crash
> would make sense, but it's literally a hypervisor-specific crash.
> 
> Any hints, tips, or comments would be most appreciated... Never
> thought I'd be trying to debug kernel/hypervisor interactions, but
> here I am...

It might be that there's a guest bug.  And even if it's not a guest bug, you can
likely identify exactly what feature is problematic, though it might require
invoking QEMU directly (I don't know exactly what level of vCPU customization
libvirt allows).
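
One cheap way to see what actually differs from the guest's point of view is to
save /proc/cpuinfo from inside the guest under a crashing config and under a
working one, then diff the "flags" lines.  A rough sketch, assuming you've copied
both dumps out as text; the function names are made up for illustration and this
only looks at the first "flags" line:

```python
def cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def flag_diff(crashing, working):
    """Flags advertised in the crashing config but not in the working one."""
    return sorted(cpu_flags(crashing) - cpu_flags(working))


if __name__ == "__main__":
    # Hypothetical, heavily trimmed dumps just to show the shape of the input.
    crashing = "processor\t: 0\nflags\t\t: fpu smap xsavec\n"
    working = "processor\t: 0\nflags\t\t: fpu\n"
    print(flag_diff(crashing, working))
```

Anything that shows up in that diff is a candidate for the feature that's to blame.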

First thing to try: does it repro by explicitly specifying "Skylake-Client" as the
vCPU model?  No idea what libvirt calls that.  If that works, i.e. the guest doesn't
crash, then I think XSAVES would be to blame; AFAICT that's the only thing that might
be exposed by "mirror host CPU" and not the explicit "Skylake-Client".  XSAVES being
to blame seems unlikely though.

Assuming "Skylake-Client" fails, the next step would be to disable features that
are in "Skylake-Client" but not "Haswell", one by one, to figure out what's to
blame.

In QEMU, the features I see being in Skylake but not Haswell are:

  3dnowprefetch, rdseed, adx, smap, xsavec, xgetbv1

Again, no idea if/how libvirt exposes that level of granularity.  For running
QEMU directly, removing all those features would be:

  -cpu Skylake-Client,-3dnowprefetch,-rdseed,-adx,-smap,-xsavec,-xgetbv1
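
To go one by one instead, something like the sketch below would spit out the
-cpu values to try.  It only builds strings in the same syntax as the command
above; the helper names are hypothetical, and the rest of the QEMU command line
is whatever you (or libvirt) already use:

```python
# Features listed above as being in Skylake-Client but not Haswell.
SKYLAKE_ONLY = ["3dnowprefetch", "rdseed", "adx", "smap", "xsavec", "xgetbv1"]


def one_at_a_time(base="Skylake-Client", features=SKYLAKE_ONLY):
    """Return one -cpu value per feature, each disabling only that feature."""
    return [f"{base},-{feat}" for feat in features]


def all_disabled(base="Skylake-Client", features=SKYLAKE_ONLY):
    """Return the -cpu value that disables every listed feature at once."""
    return base + "".join(f",-{feat}" for feat in features)


if __name__ == "__main__":
    # Sanity check first: with everything disabled the crash should go away.
    print(f"-cpu {all_disabled()}")
    # Then boot once per line; the run that stops crashing names the culprit.
    for value in one_at_a_time():
        print(f"-cpu {value}")
```

If no single feature makes the crash go away, it may take a combination, in which
case start from the all-disabled line and re-enable features one by one.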

My money is on SMAP :-)


