Disabling SMAP seems to fix the problem... Now for the hard question: WHY?

I went from "mirror host CPU" to "Skylake client" with "security
mitigations" enabled. I then additionally disabled SMAP by editing the
virsh XML configuration. This left me with this in the XML definition:

  <cpu mode='custom' match='exact' check='partial'>
    <model fallback='allow'>Skylake-Client</model>
    <feature policy='require' name='ibpb'/>
    <feature policy='require' name='md-clear'/>
    <feature policy='require' name='spec-ctrl'/>
    <feature policy='require' name='ssbd'/>
    <feature policy='disable' name='smap'/>
  </cpu>

I then started the VM, did the exact same thing that crashed, and the
crash didn't happen.

Resetting it to just "Skylake client" with CPU security mitigations
enabled crashes again. Failing override:

  <cpu mode='custom' match='exact' check='partial'>
    <model fallback='allow'>Skylake-Client</model>
    <feature policy='require' name='ibpb'/>
    <feature policy='require' name='md-clear'/>
    <feature policy='require' name='spec-ctrl'/>
    <feature policy='require' name='ssbd'/>
  </cpu>

If I go back to mirroring the host configuration, it still crashes, and
this is what the CPU section looks like at runtime:

  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>Skylake-Client-IBRS</model>
    <vendor>Intel</vendor>
    <feature policy='require' name='ss'/>
    <feature policy='require' name='vmx'/>
    <feature policy='require' name='pdcm'/>
    <feature policy='require' name='hypervisor'/>
    <feature policy='require' name='tsc_adjust'/>
    <feature policy='require' name='clflushopt'/>
    <feature policy='require' name='umip'/>
    <feature policy='require' name='md-clear'/>
    <feature policy='require' name='stibp'/>
    <feature policy='require' name='arch-capabilities'/>
    <feature policy='require' name='ssbd'/>
    <feature policy='require' name='xsaves'/>
    <feature policy='require' name='pdpe1gb'/>
    <feature policy='require' name='ibpb'/>
    <feature policy='require' name='ibrs'/>
    <feature policy='require' name='amd-stibp'/>
    <feature policy='require' name='amd-ssbd'/>
    <feature policy='require' name='skip-l1dfl-vmentry'/>
    <feature policy='require' name='pschange-mc-no'/>
    <feature policy='disable' name='mpx'/>
  </cpu>

On Wed, May 18, 2022 at 5:26 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Wed, May 18, 2022, Brian Cowan wrote:
> > Hi all, looking for hints on a wild crash.
> >
> > The company I work for has a kernel driver used to literally make a db
> > query result look like a filesystem… The “database” in question being
> > a proprietary SCM repository… (ClearCase, for those who have been
> > around forever… Like me…)
> >
> > We have a crash on mounting the remote repository ONE way (ClearCase
> > “Automatic views”) but not another (ClearCase “Dynamic views”) where
> > both use the same kernel driver… The guest OS is RHEL 7.8, not
> > registered with RH (since the VM is only supposed to last a couple of
> > days.) The host OS is Ubuntu 20.04.2 LTS, though that does not seem to
> > matter.
> >
> > The wild part is that this only happens when the ClearCase host is a
> > KVM guest, and only on 6th-generation or newer Intel Core hosts. It
> > does NOT happen on:
> > * VMware virtual machines configured identically
> > * VirtualBox virtual machines configured identically
> > * 2nd generation Intel Core hosts running the same KVM release.
> > (because OF COURSE my office "secondary desktop" host is ancient...)
>
> Heh, Sandy Bridge isn't ancient, we still get bug reports for Core2 :-)
>
> > * A 4th generation i7 host running Ubuntu 22.04 and that version’s
> > default KVM. (Because I am a laptop packrat. That laptop had been
> > sitting on a bookshelf for 3+ years and I went "what if...")
>
> What kernel version is the 6th gen (Skylake) 20.04.2 running? Same question for
> the 4th gen (Haswell) 22.04. And if it's not too much trouble, can you try running
> the Skylake with 22.04 kernel, or vice versa? Not super high priority if it's a
> pain, the fact that the bug goes away based on what's advertised to the guest
> suggests this might be a guest bug. But, it could also be a KVM bug that's
> specific to a feature that's only supported in Skylake+.
>
> > If I edit the KVM configuration and change the “mirror host CPU”
> > option to use the 2nd or 4th generation CPU options, the crash stops
> > happening… If this was happening on physical machines, the VM crash
> > would make sense, but it's literally a hypervisor-specific crash.
> >
> > Any hints, tips, or comments would be most appreciated... Never
> > thought I'd be trying to debug kernel/hypervisor interactions, but
> > here I am...
>
> It might be that there's a guest bug. And even if it's not a guest bug, you can
> likely identify exactly what feature is problematic, though it might require
> invoking QEMU directly (I don't know exactly what level of vCPU customization
> libvirt allows).
>
> First thing to try: does it repro by explicitly specifying "Skylake-Client" as the
> vCPU model? No idea what libvirt calls that. If that works, then I think XSAVES
> would be to blame; AFAICT that's the only thing that might be exposed by "mirror
> host CPU" and not the explicit "Skylake-Client". XSAVES being to blame seems unlikely
> though.
>
> Assuming "Skylake-Client" fails, the next step would be to disable features that
> are in "Skylake-Client" but not "Haswell", one by one, to figure out what's to
> blame.
>
> In QEMU, the features I see being in Skylake but not Haswell are:
>
>   3dnowprefetch, rdseed, adx, smap, xsavec, xgetbv1
>
> Again, no idea if/how libvirt exposes that level of granularity. For running
> QEMU directly, removing all those features would be:
>
>   -cpu Skylake-Client,-3dnowprefetch,-rdseed,-adx,-smap,-xsavec,-xgetbv1
>
> My money is on SMAP :-)
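
For anyone wanting to repeat the one-feature-at-a-time test through libvirt instead of invoking QEMU directly, here is a minimal sketch that prints one <cpu> block per candidate feature, based on the failing "Skylake-Client" override above. It assumes libvirt accepts the same feature names as the QEMU flags in the list; only "smap" is confirmed by the test in this thread, so treat the other five names as unverified. The helper names (cpu_xml, one_at_a_time) are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Features listed in the thread as present in QEMU's Skylake-Client but not Haswell.
# Assumption: libvirt uses the same names; only "smap" was verified above.
CANDIDATES = ["3dnowprefetch", "rdseed", "adx", "smap", "xsavec", "xgetbv1"]

# Mitigation features required in the failing "Skylake-Client" override.
REQUIRED = ["ibpb", "md-clear", "spec-ctrl", "ssbd"]


def cpu_xml(disabled):
    """Return a libvirt <cpu> element (as a string) that disables the given features."""
    cpu = ET.Element("cpu", {"mode": "custom", "match": "exact", "check": "partial"})
    model = ET.SubElement(cpu, "model", {"fallback": "allow"})
    model.text = "Skylake-Client"
    for name in REQUIRED:
        ET.SubElement(cpu, "feature", {"policy": "require", "name": name})
    for name in disabled:
        ET.SubElement(cpu, "feature", {"policy": "disable", "name": name})
    return ET.tostring(cpu, encoding="unicode")


def one_at_a_time(candidates=CANDIDATES):
    """Map each candidate feature to a <cpu> block with only that feature disabled."""
    return {name: cpu_xml([name]) for name in candidates}


if __name__ == "__main__":
    for name, xml in one_at_a_time().items():
        print(f"# disable {name}\n{xml}\n")
```

Each printed block would replace the <cpu> element in the domain definition (for example via "virsh edit"), followed by a guest restart and a retry of the crashing mount. If only the "smap" variant avoids the crash, that narrows it down the same way the manual edit did.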