On Wed, May 18, 2022, Brian Cowan wrote:
> Hi all, looking for hints on a wild crash.
>
> The company I work for has a kernel driver used to literally make a db
> query result look like a filesystem… The “database” in question being
> a proprietary SCM repository… (ClearCase, for those who have been
> around forever… Like me…)
>
> We have a crash on mounting the remote repository ONE way (ClearCase
> “Automatic views”) but not another (ClearCase “Dynamic views”) where
> both use the same kernel driver… The guest OS is RHEL 7.8, not
> registered with RH (since the VM is only supposed to last a couple of
> days.) The host OS is Ubuntu 20.04.2 LTS, though that does not seem to
> matter.
>
> The wild part is that this only happens when the ClearCase host is a
> KVM guest, and only on 6th-generation or newer . It does NOT happen
> on:
> * VMWare Virtual machines configured identically
> * VirtualBox Virtual machines Configured identically
> * 2nd generation intel core hosts running the same KVM release.
> (because OF COURSE my office "secondary desktop" host is ancient...

Heh, Sandy Bridge isn't ancient, we still get bug reports for Core2 :-)

> * A 4th generation I7 host running Ubuntu 22.04 and that version’s
> default KVM. (Because I am a laptop packrat. That laptop had been
> sitting on a bookshelf for 3+ years and I went "what if...")

What kernel version is the 6th gen (Skylake) 20.04.2 running? Same
question for the 4th gen (Haswell) 22.04. And if it's not too much
trouble, can you try running the Skylake with 22.04 kernel, or vice
versa? Not super high priority if it's a pain, the fact that the bug
goes away based on what's advertised to the guest suggests this might
be a guest bug. But, it could also be a KVM bug that's specific to a
feature that's only supported in Skylake+.

> If I edit the KVM configuration and change the “mirror host CPU”
> option to use the 2nd or 4th generation CPU options, the crash stops
> happening… If this was happening on physical machines, the VM crash
> would make sense, but it's literally a hypervisor-specific crash.
>
> Any hints, tips, or comments would be most appreciated... Never
> thought I'd be trying to debug kernel/hypervisor interactions, but
> here I am...

It might be that there's a guest bug. And even if it's not a guest bug,
you can likely identify exactly what feature is problematic, though it
might require invoking QEMU directly (I don't know exactly what level
of vCPU customization libvirt allows).

First thing to try: does it repro by explicitly specifying
"Skylake-Client" as the vCPU model? No idea what libvirt calls that.

If that works, then I think XSAVES would be to blame; AFAICT that's the
only thing that might be exposed by "mirror host CPU" and not the
explicit "Skylake-Client". XSAVES being to blame seems unlikely though.

Assuming "Skylake-Client" fails, the next step would be to disable
features that are in "Skylake-Client" but not "Haswell", one by one, to
figure out what's to blame. In QEMU, the features I see being in
Skylake but not Haswell are:

  3dnowprefetch, rdseed, adx, smap, xsavec, xgetbv1

Again, no idea if/how libvirt exposes that level of granularity. For
running QEMU directly, removing all those features would be:

  -cpu Skylake-Client,-3dnowprefetch,-rdseed,-adx,-smap,-xsavec,-xgetbv1

My money is on SMAP :-)
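
If it helps with the one-by-one step, here is a rough Python sketch that
just prints the -cpu value for each run, with a single feature disabled
at a time. The helper names (cpu_arg, one_at_a_time) are made up for
illustration, and the rest of the QEMU command line (disk, memory,
networking) is assumed to be whatever you already use; only the feature
list and the "-feature" syntax come from the text above.

```python
# Features present in QEMU's "Skylake-Client" model but not "Haswell",
# as listed above.
SKYLAKE_ONLY = ["3dnowprefetch", "rdseed", "adx", "smap", "xsavec", "xgetbv1"]


def cpu_arg(model, disabled):
    """Build the value for QEMU's -cpu option with the given features off."""
    return ",".join([model] + ["-" + feature for feature in disabled])


def one_at_a_time(model="Skylake-Client", features=SKYLAKE_ONLY):
    """Yield (feature, -cpu value) pairs, disabling one feature per run."""
    for feature in features:
        yield feature, cpu_arg(model, [feature])


if __name__ == "__main__":
    # All features off at once; this should match the command line above.
    print("-cpu", cpu_arg("Skylake-Client", SKYLAKE_ONLY))
    # One candidate per boot: the run where the crash goes away names the
    # feature to look at.
    for feature, arg in one_at_a_time():
        print(f"{feature:>14}: -cpu {arg}")
```

If disabling any single feature does not make the crash go away, the
same cpu_arg() helper can be fed pairs of features instead.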