On Thu, Dec 14, 2023, bugzilla-daemon@xxxxxxxxxx wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=218259
>
> --- Comment #2 from Joern Heissler (kernelbugs2012@xxxxxxxxxxxxxxxxx) ---
> Hi,
>
> 1. KSM is already disabled. Didn't try to enable it.
> 2. NUMA autobalancing was enabled on the host (value 1), not in the guest. When
> disabled, I can't see the issue anymore.

This is likely/hopefully the same thing Yan encountered[1]. If you are able to
test patches, the proposed fix[2] applies cleanly on v6.6 (note, I need to post
a refreshed version of the series regardless); any feedback you can provide
would be much appreciated.

KVM changes aside, I highly recommend evaluating whether or not NUMA
autobalancing is a net positive for your environment. The interactions between
autobalancing and KVM are often less than stellar, and disabling autobalancing
is sometimes a completely legitimate option/solution.

[1] https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@xxxxxxxxxxxxxxxxxxxxxxxxx
[2] https://lore.kernel.org/all/20230825020733.2849862-1-seanjc@xxxxxxxxxx

> 3. tdp_mmu was "Y", disabling it seems to make no difference.

Hrm, that's odd. The commit blamed by bisection was purely a TDP MMU change.
Did you relaunch VMs after disabling the module params? While the module param
is writable, it's effectively snapshotted by each VM during creation, i.e.
toggling it won't affect running VMs.

> So might be related to NUMA. On older kernels, the flag is 1 as well.
>
> There's one difference in the kernel messages that I hadn't noticed before. The
> newer one prints "pci_bus 0000:7f: Unknown NUMA node; performance will be
> reduced" (same with ff again). The older ones don't. No idea what this means,
> if it's important, and can't find info on the web regarding it.

That was a new message added by commit ad5086108b9f ("PCI: Warn if no host
bridge NUMA node info"), which was first released in v5.5. AFAICT, that warning
is only complaining about the driver code for PCI devices possibly running on
the wrong node. However, if you are seeing that error on v6.1 or v6.6, but not
v5.17, i.e. if the message started showing up well after the printk was added,
then it might be a symptom of an underlying problem, e.g. maybe the kernel is
botching parsing of ACPI tables?

> I think the kernel is preemptible:

Ya, not fully preemptible (voluntary only), but the important part is that KVM
will drop mmu_lock if there is contention (which is a "requirement" for the bug
that Yan encountered).
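
For reference, the "snapshotted at VM creation" behavior is roughly the pattern
below. This is an illustrative sketch only, not the actual KVM code; the 0644
permission and the per-VM kvm->arch.tdp_mmu_enabled copy are assumptions for
the example:

	/*
	 * Illustrative only: a writable module param that each VM snapshots
	 * when it is created.  Writing the param later changes only the
	 * global default, so already-running VMs keep whatever value they
	 * saw at creation time.
	 */
	static bool tdp_mmu_enabled = true;
	module_param_named(tdp_mmu, tdp_mmu_enabled, bool, 0644);

	void kvm_mmu_init_vm(struct kvm *kvm)
	{
		/* hypothetical per-VM copy of the module param */
		kvm->arch.tdp_mmu_enabled = tdp_mmu_enabled;
	}

That's why toggling the param only takes effect for VMs launched after the
write.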
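
And to illustrate the mmu_lock point: even with only voluntary preemption, KVM
explicitly yields mmu_lock when there's contention, roughly along these lines
(again just a sketch of the pattern, the walk_one_step() helper is made up):

	/*
	 * Sketch of the yield-on-contention pattern: while walking MMU
	 * state under mmu_lock, periodically check whether another task is
	 * waiting on the lock (or a reschedule is pending) and, if so, drop
	 * the lock, yield, and reacquire before resuming the walk.
	 */
	static void walk_with_yield(struct kvm *kvm)
	{
		write_lock(&kvm->mmu_lock);
		while (walk_one_step(kvm)) {	/* made-up helper */
			if (rwlock_needbreak(&kvm->mmu_lock) || need_resched())
				cond_resched_rwlock_write(&kvm->mmu_lock);
		}
		write_unlock(&kvm->mmu_lock);
	}

That lock dropping under contention is the window the NUMA autobalancing bug
needs to trigger.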