https://bugzilla.kernel.org/show_bug.cgi?id=218259

--- Comment #6 from Joern Heissler (kernelbugs2012@xxxxxxxxxxxxxxxxx) ---
(In reply to Sean Christopherson from comment #5)
> This is likely/hopefully the same thing Yan encountered[1]. If you are able
> to test patches, the proposed fix[2] applies cleanly on v6.6 (note, I need to
> post a refreshed version of the series regardless), any feedback you can
> provide would be much appreciated.
>
> [1] https://lore.kernel.org/all/ZNnPF4W26ZbAyGto@xxxxxxxxxxxxxxxxxxxxxxxxx
> [2] https://lore.kernel.org/all/20230825020733.2849862-1-seanjc@xxxxxxxxxx

I admit that I don't understand most of what's written in those threads.
I applied the two patches from [2] (excluding [3]) to v6.6 and it appears to
solve the problem.
However, I haven't measured how/if any of the changes/flags affect performance
or if any other problems are caused. After about 1 hour of uptime it appears
to be okay.

[3] https://lore.kernel.org/all/ZPtVF5KKxLhMj58n@xxxxxxxxxx/

> KVM changes aside, I highly recommend evaluating whether or not NUMA
> autobalancing is a net positive for your environment. The interactions
> between autobalancing and KVM are often less than stellar, and disabling
> autobalancing is sometimes a completely legitimate option/solution.

I'll have to evaluate multiple options for my production environment.
Patching+Building the kernel myself would only be a last resort. And it will
probably take a while until Debian ships a patch for the issue. So maybe
disable the NUMA balancing, or perhaps try to pin a VM's memory+cpu to a
single NUMA node.

> > 3. tdp_mmu was "Y", disabling it seems to make no difference.
>
> Hrm, that's odd. The commit blamed by bisection was purely a TDP MMU change.
> Did you relaunch VMs after disabling the module params? While the module
> param is writable, it's effectively snapshotted by each VM during creation,
> i.e. toggling it won't affect running VMs.

It's quite possible that I did not restart the VM afterwards. I tried again,
this time paying attention. Setting it to "N" *does* seem to eliminate the
issue.

> > The newer one prints "pci_bus 0000:7f: Unknown NUMA node; performance will
> > be reduced" (same with ff again). The older ones don't.
>
> That was a new message added by commit ad5086108b9f ("PCI: Warn if no host
> bridge NUMA node info"), which was first released in v5.5.

It seems I looked at systems running older (< v5.5) kernels. On the ones with
v5.10 the message is printed too.

Thanks a lot so far, I believe I've now got enough options to consider for my
production environment.
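
For anyone comparing hosts, here is a minimal sketch (not from the thread) that
reads the two settings discussed above: the NUMA autobalancing sysctl and the
KVM tdp_mmu module parameter. The file paths are assumptions for a typical x86
Linux host with the kvm module loaded, and the helper names read_setting and
report are made up for illustration.

```python
from pathlib import Path

# Assumed locations on a typical x86 Linux host; they may be absent elsewhere.
NUMA_BALANCING = Path("/proc/sys/kernel/numa_balancing")
KVM_TDP_MMU = Path("/sys/module/kvm/parameters/tdp_mmu")


def read_setting(path):
    """Return the stripped file contents, or None if the file cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def report(numa_path=NUMA_BALANCING, tdp_path=KVM_TDP_MMU):
    """Collect both settings into a dict (hypothetical helper, read-only)."""
    return {
        "numa_balancing": read_setting(numa_path),
        "kvm_tdp_mmu": read_setting(tdp_path),
    }


if __name__ == "__main__":
    for name, value in report().items():
        shown = value if value is not None else "not available"
        print(f"{name}: {shown}")
```

A numa_balancing value of 0 means autobalancing is disabled. The tdp_mmu value
is only the current module parameter: as noted above, each VM snapshots it at
creation, so a VM has to be relaunched before a change takes effect.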