On Tue, Sep 27, 2022, Ashish Kalra wrote:
> With this patch applied, we are observing soft lockup and RCU stall issues on
> SNP guests with 128 vCPUs assigned and >=10GB guest memory allocations.

...

> From the call stack dumps, the invocation of migrate_pages() as in the
> following code path does not seem right:
>
>   do_huge_pmd_numa_page
>     migrate_misplaced_page
>       migrate_pages
>
> as all the guest memory for SEV/SNP VMs will be pinned/locked, so why is the
> page migration code path getting invoked at all?

LOL, I feel your pain.  It's the wonderful NUMA autobalancing code.

It's been a while since I looked at the code, but IIRC, it "works" by zapping
PTEs for pages that aren't allocated on the "right" node, without checking
whether page migration is actually possible.  The actual migration is done on
the subsequent page fault.  In this case, the balancer detects that the page
can't be migrated and reinstalls the original PTE.

I don't know if using FOLL_LONGTERM would help?  Again, been a while.

The workaround I've used in the past is to simply disable the balancer, e.g.
CONFIG_NUMA_BALANCING=n or numa_balancing=disable on the kernel command line.
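For reference, a sketch of the workaround knobs (the exact value semantics may vary by kernel version; on recent kernels the sysctl also accepts 2 for memory-tiering mode):

```shell
# Check whether NUMA autobalancing is currently enabled
# (1 = enabled, 0 = disabled).
cat /proc/sys/kernel/numa_balancing

# Disable it at runtime without a reboot (requires root);
# equivalent to booting with numa_balancing=disable.
sysctl -w kernel.numa_balancing=0
```

This only papers over the zap-then-fail-to-migrate churn on pinned guest memory; it doesn't address whether the balancer should be skipping unmigratable pages in the first place.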