Hello All,
Here is a summary of the discussions and options we have covered
off-list:
For SNP guests we don't strictly need to invoke the MMU invalidation
notifiers; the cache flush should instead be done at the point of RMP
ownership change, which is when the unregister_enc_region ioctl is
called. But since we don't trust userspace (which can simply bypass
this ioctl), we continue to use the MMU invalidation notifiers. With
UPM support added, we will avoid the RMP #PF code path that splits the
host page table to keep it in sync with the RMP table entries, so the
mmu_notifier invoked from __split_huge_pmd() will no longer be a
concern.
For the MMU invalidation notifiers, we are currently going to make two
changes:
1). Use clflush/clflushopt instead of wbinvd_on_all_cpus() for ranges
<= 2MB.
This is not entirely straightforward: on SME_COHERENT platforms (Milan
and later), clflush/clflushopt will flush guest-tagged cache entries,
but before Milan (!SME_COHERENT) we still need either the VM_PAGE_FLUSH
MSR or wbinvd to flush guest-tagged cache entries. So for
non-SME_COHERENT platforms there is no change and effectively no
optimization (a rough sketch of the selection logic follows after
point 2 below).
2). Add the filtering in the mmu_notifier path (from Sean's patch),
which invokes the MMU invalidation notifiers depending on the
event/flag passed to the notifier.
This will help reduce the overhead from NUMA balancing and, in
particular, eliminates the mmu_notifier invocations for the
change_protection() case (see the second sketch below).
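
For (1), here is a minimal sketch of the flush selection logic we have
in mind. The helper name snp_flush_enc_range(), its signature, and the
per-page VM_PAGE_FLUSH loop are illustrative assumptions for this
discussion, not the actual patch:

#include <linux/sizes.h>
#include <linux/mem_encrypt.h>
#include <asm/cacheflush.h>
#include <asm/cpufeature.h>
#include <asm/msr.h>
#include <asm/smp.h>

/*
 * Sketch only: flush guest-tagged cache lines for a host VA range that
 * is mapped into an SEV/SNP guest with the given ASID.
 */
static void snp_flush_enc_range(void *va, unsigned long size,
				unsigned int asid)
{
	/*
	 * On SME_COHERENT parts (Milan and later), CLFLUSH/CLFLUSHOPT
	 * also flush guest-tagged (C-bit) cache entries, so a targeted
	 * flush suffices for small ranges.
	 */
	if (boot_cpu_has(X86_FEATURE_SME_COHERENT) && size <= SZ_2M) {
		clflush_cache_range(va, size);
		return;
	}

	/*
	 * Pre-Milan (!SME_COHERENT): CLFLUSH does not reach guest-tagged
	 * entries, so use the VM_PAGE_FLUSH MSR (one 4K page per write)
	 * when available, else fall back to a full WBINVD on all CPUs.
	 * No optimization is possible here.
	 */
	if (boot_cpu_has(X86_FEATURE_VM_PAGE_FLUSH)) {
		unsigned long addr = (unsigned long)va & PAGE_MASK;
		unsigned long end = (unsigned long)va + size;

		for (; addr < end; addr += PAGE_SIZE)
			wrmsrl(MSR_AMD64_VM_PAGE_FLUSH,
			       __sme_clr(addr) | asid);
		return;
	}

	wbinvd_on_all_cpus();
}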
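For (2), a minimal illustration of the event-based filtering; the
helper name is hypothetical and Sean's actual patch may key off
different events or flags:

#include <linux/mmu_notifier.h>

/*
 * Sketch only: decide whether a given invalidation actually requires
 * flushing guest-tagged cache lines.
 */
static bool snp_range_needs_cache_flush(const struct mmu_notifier_range *range)
{
	switch (range->event) {
	/*
	 * Permission-only updates, e.g. change_protection() making PTEs
	 * PROT_NONE for NUMA balancing, don't free or reuse the page,
	 * so the guest-tagged cache entries remain valid and no flush
	 * is needed.
	 */
	case MMU_NOTIFY_PROTECTION_VMA:
	case MMU_NOTIFY_PROTECTION_PAGE:
	case MMU_NOTIFY_SOFT_DIRTY:
		return false;
	/*
	 * Anything that can unmap, free, or migrate the page must still
	 * flush before the page is reused with a different C-bit.
	 */
	default:
		return true;
	}
}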
Thanks,
Ashish
On 9/26/2022 7:37 PM, Sean Christopherson wrote:
> On Tue, Sep 27, 2022, Ashish Kalra wrote:
>> With this patch applied, we are observing soft lockup and RCU stall issues on
>> SNP guests with 128 vCPUs assigned and >=10GB guest memory allocations.
>> ...
>> From the call stack dumps, the invocation of migrate_pages() via the
>> following code path does not seem right:
>>
>>   do_huge_pmd_numa_page
>>     migrate_misplaced_page
>>       migrate_pages
>>
>> as all the guest memory for SEV/SNP VMs will be pinned/locked, so why is
>> the page migration code path getting invoked at all?
> LOL, I feel your pain. It's the wonderful NUMA autobalancing code. It's been a
> while since I looked at the code, but IIRC, it "works" by zapping PTEs for pages
> that aren't allocated on the "right" node without checking if page migration is
> actually possible.
>
> The actual migration is done on the subsequent page fault. In this case, the
> balancer detects that the page can't be migrated and reinstalls the original PTE.
>
> I don't know if using FOLL_LONGTERM would help? Again, been a while. The
> workaround I've used in the past is to simply disable the balancer, e.g.
>
>   CONFIG_NUMA_BALANCING=n
>
> or
>
>   numa_balancing=disable
>
> on the kernel command line.