Hello All,
Here is a summary of the discussions and options we have covered
off-list:
For SNP guests we don't strictly need to invoke the MMU invalidation
notifiers; the cache flush should instead be done at the point of RMP
ownership change, which is when the unregister_enc_region ioctl is
called. But since we don't trust userspace (which can simply bypass
this ioctl), we continue to use the MMU invalidation notifiers. With
UPM support added, we will avoid the RMP #PF code path that splits the
host page table to keep it in sync with the RMP table entries, so the
mmu_notifier invoked from __split_huge_pmd() will no longer be a
concern.
For the MMU invalidation notifiers, we are currently going to make two
changes:
1). Use clflush/clflushopt instead of wbinvd_on_all_cpus() for ranges
<= 2MB.
This is not entirely straightforward: on SME_COHERENT platforms (Milan
and later), clflush/clflushopt will flush guest-tagged cache entries,
but before Milan (!SME_COHERENT) we still need either the VM_PAGE_FLUSH
MSR or wbinvd to flush guest-tagged cache entries. So for
non-SME_COHERENT platforms there is no change and effectively no
optimization (a rough sketch of the selection logic follows after
point 2 below).
2). Add the filtering in the mmu_notifier path (from Sean's patch),
which invokes the MMU invalidation notifiers depending on the
event/flag passed to the notifier.
This will help reduce the overhead from NUMA balancing and, in
particular, eliminates the mmu_notifier invocations for the
change_protection() case (see the second sketch below).
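
For (1), here is a minimal sketch of the flush selection logic we have
in mind. The helper name snp_flush_enc_range(), its signature, and the
per-page VM_PAGE_FLUSH loop are illustrative assumptions for this
discussion, not the actual patch:

#include <linux/sizes.h>
#include <linux/mem_encrypt.h>
#include <asm/cacheflush.h>
#include <asm/cpufeature.h>
#include <asm/msr.h>
#include <asm/smp.h>

/*
 * Sketch only: flush guest-tagged cache lines for a host VA range that
 * is mapped into an SEV/SNP guest with the given ASID.
 */
static void snp_flush_enc_range(void *va, unsigned long size,
				unsigned int asid)
{
	/*
	 * On SME_COHERENT parts (Milan and later), CLFLUSH/CLFLUSHOPT
	 * also flush guest-tagged (C-bit) cache entries, so a targeted
	 * flush suffices for small ranges.
	 */
	if (boot_cpu_has(X86_FEATURE_SME_COHERENT) && size <= SZ_2M) {
		clflush_cache_range(va, size);
		return;
	}

	/*
	 * Pre-Milan (!SME_COHERENT): CLFLUSH does not reach guest-tagged
	 * entries, so use the VM_PAGE_FLUSH MSR (one 4K page per write)
	 * when available, else fall back to a full WBINVD on all CPUs.
	 * No optimization is possible here.
	 */
	if (boot_cpu_has(X86_FEATURE_VM_PAGE_FLUSH)) {
		unsigned long addr = (unsigned long)va & PAGE_MASK;
		unsigned long end = (unsigned long)va + size;

		for (; addr < end; addr += PAGE_SIZE)
			wrmsrl(MSR_AMD64_VM_PAGE_FLUSH,
			       __sme_clr(addr) | asid);
		return;
	}

	wbinvd_on_all_cpus();
}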
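For (2), a minimal illustration of the event-based filtering; the
helper name is hypothetical and Sean's actual patch may key off
different events or flags:

#include <linux/mmu_notifier.h>

/*
 * Sketch only: decide whether a given invalidation actually requires
 * flushing guest-tagged cache lines.
 */
static bool snp_range_needs_cache_flush(const struct mmu_notifier_range *range)
{
	switch (range->event) {
	/*
	 * Permission-only updates, e.g. change_protection() making PTEs
	 * PROT_NONE for NUMA balancing, don't free or reuse the page,
	 * so the guest-tagged cache entries remain valid and no flush
	 * is needed.
	 */
	case MMU_NOTIFY_PROTECTION_VMA:
	case MMU_NOTIFY_PROTECTION_PAGE:
	case MMU_NOTIFY_SOFT_DIRTY:
		return false;
	/*
	 * Anything that can unmap, free, or migrate the page must still
	 * flush before the page is reused with a different C-bit.
	 */
	default:
		return true;
	}
}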
Thanks,
Ashish
On 9/26/2022 7:37 PM, Sean Christopherson wrote:
> On Tue, Sep 27, 2022, Ashish Kalra wrote:
>> With this patch applied, we are observing soft lockup and RCU stall issues on
>> SNP guests with 128 vCPUs assigned and >=10GB guest memory allocations.
>> ...
>> From the call stack dumps, the invocation of migrate_pages() via the
>> following code path does not seem right:
>>
>>   do_huge_pmd_numa_page
>>     migrate_misplaced_page
>>       migrate_pages
>>
>> as all the guest memory for SEV/SNP VMs will be pinned/locked, so why is
>> the page migration code path getting invoked at all?
> LOL, I feel your pain. It's the wonderful NUMA autobalancing code. It's been a
> while since I looked at the code, but IIRC, it "works" by zapping PTEs for pages
> that aren't allocated on the "right" node without checking if page migration is
> actually possible.
>
> The actual migration is done on the subsequent page fault. In this case, the
> balancer detects that the page can't be migrated and reinstalls the original PTE.
>
> I don't know if using FOLL_LONGTERM would help? Again, been a while. The
> workaround I've used in the past is to simply disable the balancer, e.g.
>
>   CONFIG_NUMA_BALANCING=n
>
> or
>
>   numa_balancing=disable
>
> on the kernel command line.