Re: [PATCH] KVM: SVM: Consider NUMA affinity when allocating per-CPU save_area

Yosry Ahmed <yosryahmed@xxxxxxxxxx> · Tue, 23 Apr 2024 11:00:58 -0700

On Tue, Apr 23, 2024 at 6:30 AM Peter Gonda <pgonda@xxxxxxxxxx> wrote:
>
> On Mon, Apr 22, 2024 at 6:29 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >
> > +Tom, Mike, and Peter
> >
> > On Thu, Apr 18, 2024, Li RongQing wrote:
> > > save_area of per-CPU svm_data are dominantly accessed from their
> > > own local CPUs, so allocate them node-local for performance reason
> > >
> > > Signed-off-by: Li RongQing <lirongqing@xxxxxxxxx>
> > > ---
> > >  arch/x86/kvm/svm/sev.c | 6 +++---
> > >  arch/x86/kvm/svm/svm.c | 2 +-
> > >  arch/x86/kvm/svm/svm.h | 6 +++++-
> > >  3 files changed, 9 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> > > index 61a7531..cce8ec7 100644
> > > --- a/arch/x86/kvm/svm/sev.c
> > > +++ b/arch/x86/kvm/svm/sev.c
> > > @@ -3179,13 +3179,13 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
> > >       ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, 1);
> > >  }
> > >
> > > -struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu)
> > > +struct page *snp_safe_alloc_page_node(struct kvm_vcpu *vcpu, int node)
> > >  {
> > >       unsigned long pfn;
> > >       struct page *p;
> > >
> > >       if (!cc_platform_has(CC_ATTR_HOST_SEV_SNP))
> > > -             return alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
> > > +             return alloc_pages_node(node, GFP_KERNEL_ACCOUNT | __GFP_ZERO, 0);
> > >
> > >       /*
> > >        * Allocate an SNP-safe page to workaround the SNP erratum where
> > > @@ -3196,7 +3196,7 @@ struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu)
> > >        * Allocate one extra page, choose a page which is not
> > >        * 2MB-aligned, and free the other.
> > >        */
> > > -     p = alloc_pages(GFP_KERNEL_ACCOUNT | __GFP_ZERO, 1);
> > > +     p = alloc_pages_node(node, GFP_KERNEL_ACCOUNT | __GFP_ZERO, 1);
> >
> > This made me realize the existing code is buggy.  The allocation for the per-CPU
> > save area shouldn't be accounted.
> >
> > Also, what's the point of taking @vcpu?  It's a nice enough flag to say "this
> > should be accounted", but it's decidely odd.
> >
> > How about we fix both in a single series, and end up with this over 3-4 patches?
> > I.e. pass -1 where vcpu is non-NULL, and the current CPU for the save area.
>
> Looks good to me. Internally we already use GFP_KERNEL for these
> allocations. But we had an issue with split_page() and memcg
> accounting internally. Yosry submitted the following:
>
> +  if (memcg_kmem_charge(p, GFP_KERNEL_ACCOUNT, 0)) {
> +    __free_page(p);
> +    return NULL;
> +  }
>
> Not sure if this is an issue with our kernel or if we should use
> split_page_memcg() here? It was suggested internally but we didn't
> want to backport it.

The referenced internal code is in an older kernel where split_page()
was buggy and did not handle memcg charging correctly (did not call
split_page_memcg()). So we needed to make the allocation with
GFP_KERNEL, then manually account the used page after splitting and
freeing the unneeded page.

AFAICT, this is not needed upstream and the current code is fine.

>
> >
> > struct page *snp_safe_alloc_page(int cpu)
> > {
> >         unsigned long pfn;
> >         struct page *p;
> >         gfp_t gpf;
> >         int node;
> >
> >         if (cpu >= 0) {
> >                 node = cpu_to_node(cpu);
> >                 gfp = GFP_KERNEL;
> >         } else {
> >                 node = NUMA_NO_NODE;
> >                 gfp = GFP_KERNEL_ACCOUNT
> >         }

FWIW, from the pov of someone who has zero familiarity with this code,
passing @cpu only to make inferences about the GFP flags and numa
nodes is confusing tbh.

Would it be clearer to pass in the gfp flags and node id directly? I
think it would make it clearer why we choose to account the allocation
and/or specify a node in some cases but not others.

> >         gfp |= __GFP_ZERO;
> >
> >         if (!cc_platform_has(CC_ATTR_HOST_SEV_SNP))
> >                 return alloc_pages_node(node, gfp, 0);
> >
> >         /*
> >          * Allocate an SNP-safe page to workaround the SNP erratum where
> >          * the CPU will incorrectly signal an RMP violation #PF if a
> >          * hugepage (2MB or 1GB) collides with the RMP entry of a
> >          * 2MB-aligned VMCB, VMSA, or AVIC backing page.
> >          *
> >          * Allocate one extra page, choose a page which is not
> >          * 2MB-aligned, and free the other.
> >          */
> >         p = alloc_pages_node(node, gfp, 1);
> >         if (!p)
> >                 return NULL;
> >
> >         split_page(p, 1);
> >
> >         pfn = page_to_pfn(p);
> >         if (IS_ALIGNED(pfn, PTRS_PER_PMD))
> >                 __free_page(p++);
> >         else
> >                 __free_page(p + 1);
> >
> >         return p;
> > }
> >