Re: [RFC PATCH 12/15] KVM: x86/mmu: Split large pages when dirty logging is enabled

On Tue, Nov 30, 2021 at 5:01 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Tue, Nov 30, 2021, David Matlack wrote:
> > On Fri, Nov 26, 2021 at 4:01 AM Peter Xu <peterx@xxxxxxxxxx> wrote:
> > >
> > > Hi, David,
> > >
> > > On Fri, Nov 19, 2021 at 11:57:56PM +0000, David Matlack wrote:
> > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > index 2a7564703ea6..432a4df817ec 100644
> > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > @@ -1232,6 +1232,9 @@ struct kvm_arch {
> > > >       hpa_t   hv_root_tdp;
> > > >       spinlock_t hv_root_tdp_lock;
> > > >  #endif
> > > > +
> > > > +     /* MMU caches used when splitting large pages during VM-ioctls. */
> > > > +     struct kvm_mmu_memory_caches split_caches;
> > >
> > > Are mmu_gfn_array_cache and mmu_pte_list_desc_cache wasted here?  I saw that
> > > "struct kvm_mmu_memory_cache" still takes up a few hundred bytes, so I just
> > > want to make sure we won't waste them in vain.
> >
> > Yes, they are wasted right now. But there are a couple of things to keep in mind:
> >
> > 1. They are also wasted in every vCPU (in the per-vCPU caches) that
> > does not use the shadow MMU.
> > 2. They will (I think) be used eventually when I add Eager Page
> > Splitting support to the shadow MMU.
> > 3. split_caches is per-VM so it's only a few hundred bytes per VM.
> >
> > If we really want to save the memory, the right way forward might be to
> > make each kvm_mmu_memory_cache a pointer instead of an embedded
> > struct. Then we can allocate each dynamically only as needed. I can
> > add that to my TODO list, but I don't think it'd be worth blocking this
> > on it given the points above.
> >
> > >
> > > [...]
> > >
> > > > +int mmu_topup_split_caches(struct kvm *kvm)
> > > > +{
> > > > +     struct kvm_mmu_memory_caches *split_caches = &kvm->arch.split_caches;
> > > > +     int r;
> > > > +
> > > > +     assert_split_caches_invariants(kvm);
> > > > +
> > > > +     r = kvm_mmu_topup_memory_cache(&split_caches->page_header_cache, 1);
> > > > +     if (r)
> > > > +             goto out;
> > > > +
> > > > +     r = kvm_mmu_topup_memory_cache(&split_caches->shadow_page_cache, 1);
> > > > +     if (r)
> > > > +             goto out;
> > >
> > > Is it intended to only top-up with one cache object?  IIUC this means we'll try
> > > to proactively yield the cpu for each huge page split right after the
> > > object is consumed.
> > >
> > > Wondering whether it'd be more efficient to make it a slightly larger number, so
> > > we don't overload the memory but also make the loop a bit more efficient.
> >
> > IIUC, 1 here is just the min needed for kvm_mmu_topup_memory_cache to
> > return success. I chose 1 for each because it's the minimum necessary
> > to make forward progress (split one large page).
>
> The @min parameter is the minimum number of pages that _must_ be available in the
> cache, i.e. it's the maximum number of pages that can theoretically be used by
> whatever upcoming operation is going to be consuming pages from the cache.
>
> So '1' is technically correct, but I think it's the wrong choice given the behavior
> of this code.  E.g. if there's 1 object in the cache, the initial top-up will do
> nothing,

This scenario will not happen though, since we free the caches after
splitting. So, the next time userspace enables dirty logging on a
memslot and we go to do the initial top-up, the caches will have 0
objects.

> and then tdp_mmu_split_large_pages_root() will almost immediately drop
> mmu_lock to topup the cache.  Since the in-loop usage explicitly checks for an
> empty cache, i.e. any non-zero @min will have identical behavior, I think it makes
> sense to use KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE _and_ add a comment explaining why.

If we set the min to KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE,
kvm_mmu_topup_memory_cache will return -ENOMEM if it can't allocate at
least KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE objects, even though we really
only need 1 to make forward progress.

It's a total edge case, but there could be a scenario where userspace
sets the cgroup memory limit so tight that we can't allocate
KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE objects when splitting the last few
pages, even though we'd really only need 1 or 2 objects to finish
splitting. In that case we'd end up with a spurious pr_warn and might
not split the last few pages, depending on which cache failed to get
topped up.
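
To make that concrete, here's a quick user-space model of the top-up
behavior as I understand it. This is just a paraphrase for illustration,
not the actual code in virt/kvm/kvm_main.c: CACHE_CAPACITY stands in for
KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE and alloc_budget stands in for the
cgroup memory limit.

#include <errno.h>
#include <stdio.h>

#define CACHE_CAPACITY 40	/* stand-in for KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE */

struct cache_model {
	int nobjs;
};

/* Stand-in for the cgroup limit: how many more objects can be allocated. */
static int alloc_budget;

static int alloc_obj(void)
{
	if (alloc_budget <= 0)
		return -ENOMEM;
	alloc_budget--;
	return 0;
}

/*
 * Paraphrase of kvm_mmu_topup_memory_cache(): do nothing if the cache
 * already holds @min objects, otherwise fill it to capacity and fail
 * only if fewer than @min objects could be allocated.
 */
static int topup(struct cache_model *mc, int min)
{
	if (mc->nobjs >= min)
		return 0;
	while (mc->nobjs < CACHE_CAPACITY) {
		if (alloc_obj())
			return mc->nobjs >= min ? 0 : -ENOMEM;
		mc->nobjs++;
	}
	return 0;
}

int main(void)
{
	struct cache_model mc = { .nobjs = 0 };
	int ret;

	/* Plenty of memory: even min=1 fills the empty cache to capacity. */
	alloc_budget = 1000;
	ret = topup(&mc, 1);
	printf("min=1:              ret=%d nobjs=%d\n", ret, mc.nobjs);

	/* Tight cgroup limit near the end of splitting: 2 objects left. */
	mc.nobjs = 0;
	alloc_budget = 2;
	ret = topup(&mc, CACHE_CAPACITY);
	printf("min=CACHE_CAPACITY: ret=%d nobjs=%d\n", ret, mc.nobjs);
	ret = topup(&mc, 1);
	printf("min=1:              ret=%d nobjs=%d\n", ret, mc.nobjs);

	return 0;
}

With min=CACHE_CAPACITY the last top-up reports -ENOMEM (and would
trigger the pr_warn) even though the 2 objects it did allocate are
enough to finish splitting; with min=1 the same allocation counts as
success and splitting can keep making forward progress.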


>
> > No matter what you pass for min, kvm_mmu_topup_memory_cache() will
> > still always try to allocate KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
> > objects.
>
> No, it will try to allocate KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE if and only if there
> are fewer than @min objects in the cache.


