Re: [RFC PATCH 12/15] KVM: x86/mmu: Split large pages when dirty logging is enabled

Peter Xu <peterx@xxxxxxxxxx> · Wed, 1 Dec 2021 10:29:41 +0800

On Tue, Nov 30, 2021 at 05:29:10PM -0800, David Matlack wrote:
> On Tue, Nov 30, 2021 at 5:01 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >
> > On Tue, Nov 30, 2021, David Matlack wrote:
> > > On Fri, Nov 26, 2021 at 4:01 AM Peter Xu <peterx@xxxxxxxxxx> wrote:
> > > >
> > > > Hi, David,
> > > >
> > > > On Fri, Nov 19, 2021 at 11:57:56PM +0000, David Matlack wrote:
> > > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > > index 2a7564703ea6..432a4df817ec 100644
> > > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > > @@ -1232,6 +1232,9 @@ struct kvm_arch {
> > > > >       hpa_t   hv_root_tdp;
> > > > >       spinlock_t hv_root_tdp_lock;
> > > > >  #endif
> > > > > +
> > > > > +     /* MMU caches used when splitting large pages during VM-ioctls. */
> > > > > +     struct kvm_mmu_memory_caches split_caches;
> > > >
> > > > Are mmu_gfn_array_cache and mmu_pte_list_desc_cache wasted here?  I saw that
> > > > "struct kvm_mmu_memory_cache" still takes up quite a few hundreds of bytes,
> > > > just want to make sure we won't waste them in vain.
> > >
> > > Yes they are wasted right now. But there's a couple of things to keep in mind:
> > >
> > > 1. They are also wasted in every vCPU (in the per-vCPU caches) that
> > > does not use the shadow MMU.
> > > 2. They will (I think) be used eventually when I add Eager Page
> > > Splitting support to the shadow MMU.
> > > 3. split_caches is per-VM so it's only a few hundred bytes per VM.
> > >
> > > If we really want to save the memory the right way forward might be to
> > > make each kvm_mmu_memory_cache a pointer instead of an embedded
> > > struct. Then we can allocate each dynamically only as needed. I can
> > > add that to my TODO list but I don't think it'd be worth blocking this
> > > on it given the points above.

Yeah I never meant to block this series just for this. :)

If there's plan to move forward with shadow mmu support and they'll be needed
at last, then it's good to me to keep it as is.  Maybe before adding the shadow
mmu support we add a comment above the structure?  Depending on whether the
shadow mmu support is in schedule or not, I think.

> > >
> > > >
> > > > [...]
> > > >
> > > > > +int mmu_topup_split_caches(struct kvm *kvm)
> > > > > +{
> > > > > +     struct kvm_mmu_memory_caches *split_caches = &kvm->arch.split_caches;
> > > > > +     int r;
> > > > > +
> > > > > +     assert_split_caches_invariants(kvm);
> > > > > +
> > > > > +     r = kvm_mmu_topup_memory_cache(&split_caches->page_header_cache, 1);
> > > > > +     if (r)
> > > > > +             goto out;
> > > > > +
> > > > > +     r = kvm_mmu_topup_memory_cache(&split_caches->shadow_page_cache, 1);
> > > > > +     if (r)
> > > > > +             goto out;
> > > >
> > > > Is it intended to only top-up with one cache object?  IIUC this means we'll try
> > > > to proactively yield the cpu for each of the huge page split right after the
> > > > object is consumed.
> > > >
> > > > Wondering whether it be more efficient to make it a slightly larger number, so
> > > > we don't overload the memory but also make the loop a bit more efficient.
> > >
> > > IIUC, 1 here is just the min needed for kvm_mmu_topup_memory_cache to
> > > return success. I chose 1 for each because it's the minimum necessary
> > > to make forward progress (split one large page).
> >
> > The @min parameter is minimum number of pages that _must_ be available in the
> > cache, i.e. it's the maximum number of pages that can theoretically be used by
> > whatever upcoming operation is going to be consuming pages from the cache.
> >
> > So '1' is technically correct, but I think it's the wrong choice given the behavior
> > of this code.  E.g. if there's 1 object in the cache, the initial top-up will do
> > nothing,
> 
> This scenario will not happen though, since we free the caches after
> splitting. So, the next time userspace enables dirty logging on a
> memslot and we go to do the initial top-up the caches will have 0
> objects.
> 
> > and then tdp_mmu_split_large_pages_root() will almost immediately drop
> > mmu_lock to topup the cache.  Since the in-loop usage explicitly checks for an
> > empty cache, i.e. any non-zero @min will have identical behavior, I think it makes
> > sense to use KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE _and_ add a comment explaining why.
> 
> If we set the min to KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE,
> kvm_mmu_topup_memory_cache will return ENOMEM if it can't allocate at
> least KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE objects, even though we really
> only need 1 to make forward progress.
> 
> It's a total edge case but there could be a scenario where userspace
> sets the cgroup memory limits so tight that we can't allocate
> KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE objects when splitting the last few
> pages and in the end we only needed 1 or 2 objects to finish
> splitting. In this case we'd end up with a spurious pr_warn and may
> not split the last few pages depending on which cache failed to get
> topped up.

IMHO when -ENOMEM happens, instead of keep trying with 1 shadow sp we should
just bail out even earlier.

Say, if we only have 10 (<40) pages left for shadow sp's use, we'd better make
good use of them lazily to be consumed in follow up page faults when the guest
accessed any of the huge pages, rather than we take them all over to split the
next continuous huge pages assuming it'll be helpful..

>From that POV I have a slight preference over Sean's suggestion because that'll
make us fail earlier.  But I agree it shouldn't be a big deal.

Thanks,

-- 
Peter Xu