On Tue, Nov 30, 2021 at 8:04 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> On Tue, Nov 30, 2021 at 04:17:01PM -0800, David Matlack wrote:
> > On Tue, Nov 30, 2021 at 4:16 PM David Matlack <dmatlack@xxxxxxxxxx> wrote:
> > >
> > > On Fri, Nov 26, 2021 at 4:17 AM Peter Xu <peterx@xxxxxxxxxx> wrote:
> > > >
> > > > On Fri, Nov 19, 2021 at 11:57:57PM +0000, David Matlack wrote:
> > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > > index 6768ef9c0891..4e78ef2dd352 100644
> > > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > > @@ -1448,6 +1448,12 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> > > > >                 gfn_t start = slot->base_gfn + gfn_offset + __ffs(mask);
> > > > >                 gfn_t end = slot->base_gfn + gfn_offset + __fls(mask);
> > > > >
> > > > > +               /*
> > > > > +                * Try to proactively split any large pages down to 4KB so that
> > > > > +                * vCPUs don't have to take write-protection faults.
> > > > > +                */
> > > > > +               kvm_mmu_try_split_large_pages(kvm, slot, start, end, PG_LEVEL_4K);
> > > > > +
> > > > >                 kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
> > > > >
> > > > >                 /* Cross two large pages? */
> > > > Is it intended to try split every time even if we could have split it already?
> > > > As I remember Paolo mentioned that we can skip split if it's not the 1st
> > > > CLEAR_LOG on the same range, and IIUC that makes sense.
> > > >
> > > > But indeed I don't see a trivial way to know whether this is the first clear of
> > > > this range. Maybe we can maintain "how many huge pages are there under current
> > > > kvm_mmu_page node" somehow? Then if root sp has the counter==0, then we can
> > > > skip it. Just a wild idea..
> > > >
> > > > Or maybe it's intended to try split unconditionally for some reason? If so
> > > > it'll be great to mention that either in the commit message or in comments.
> > > Thanks for calling this out. Could the same be said about the existing
> > > code that unconditionally tries to write-protect 2M+ pages?
>
> They're different because wr-protect can be restored (to be not-wr-protected)
> when vcpu threads write to the pages, so they need to be always done.

That's true for 4K pages, but not for write-protecting 2M+ pages (which is
what we're discussing here). Once KVM write-protects a 2M+ page, it should
never need to write-protect it again, but we always try to here. Same goes
with splitting.

>
> For huge page split - when it happened during dirty tracking it'll not be
> recovered anymore, so it's a one-time thing.
>
> > > I aimed to keep parity with the write-protection calls (always try to split
> > > before write-protecting) but I agree there might be opportunities available
> > > to skip altogether.
>
> So IMHO it's not about parity but it could be about how easy can it be
> implemented, and whether it'll be worth it to add that complexity.

Agreed.

>
> Besides the above accounting idea per-sp, we can have other ways to do this
> too, e.g., keeping a bitmap showing which range has been split: that bitmap
> will be 2M in granule for x86 because that'll be enough. We init-all-ones for
> the bitmap when start logging for a memslot.
>
> But again maybe it turns out we don't really want that complexity.
>
> IMHO a good start could be the perf numbers (which I asked in the cover letter)
> comparing the overhead of 2nd+ iterations of CLEAR_LOG with/without eager page
> split.

Ack. I'll be sure to include these in v1!
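
For reference, here is a rough userspace model of the bitmap idea you
describe above (all names and the layout are made up for illustration,
this is not actual KVM code): one bit per 2M chunk of the memslot, all
bits set when dirty logging starts, and a chunk's bit cleared the first
time we attempt to split it so that 2nd+ CLEAR_LOG calls can skip the
split walk entirely.

/*
 * Userspace model only -- illustrative names, not KVM code.
 */
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGES_PER_2M    512UL   /* 4K pages per 2M chunk */
#define BITS_PER_LONG   (sizeof(unsigned long) * CHAR_BIT)

struct slot_split_bitmap {
        unsigned long base_gfn;         /* first 4K gfn of the memslot */
        unsigned long npages;           /* memslot size in 4K pages */
        unsigned long *need_split;      /* 1 bit per 2M chunk */
};

/* Called when dirty logging starts: assume every chunk may be huge. */
static void split_bitmap_init(struct slot_split_bitmap *s,
                              unsigned long base_gfn, unsigned long npages)
{
        unsigned long chunks = (npages + PAGES_PER_2M - 1) / PAGES_PER_2M;
        unsigned long words = (chunks + BITS_PER_LONG - 1) / BITS_PER_LONG;

        s->base_gfn = base_gfn;
        s->npages = npages;
        s->need_split = malloc(words * sizeof(unsigned long));
        if (!s->need_split)
                abort();
        memset(s->need_split, 0xff, words * sizeof(unsigned long));
}

/* Returns 1 (and clears the bit) only on the first call for a chunk. */
static int test_and_clear_need_split(struct slot_split_bitmap *s,
                                     unsigned long gfn)
{
        unsigned long chunk = (gfn - s->base_gfn) / PAGES_PER_2M;
        unsigned long *word = &s->need_split[chunk / BITS_PER_LONG];
        unsigned long mask = 1UL << (chunk % BITS_PER_LONG);
        int was_set = !!(*word & mask);

        *word &= ~mask;
        return was_set;
}

int main(void)
{
        struct slot_split_bitmap s;

        split_bitmap_init(&s, 0x100000, 4096);  /* 16M slot at gfn 0x100000 */

        /* 1st CLEAR_LOG on the first 2M chunk: split work is still needed. */
        printf("1st clear: need split? %d\n",
               test_and_clear_need_split(&s, 0x100000));
        /* 2nd CLEAR_LOG on the same chunk: the split walk can be skipped. */
        printf("2nd clear: need split? %d\n",
               test_and_clear_need_split(&s, 0x100000));

        free(s.need_split);
        return 0;
}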
>
> > >
> > > By the way, looking at this code again I think I see some potential bugs:
> > > - I don't think I ever free split_caches in the initially-all-set case.
>
> I saw that it's freed in kvm_mmu_try_split_large_pages(), no?

Ah yes you are right. I misremembered how I implemented it and thought we
kept the split_caches around across calls to CLEAR_LOG. (We probably should
TBH. The current implementation is quite wasteful.)

>
> > > - What happens if splitting fails the CLEAR_LOG but succeeds the
> > > CLEAR_LOG?
> >
> > Gah, meant to say "first CLEAR_LOG" and "second CLEAR_LOG" here.
> >
> > > We would end up propagating the write-protection on the 2M
> > > page down to the 4K page. This might cause issues if using PML.
>
> Hmm looks correct.. I'm wondering what will happen with that.
>
> Firstly this should be rare as the 1st split should in 99% cases succeed.
>
> Then if split failed at the 1st attempt, we wr-protected sptes even during pml
> during the split. When written, we'll go the fast page fault and record the
> writes too, I think, as we'll apply dirty bit to the new spte so I think it'll
> just skip pml. Looks like we'll be using a mixture of pml+wp but all dirty
> will still be captured as expected?..

That's what I was hoping for. I'll double check for v1.

>
> There could be leftover wp when stopping dirty logging, but that seems not
> directly harmful too.

It'll make things a bit messed up, at least.

>
> --
> Peter Xu
>
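
Coming back to the split_caches point above: by "keeping them around" I
mean roughly the following (userspace model only, names are made up, the
real kvm_mmu_memory_cache API differs in the details) -- top the cache up
on each CLEAR_LOG and only free it when dirty logging is disabled, rather
than allocating and freeing it on every call.

/*
 * Illustrative-only sketch, not KVM code.
 */
#include <stdio.h>
#include <stdlib.h>

#define CACHE_CAPACITY 40

struct split_cache {
        int nobjs;                      /* objects currently preallocated */
        void *objs[CACHE_CAPACITY];
};

/* Refill up to capacity; cheap when the cache is already mostly full. */
static int cache_topup(struct split_cache *c)
{
        while (c->nobjs < CACHE_CAPACITY) {
                void *obj = calloc(1, 4096);    /* stand-in for a page-table page */

                if (!obj)
                        return -1;
                c->objs[c->nobjs++] = obj;
        }
        return 0;
}

static void *cache_alloc(struct split_cache *c)
{
        return c->nobjs ? c->objs[--c->nobjs] : NULL;
}

static void cache_free_all(struct split_cache *c)
{
        while (c->nobjs)
                free(c->objs[--c->nobjs]);
}

int main(void)
{
        struct split_cache cache = { 0 };
        int call;

        /* The cache survives across CLEAR_LOG calls; each call only tops it up. */
        for (call = 1; call <= 3; call++) {
                if (cache_topup(&cache))
                        return 1;
                printf("CLEAR_LOG %d: %d objects ready\n", call, cache.nobjs);
                /*
                 * Pretend the eager split consumed two objects (freed here
                 * only to keep this model leak-free).
                 */
                free(cache_alloc(&cache));
                free(cache_alloc(&cache));
        }

        /* Freed once, when dirty logging is disabled for the memslot. */
        cache_free_all(&cache);
        return 0;
}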