On Tue, Nov 30, 2021 at 03:22:29PM -0800, David Matlack wrote:
> On Fri, Nov 26, 2021 at 6:13 AM Peter Xu <peterx@xxxxxxxxxx> wrote:
> >
> > Hi, David,
> >
> > On Fri, Nov 19, 2021 at 11:57:44PM +0000, David Matlack wrote:
> > > This series is a first pass at implementing Eager Page Splitting for the
> > > TDP MMU. For context on the motivation and design of Eager Page
> > > Splitting, please see the RFC design proposal and discussion [1].
> > >
> > > Paolo, I went ahead and added splitting in both the initially-all-set
> > > case (only splitting the region passed to CLEAR_DIRTY_LOG) and the
> > > case where we are not using initially-all-set (splitting the entire
> > > memslot when dirty logging is enabled) to give you an idea of what
> > > both look like.
> > >
> > > Note: I will be on vacation all of next week so I will not be able to
> > > respond to reviews until Monday November 29. I thought it would be
> > > useful to seed discussion and reviews with an early version of the code
> > > rather than putting it off another week. But feel free to also ignore
> > > this until I get back :)
> > >
> > > This series compiles and passes the most basic splitting test:
> > >
> > > $ ./dirty_log_perf_test -s anonymous_hugetlb_2mb -v 2 -i 4
> > >
> > > But please operate under the assumption that this code is probably
> > > buggy.
> > >
> > > [1] https://lore.kernel.org/kvm/CALzav=dV_U4r1K9oDq4esb4mpBQDQ2ROQ5zH5wV3KpOaZrRW-A@xxxxxxxxxxxxxx/#t
> >
> > Will there be more numbers to show in the formal patchset?
>
> Yes definitely. I didn't have a lot of time to test this series, hence
> the RFC status. I'll include more thorough testing and performance
> evaluation in the cover letter for v1.
>
> > It's interesting to know how "First Pass Dirty Memory Time" will change
> > compared to the RFC numbers; I can have a feel for it, but still. :)
> > Also, besides how much it speeds up dirty apps in the guest, some
> > general measurement of how much it slows down KVM_SET_USER_MEMORY_REGION
> > (!init-all-set) or CLEAR_LOG (init-all-set) would be even nicer (for
> > CLEAR, I guess the 1st and 2nd+ rounds will have different overheads).
> >
> > Besides that, I'm also wondering whether we should still have a knob
> > for it, in case the use case is one where eagerly splitting huge pages
> > may not help at all. What I'm thinking:
> >
> >   - Read-mostly guest workload: splitting huge pages will speed up the
> >     rare writes, but meanwhile drag readers down due to the huge->small
> >     page mappings.
> >
> >   - Writes-over-very-limited-region workload: say we have a 1T guest
> >     and the app in the guest only writes a 10G part of it.  Hmm, not
> >     sure whether it exists..
> >
> >   - Postcopy targeted: precopy may only run a few iterations just to
> >     send the static pages, so the migration duration will be relatively
> >     short, and the writes just don't spread a lot over the whole guest
> >     mem.
> >
> > I don't really think any of these examples is strong enough, as they're
> > all very corner cases, but they show why I wanted to raise the question
> > of whether unconditional eager splitting is the best approach.
>
> I'd be happy to add a knob if there's a userspace that wants to use
> it. I think the main challenge though is knowing when it is safe to
> disable eager splitting.

Isn't it a performance feature?  Why would it not be safe?

> For a small deployment where you know the VM workload, it might make
> sense. But for a public cloud provider the only feasible way would be
> to dynamically monitor the guest writing patterns. But then we're back
> at square one because that would require dirty logging. And even then,
> there's no guaranteed way to predict future guest write patterns based
> on past patterns.

Agreed.  What I was thinking of was not public cloud usage, but cases
where we can do specific tuning for specific scenarios.  It normally
won't matter much for small or medium sized VMs, only for extreme use
cases.
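To make it concrete, the knob I have in mind could be as simple as a
module parameter gating the eager walk.  Below is a rough sketch only,
with made-up names (eager_page_split, kvm_mmu_try_split_huge_pages)
rather than code from this series:

    #include <linux/module.h>

    /* Illustrative knob: allow disabling eager splitting at runtime. */
    static bool eager_page_split = true;
    module_param(eager_page_split, bool, 0644);

    static void kvm_mmu_try_split_huge_pages(struct kvm *kvm,
                                             const struct kvm_memory_slot *slot)
    {
            /*
             * With the knob off, skip the eager walk entirely and let
             * huge pages be split lazily on the first write fault, as
             * happens today.
             */
            if (!READ_ONCE(eager_page_split))
                    return;

            /* ... walk the slot and split huge SPTEs here ... */
    }

An admin could then flip it at runtime under /sys/module/kvm/parameters/
without any new userspace API, though a per-VM capability may fit better
if we ever want per-guest tuning.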
> The way forward here might be to do a hybrid of 2M and 4K dirty
> tracking (and maybe even 1G). For example, first start dirty logging
> at 2M granularity, and then log at 4K for any specific regions or
> memslots that aren't making progress. We'd still use Eager Page
> Splitting unconditionally though, first to split to 2M and then to
> split to 4K.

Do you mean we'd also offer a different-granularity dirty bitmap to
userspace too?

I remember you mentioned 2MB dirty tracking in your RFC series, but I
didn't expect it could be switched dynamically during tracking.  That
sounds like a very interesting idea.

Thanks,

-- 
Peter Xu