Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier

On Mon, Jun 17, 2024, James Houghton wrote:
> On Fri, Jun 14, 2024 at 4:17 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >
> > On Fri, Jun 14, 2024, James Houghton wrote:
> > > On Fri, Jun 14, 2024 at 9:13 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > > >
> > > > On Thu, Jun 13, 2024, James Houghton wrote:
> > > > > I wonder if this still makes sense if whether or not an MMU is "fast"
> > > > > is determined by how contended some lock(s) are at the time.
> > > >
> > > > No.  Just because a lock wasn't contended on the initial aging doesn't mean it
> > > > won't be contended on the next round.  E.g. when using KVM x86's shadow MMU, which
> > > > takes mmu_lock for write for all operations, an aging operation could get lucky
> > > > and sneak in while mmu_lock happened to be free, but then get stuck behind a large
> > > > queue of operations.
> > > >
> > > > The fast-ness needs to be predictable and all but guaranteed, i.e. lockless or in
> > > > an MMU that takes mmu_lock for read in all but the most rare paths.
> > >
> > > Aging and look-around themselves only use the fast-only notifiers, so
> > > they won't ever wait on a lock (well... provided KVM is written like
> > > that, which I think is a given).
> >
> > Regarding aging, is that actually the behavior that we want?  I thought the plan
> > is to have the initial test look at all MMUs, i.e. be potentially slow, but only
> > do the lookaround if it can be fast.  IIUC, that was Yu's intent (and peeking back
> > at v2, that is indeed the case, unless I'm misreading the code).
> 
> I believe what I said is correct. There are three separate things going on here:
> 
> 1. Aging (when we hit the low watermark, scan PTEs to find young pages)
> 2. Eviction (pick a page to evict; if it is definitely not young, evict it)
> 3. Look-around (upon finding a page is young upon attempted eviction,
> check adjacent pages if they are young too)

Ah, I now see the difference between #1 and #2, and your responses make a lot more
sense.  Thanks!
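
Just to write down my updated mental model (pure pseudocode; the helper names
below are illustrative, not the actual mm/ or notifier API):

        /* #1 aging: only the fast-only notifier is used; a contended/slow
         * secondary MMU simply means we learn nothing about the page. */
        if (mmu_notifier_test_clear_young_fast_only(page))
                mark_page_young(page);

        /* #2 eviction: consult every secondary MMU, fast or slow, so a page
         * that's hot only in e.g. an L2 behind nested TDP isn't swapped out. */
        ret = mmu_notifier_test_clear_young(page);
        if (ret == NOT_YOUNG)
                evict(page);
        else if (ret == YOUNG_AND_WAS_FAST)
                /* #3 look-around: only bother when the answer came cheaply. */
                check_nearby_pages(page);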

> > If KVM _never_ consults shadow (nested TDP) MMUs, then a VM running an L2 will
> > end up with hot pages (used by L2) swapped out.
> 
> The shadow MMU is consulted at eviction time -- only at eviction time.
> So pages used by L2 won't be swapped out unless they're still cold at
> eviction time.
> 
> In my (and Yu's) head, not being able to do aging for nested TDP is ok
> because running nested VMs is much more rare than running non-nested
> VMs. And in the non-nested case, being able to do aging is a strict
> improvement over what we have now.

Yes and no.  Running nested VMs is indeed rare when viewing them as a percentage
of all VMs in the fleet, but for many use cases, the primary workload of a VM is
to run nested VMs.  E.g. say x% of VMs in the fleet run nested VMs, where 'x' is
likely very small, but for those x% VMs, they run nested VMs 99% of the time
(completely made up number).

So yes, I completely agree that aging for non-nested VMs is a strict improvement,
but I also don't think we should completely dismiss nested VMs as a problem
not worth solving.

> We could look into being able to do aging with the shadow MMU, but I
> don't think that should necessarily block this series.

...

> > Ooh!  Actually, after fiddling a bit to see how feasible fast-aging in the shadow
> > MMU would be, I'm pretty sure we can go straight there for nested TDP.  Or rather,
> > I suspect/hope we can get close enough for an initial merge, which would allow
> > aging_is_fast to be a property of the mmu_notifier, i.e. would simplify things
> > because KVM wouldn't need to communicate MMU_NOTIFY_WAS_FAST for each notification.
> >
> > Walking KVM's rmaps requires mmu_lock because adding/removing rmap entries is done
> > in such a way that a lockless walk would be painfully complex.  But if there is
> > exactly _one_ rmap entry for a gfn, then slot->arch.rmap[...] points directly at
> > that one SPTE.  And with nested TDP, unless L1 is doing something uncommon, e.g.
> > mapping the same page into multiple L2s, the overwhelming majority of rmaps
> > have only one entry.  That's not the case for legacy shadow paging because kernels
> > almost always map a pfn using multiple virtual addresses, e.g. Linux's direct map
> > along with any userspace mappings.
 
...

> Hmm, interesting. I need to spend a little bit more time digesting this.
> 
> Would you like to see this included in v6? (It'd be nice to avoid the
> WAS_FAST stuff....) Should we leave it for a later series? I haven't
> formed my own opinion yet.

I would say it depends on the viability and complexity of my idea.  E.g. if it
pans out more or less like my rough sketch, then it's probably worth taking on
the extra code+complexity in KVM to avoid the whole WAS_FAST goo.
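
To make the single-entry case concrete, the core of what I have in mind is
something like the below (name made up, completely untested, and it hand-waves
away the hard part, i.e. keeping the SPTE from being freed out from under the
lockless walk; it also assumes A/D bits are in use):

static bool kvm_test_clear_age_fast_only(struct kvm_rmap_head *rmap_head)
{
        unsigned long rmap_val = READ_ONCE(rmap_head->val);
        u64 *sptep;

        /* No SPTEs for this gfn => definitely not young. */
        if (!rmap_val)
                return false;

        /*
         * Low bit set => val points at a pte_list_desc, i.e. the gfn has
         * multiple SPTEs.  Punt for now (more on that below).
         */
        if (rmap_val & 1)
                return false;

        /* Exactly one SPTE, and val points directly at it. */
        sptep = (u64 *)rmap_val;
        if (!(READ_ONCE(*sptep) & shadow_accessed_mask))
                return false;

        clear_bit(ffs(shadow_accessed_mask) - 1, (unsigned long *)sptep);
        return true;
}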

Note, if we do go this route, the implementation would need to be tweaked to
handle the difference in behavior between aging and last-minute checks for eviction,
which I obviously didn't understand when I threw together that hack-a-patch.

I need to think more about how best to handle that though, e.g. skipping GFNs with
multiple mappings is probably the worst possible behavior, as we'd risk evicting
hot pages.  But falling back to taking mmu_lock for write isn't all that desirable
either.
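
E.g. a strawman that splits the behavior on a fast_only flag from the caller
(again, the names are invented, and whether eating mmu_lock for write on the
eviction path is acceptable is exactly what I'm unsure about):

static bool kvm_test_clear_age_gfn(struct kvm *kvm,
                                   struct kvm_rmap_head *rmap_head,
                                   bool fast_only)
{
        bool young;

        /* Zero or one SPTE: the lockless fast path from above. */
        if (!(READ_ONCE(rmap_head->val) & 1))
                return kvm_test_clear_age_fast_only(rmap_head);

        /*
         * Multiple SPTEs.  For aging, learning nothing is tolerable; for the
         * last-minute eviction check, claiming "not young" risks evicting a
         * hot page, so pay for mmu_lock instead.
         */
        if (fast_only)
                return false;

        write_lock(&kvm->mmu_lock);
        young = kvm_rmap_test_age_locked(kvm, rmap_head); /* made-up name for today's locked walk */
        write_unlock(&kvm->mmu_lock);

        return young;
}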




