On Tue, Apr 19, 2022 at 11:36 AM Oliver Upton <oupton@xxxxxxxxxx> wrote:
>
> On Tue, Apr 19, 2022 at 10:57 AM Ben Gardon <bgardon@xxxxxxxxxx> wrote:
> >
> > On Fri, Apr 15, 2022 at 2:59 PM Oliver Upton <oupton@xxxxxxxxxx> wrote:
> > >
> > > Presently KVM only takes a read lock for stage 2 faults if it believes
> > > the fault can be fixed by relaxing permissions on a PTE (write unprotect
> > > for dirty logging). Otherwise, stage 2 faults grab the write lock, which
> > > predictably can pile up all the vCPUs in a sufficiently large VM.
> > >
> > > The x86 port of KVM has what it calls the TDP MMU. Basically, it is an
> > > MMU protected by the combination of a read-write lock and RCU, allowing
> > > page walkers to traverse in parallel.
> > >
> > > This series is strongly inspired by the mechanics of the TDP MMU,
> > > making use of RCU to protect parallel walks. Note that the TLB
> > > invalidation mechanics are a bit different between x86 and ARM, so we
> > > need to use the 'break-before-make' sequence to split/collapse a
> > > block/table mapping, respectively.
> > >
> > > Nonetheless, using atomics on the break side allows fault handlers to
> > > acquire exclusive access to a PTE (let's just call it 'locked'). Once the
> > > PTE lock is acquired it is then safe to assume exclusive access.
> > >
> > > Special consideration is required when pruning the page tables in
> > > parallel. Suppose we are collapsing a table into a block. Allowing
> > > parallel faults means that a software walker could be in the middle of
> > > a lower level traversal when the table is unlinked. Table
> > > walkers that prune the paging structures must now 'lock' all descendant
> > > PTEs, effectively asserting exclusive ownership of the substructure
> > > (no other walker can install something to an already locked PTE).
> > >
> > > Additionally, for parallel walks we need to punt the freeing of table
> > > pages to the next RCU sync, as there could be multiple observers of the
> > > table until all walkers exit the RCU critical section. For this I
> > > decided to cram an rcu_head into page private data for every table page.
> > > We wind up spending a bit more on table pages now, but lazily allocating
> > > for rcu callbacks probably doesn't make a lot of sense. Not only would
> > > we need a large cache of them (think about installing a level 1 block)
> > > to wire up callbacks on all descendant tables, but we also then need to
> > > spend memory to actually free memory.
> >
> > FWIW we used a similar approach in early versions of the TDP MMU, but
> > instead of page->private used page->lru so that more metadata could be
> > stored in page->private.
> > Ultimately that ended up being too limiting and we decided to switch
> > to just using the associated struct kvm_mmu_page as the list element.
> > I don't know if ARM has an equivalent construct though.
>
> ARM currently doesn't have any metadata it needs to tie with the table
> pages. We just do very basic page reference counting for every valid
> PTE. I was going to link together pages (hence the page header), but
> we actually do not have a functional need for it at the moment. In
> fact, struct page::rcu_head would probably fit the bill and we can
> avoid extra metadata/memory for the time being.

Ah, right! page::rcu_head was the field I was thinking of.

>
> Perhaps best to keep it simple and do the rest when we have a genuine
> need for it.

Completely agree.
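
To make that concrete, deferring the free of an unlinked table page
through struct page::rcu_head could look roughly like the sketch below.
The helper names are made up for illustration; they are not the ones the
series actually defines:

static void stage2_free_unlinked_table_rcu_cb(struct rcu_head *head)
{
        /* rcu_head shares space with other fields in struct page */
        struct page *page = container_of(head, struct page, rcu_head);

        __free_page(page);
}

/* Called once the table page has been unlinked via break-before-make */
static void stage2_defer_free_unlinked_table(struct page *page)
{
        call_rcu(&page->rcu_head, stage2_free_unlinked_table_rcu_cb);
}

Walkers still inside their RCU read-side critical section can keep
dereferencing the old table, and the page only goes back to the
allocator after the grace period, with no extra per-page metadata.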

I'm surprised that ARM doesn't have a need for a metadata structure
associated with each page of the stage 2 paging structure, but if you
don't need it, that definitely makes things simpler.

>
> --
> Thanks,
> Oliver
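
(As an aside for anyone following along: the 'locked' PTE idea from the
cover letter boils down to winning an atomic compare-and-exchange on the
entry. A rough sketch, using an invented bit and helper name rather than
whatever encoding the series settles on:

#define STAGE2_PTE_LOCKED       BIT(10) /* hypothetical reserved, invalid encoding */

/*
 * The walker that wins the cmpxchg owns the entry and may carry out
 * break-before-make; a loser backs off and retries its walk, and no
 * walker can install anything over a locked entry.
 */
static bool stage2_try_lock_pte(kvm_pte_t *ptep, kvm_pte_t old)
{
        return cmpxchg(ptep, old, STAGE2_PTE_LOCKED) == old;
}

Collapsing a table into a block then amounts to locking every descendant
PTE this way before unlinking and freeing the table pages.)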