Re: [RFC PATCH 06/17] KVM: arm64: Implement break-before-make sequence for parallel walks

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Apr 21, 2022 at 11:52 AM Oliver Upton <oupton@xxxxxxxxxx> wrote:
>
> On Thu, Apr 21, 2022 at 09:57:32AM -0700, Ben Gardon wrote:
> > On Fri, Apr 15, 2022 at 2:59 PM Oliver Upton <oupton@xxxxxxxxxx> wrote:
> > >
> > > The ARM architecture requires that software use the 'break-before-make'
> > > sequence whenever memory is being remapped. An additional requirement of
> > > parallel page walks is a mechanism to ensure exclusive access to a pte,
> > > thereby avoiding two threads changing the pte and invariably stomping on
> > > one another.
> > >
> > > Roll the two concepts together into a new helper to implement the
> > > 'break' sequence. Use a special invalid pte value to indicate that the
> > > pte is under the exclusive control of a thread. If software walkers are
> > > traversing the tables in parallel, use an atomic compare-exchange to
> > > break the pte. Retry execution on a failed attempt to break the pte, in
> > > the hopes that either the instruction will succeed or the pte lock will
> > > be successfully acquired.
> > >
> > > Avoid unnecessary DSBs and TLBIs by only completing the sequence if the
> > > evicted pte was valid. For counted non-table ptes drop the reference
> > > immediately. Otherwise, references on tables are dropped in post-order
> > > traversal as the walker must recurse on the pruned subtree.
> > >
> > > All of the new atomics do nothing (for now), as there are a few other
> > > bits of the map walker that need to be addressed before actually walking
> > > in parallel.
> >
> > I want to make sure I understand the make before break / PTE locking
> > patterns here.
> > Please check my understanding of the following cases:
> >
> > Case 1: Change a leaf PTE (for some reason)
> > 1. Traverse the page table to the leaf
> > 2. Invalidate the leaf PTE, replacing it with a locked PTE
> > 3. Flush TLBs
> > 4. Replace the locked PTE with the new value
> >
> > In this case, no need to lock the parent SPTEs, right? This is pretty simple.
>
> Right, if we're changing the OA of a leaf PTE. If we are just adjusting
> attributes on a leaf we go through stage2_attr_walker(), which skips
> step 2 and does the rest in this order: 1, 4, 3.
>
> > Case 2:  Drop a page table
> > 1. Traverse to some non-leaf PTE
> > 2. Lock the PTE, invalidating it
> > 3. Recurse into the child page table
> > 4. Lock the PTEs in the child page table. (We need to lock ALL the
> > PTEs here right? I don't think we'd get away with locking only the
> > valid ones)
>
> Right. We can just skip some of the TLBI/DSB dance when making an
> invalid->invalid transition.
>
> > 5. Flush TLBs
> > 6. Unlock the PTE from 2
> > 7. Free the child page after an RCU grace period (via callback)
> >
> > Case 3: Drop a range of leaf PTEs
> > 1. Traverse the page table to the first leaf
> > 2. For each leaf in the range:
> >         a. Invalidate the leaf PTE, replacing it with a locked PTE
> > 3. Flush TLBs
> > 4. unlock the locked PTEs
> >
> > In this case we have to lock ALL PTEs in the range too, right? My
> > worry about the whole locking scheme is making sure each thread
> > correctly remembers which PTEs it locked versus which might have been
> > locked by other threads.
>
> Isn't exclusivity accomplished by checking what you get back from the
> xchg()? If I get a locked PTE back, some other thread owns the PTE. If I
> get anything else, then I've taken ownership of that PTE.

That's true if you only modify one PTE at a time, but if you want to
batch flushes by:
1. Locking a bunch of PTEs
2. TLB invalidate
3. Set them to some new value (e.g. 0)
Then you need to track which ones you locked. If you locked an entire
contiguous region, that works, but you need some way to ensure you
don't think you locked a pte locked by another thread.

>
> > On x86 we solved this by only locking one SPTE at a time, flushing,
> > then fixing it, but if you're locking a bunch at once it might get
> > complicated.
> > Making this locking scheme work without demolishing performance seems hard.
>
> We only change at most a single active PTE per fault on the stage 2 MMU.
> We do one of three things on that path:
>
>  1. Install a page/block PTE to an empty PTE
>  2. Replace a table PTE with a block PTE
>  3. Replace a block PTE with a table PTE
>
> 1 is pretty cheap and can skip flushes altogether.
>
> 2 only requires a single TLBI (a big, painful flush of the stage 2 context),
> but child PTEs needn't be flushed.
>
> 3 also requires a single TLBI, but can be done with an IPA and level
> hint.
>
> Perhaps the answer is to push teardown into the rcu callback altogether,
> IOW don't mess with links in the subtree until then. At that point
> there's no need for TLBIs nor atomics.
>
> --
> Thanks,
> Oliver



[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux