Re: [PATCH v2 7/7] KVM: arm64: Create a fast stage-2 unmap path

Raghavendra Rao Ananta <rananta@xxxxxxxxxx> · Tue, 4 Apr 2023 10:52:01 -0700



On Wed, Mar 29, 2023 at 5:42 PM Oliver Upton <oliver.upton@xxxxxxxxx> wrote:
>
> On Mon, Feb 06, 2023 at 05:23:40PM +0000, Raghavendra Rao Ananta wrote:
> > The current implementation of the stage-2 unmap walker
> > traverses the entire page-table to clear and flush the TLBs
> > for each entry. This could be very expensive, especially if
> > the VM is not backed by hugepages. The unmap operation could be
> > made efficient by disconnecting the table at the very
> > top (level at which the largest block mapping can be hosted)
> > and do the rest of the unmapping using free_removed_table().
> > If the system supports FEAT_TLBIRANGE, flush the entire range
> > that has been disconnected from the rest of the page-table.
> >
> > Suggested-by: Ricardo Koller <ricarkol@xxxxxxxxxx>
> > Signed-off-by: Raghavendra Rao Ananta <rananta@xxxxxxxxxx>
> > ---
> >  arch/arm64/kvm/hyp/pgtable.c | 44 ++++++++++++++++++++++++++++++++++++
> >  1 file changed, 44 insertions(+)
> >
> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > index 0858d1fa85d6b..af3729d0971f2 100644
> > --- a/arch/arm64/kvm/hyp/pgtable.c
> > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > @@ -1017,6 +1017,49 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> >       return 0;
> >  }
> >
> > +/*
> > + * The fast walker executes only if the unmap size is exactly equal to the
> > + * largest block mapping supported (i.e. at KVM_PGTABLE_MIN_BLOCK_LEVEL),
> > + * such that the underneath hierarchy at KVM_PGTABLE_MIN_BLOCK_LEVEL can
> > + * be disconnected from the rest of the page-table without the need to
> > + * traverse all the PTEs, at all the levels, and unmap each and every one
> > + * of them. The disconnected table is freed using free_removed_table().
> > + */
> > +static int fast_stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > +                            enum kvm_pgtable_walk_flags visit)
> > +{
> > +     struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
> > +     kvm_pte_t *childp = kvm_pte_follow(ctx->old, mm_ops);
> > +     struct kvm_s2_mmu *mmu = ctx->arg;
> > +
> > +     if (!kvm_pte_valid(ctx->old) || ctx->level != KVM_PGTABLE_MIN_BLOCK_LEVEL)
> > +             return 0;
> > +
> > +     if (!stage2_try_break_pte(ctx, mmu))
> > +             return -EAGAIN;
> > +
> > +     /*
> > +      * Gain back a reference for stage2_unmap_walker() to free
> > +      * this table entry from KVM_PGTABLE_MIN_BLOCK_LEVEL - 1.
> > +      */
> > +     mm_ops->get_page(ctx->ptep);
>
> Doesn't this run the risk of a potential UAF if the refcount was 1 before
> calling stage2_try_break_pte()? IOW, stage2_try_break_pte() will drop
> the refcount to 0 on the page before this ever gets called.
>
> Also, AFAICT this misses the CMOs that are required on systems w/o
> FEAT_FWB. Without them it is possible that the host will read something
> other than what was most recently written by the guest if it is using
> noncacheable memory attributes at stage-1.
>
> I imagine the actual bottleneck is the DSB required after every
> CMO/TLBI. Theoretically, the unmap path could be updated to:
>
>  - Perform the appropriate CMOs for every valid leaf entry *without*
>    issuing a DSB.
>
>  - Elide TLBIs entirely that take place in the middle of the walk
>
>  - After the walk completes, dsb(ish) to guarantee that the CMOs have
>    completed and the invalid PTEs are made visible to the hardware
>    walkers. This should be done implicitly by the TLBI implementation
>
>  - Invalidate the [addr, addr + size) range of IPAs
>
> This would also avoid over-invalidating stage-1 since we blast the
> entire stage-1 context for every stage-2 invalidation. Thoughts?
>
Correct me if I'm wrong, but if we invalidate the TLB after the walk
is complete, don't you think there's a risk of race if the guest can
hit in the TLB even though the page was unmapped?

Thanks,
Raghavendra

Raghavendra
> > +     mm_ops->free_removed_table(childp, ctx->level);
> > +     return 0;
> > +}
> > +
>
> --
> Thanks,
> Oliver