On Thu, Feb 9, 2023 at 2:48 PM Gavin Shan <gshan@xxxxxxxxxx> wrote:
>
> On 2/10/23 3:17 AM, Ricardo Koller wrote:
> >>>> +	if (data->mc_capacity < nr_pages)
> >>>> +		return -ENOMEM;
> >>>> +
> >>>> +	phys = kvm_pte_to_phys(pte);
> >>>> +	prot = kvm_pgtable_stage2_pte_prot(pte);
> >>>> +
> >>>> +	ret = kvm_pgtable_stage2_create_unlinked(data->mmu->pgt, &new, phys,
> >>>> +						 level, prot, mc, force_pte);
> >>>> +	if (ret)
> >>>> +		return ret;
> >>>> +
> >>>> +	if (!stage2_try_break_pte(ctx, data->mmu)) {
> >>>> +		childp = kvm_pte_follow(new, mm_ops);
> >>>> +		kvm_pgtable_stage2_free_unlinked(mm_ops, childp, level);
> >>>> +		mm_ops->put_page(childp);
> >>>> +		return -EAGAIN;
> >>>> +	}
> >>>> +
> >>>> +	/*
> >>>> +	 * Note, the contents of the page table are guaranteed to be made
> >>>> +	 * visible before the new PTE is assigned because stage2_make_pte()
> >>>> +	 * writes the PTE using smp_store_release().
> >>>> +	 */
> >>>> +	stage2_make_pte(ctx, new);
> >>>> +	dsb(ishst);
> >>>> +	data->mc_capacity -= nr_pages;
> >>>> +	return 0;
> >>>> +}
> >>>> +
> >>>
> >>> I think it's possible for 'data->mc_capacity' to be replaced by 'mc->nobjs'
> >>> because they're the same thing. With this, we needn't maintain a duplicate
> >>> 'data->mc_capacity' since 'data->mc' already exists.
> >>
> >> Ah, nice, yes. That would be simpler.
> >>
> >
> > Actually, there's a complication. The memcache details are hidden
> > inside of pgtable.c, so different types of memcaches (for VHE and nVHE)
> > can be used for allocations. Specifically, the memcache objects are
> > passed as an opaque pointer ("void *") and can be used with both
> > "struct hyp_pool" and "struct kvm_mmu_memory_cache".
> >
> > So, here are all the options that I can think of:
> >
> > 1.
> > stage2_split_walker() is just used in the VHE case with the
> > "struct kvm_mmu_memory_cache" memcache, so we could just use that
> > type instead of a "void *":
> >
> > 	kvm_pgtable_stage2_split(..., struct kvm_mmu_memory_cache *mc);
> >
> > However, it could be used for the nVHE case as well, plus
> > this would go against the overall design of pgtable.c, which tries
> > to use opaque objects for most things.
> >
> > 2. Add a "get_nobjs()" method to both memcaches. This is tricky
> > because "struct hyp_pool" doesn't directly track its capacity. I
> > would rather not mess with it.
> >
> > 3. This whole accounting of available pages in the memcache is
> > needed because of the way I implemented stage2_split_walker() and
> > the memcache interface. stage2_split_walker() tries to allocate
> > as many pages for the new table as allowed by the capacity of the
> > memcache. The issue with blindly trying until the allocation
> > fails is that kvm_mmu_memory_cache_alloc() WARNs and tries to
> > allocate using GFP_ATOMIC when !nobjs. We don't want to do that,
> > so we could extend kvm_pgtable_mm_ops.zalloc_page() with a
> > NO_GFP_ATOMIC_ON_EMPTY flag (or similar). This flag would have to be
> > ignored on the hyp side.
> >
> > 4. What this patch is currently doing: tracking the capacity by
> > hand.
> >
> > I prefer options 4 and 3. WDYT?
> >
>
> Yeah, stage2_split_walker() is currently only used by VHE and it calls
> stage2_map_walker() to create the unlinked page table, which is shared by
> VHE and nVHE. I think option 3 would be better than 4 because we generally
> just want to fetch the pre-allocated page table, instead of allocating a
> new page table with GFP_ATOMIC. However, I'm also fine with option 4.

Going with option 4 then for version 3.

> I think this can be improved in the future if you agree.

Definitely!
Thanks,
Ricardo

> >>
> >>>
> >>>> +int kvm_pgtable_stage2_split(struct kvm_pgtable *pgt, u64 addr, u64 size,
> >>>> +			     void *mc, u64 mc_capacity)
> >>>> +{
> >>>> +	struct stage2_split_data split_data = {
> >>>> +		.mmu		= pgt->mmu,
> >>>> +		.memcache	= mc,
> >>>> +		.mc_capacity	= mc_capacity,
> >>>> +	};
> >>>> +
> >>>> +	struct kvm_pgtable_walker walker = {
> >>>> +		.cb	= stage2_split_walker,
> >>>> +		.flags	= KVM_PGTABLE_WALK_LEAF,
> >>>> +		.arg	= &split_data,
> >>>> +	};
> >>>> +
> >>>> +	return kvm_pgtable_walk(pgt, addr, size, &walker);
> >>>> +}
> >>>> +
> >>>>  int __kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2_mmu *mmu,
> >>>>  			      struct kvm_pgtable_mm_ops *mm_ops,
> >>>>  			      enum kvm_pgtable_stage2_flags flags,
> >>>>
> >>>
>
> Thanks,
> Gavin
>