On Tue, 05 Mar 2024 13:29:08 +0000,
Ganapatrao Kulkarni <gankulkarni@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>
> On 05-03-2024 04:43 pm, Marc Zyngier wrote:
> > [re-sending with kvmarm@ fixed]
> >
> > On Tue, 05 Mar 2024 05:46:06 +0000,
> > Ganapatrao Kulkarni <gankulkarni@xxxxxxxxxxxxxxxxxxxxxx> wrote:
> >>
> >> As per 'commit 178a6915434c ("KVM: arm64: nv: Unmap/flush shadow stage 2
> >
> > $ git describe --contains 178a6915434c --match=v\*
> > fatal: cannot describe '178a6915434c141edefd116b8da3d55555ea3e63'
> >
>
> My bad (I should have been more verbose), I forgot to mention that this
> patch is on top of the NV-V11 patch series.
>
> > This commit simply doesn't exist upstream. It only lives in a
> > now deprecated branch that will never be merged.
> >
> >> page tables")', whenever there is an unmap of pages that
> >> are mapped to L1, they are invalidated from both the L1 S2-MMU and
> >> from all the active shadow/L2 S2-MMU tables. Since there is no
> >> mapping from a page back to the shadow S2 IPAs that need
> >> invalidating, a complete S2-MMU page table walk and invalidation is
> >> done, covering the complete address space allocated to an L2. This
> >> has performance impacts and even causes soft lockups for NV (L1 and
> >> L2) boots with a higher number of CPUs and large memory.
> >>
> >> Add a lookup table mapping a Shadow IPA to a Canonical IPA whenever
> >> a page is mapped to any of the L2s. When any page is unmapped, this
> >> lookup helps unmap it only if it is mapped in any of the shadow
> >> S2-MMU tables. This avoids unnecessarily long iterations of S2-MMU
> >> table walk-through and invalidation over the complete address space.
> >
> > All of this falls in the "premature optimisation" bucket. Why should
> > we bother with any of this when not even 'AT S1' works correctly,
>
> Hmm, I am not aware of this. Is this a new issue in V11?

It's been there since v0. All we have is a trivial implementation that
doesn't survive the S1 page tables being swapped out. It requires a
full S1 PTW to be written.

>
> > making it trivial to prevent a guest from making forward progress? You
> > also show no numbers that would hint at a measurable improvement under
> > any particular workload.
>
> This patch avoids the long iterations of unmap which were resulting in
> soft lockups when L1 and L2 were tried with 192 cores.
> Isn't fixing a soft lockup a required fix for feature enablement?

No. All we care about is correctness, not performance. Addressing
soft lockups is *definitely* a performance issue, which I'm 100% happy
to ignore.

[...]

> >> +static inline bool kvm_is_l1_using_shadow_s2(struct kvm_vcpu *vcpu)
> >> +{
> >> +	return (vcpu->arch.hw_mmu != &vcpu->kvm->arch.mmu);
> >> +}
> >
> > Isn't that the very definition of "!in_hyp_ctxt()"? You are abusing
>
> Isn't "!in_hyp_ctxt()" also true for the non-NV case?

Surely you don't try to use this in non-NV contexts, right? Why would
you try to populate a shadow reverse-map outside of an NV context?

> This function was added to know that L1 is NV enabled and using a
> shadow S2.
>
> > the hw_mmu pointer to derive something, but the source of truth is the
> > translation regime, as defined by HCR_EL2.{E2H,TGE} and PSTATE.M.
> >
>
> OK, I can try HCR_EL2.{E2H,TGE} and PSTATE.M instead of hw_mmu in the
> next version.

No. Use is_hyp_ctxt().

[...]
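For readers following the thread: a minimal sketch of what deriving the
context from the translation regime (rather than from the hw_mmu
pointer) can look like. This is a simplification for illustration only;
the authoritative helper is is_hyp_ctxt() in
arch/arm64/include/asm/kvm_emulate.h, and the exact shape below is an
assumption, not code quoted from the series:

	/*
	 * Illustrative sketch of a hyp-context check based on the
	 * translation regime (HCR_EL2.{E2H,TGE} and PSTATE.M). The
	 * vcpu is in the EL2(&0) regime, and thus backed by the
	 * canonical S2, if it runs at virtual EL2, or at EL0 with
	 * both E2H and TGE set (host userspace of the L1 hypervisor).
	 */
	static bool vcpu_in_hyp_ctxt(const struct kvm_vcpu *vcpu)
	{
		u64 hcr = __vcpu_sys_reg(vcpu, HCR_EL2);
		unsigned long mode = *vcpu_cpsr(vcpu) & PSR_MODE_MASK;

		/* PSTATE.M says we are at (virtual) EL2 */
		if (mode == PSR_MODE_EL2h || mode == PSR_MODE_EL2t)
			return true;

		/* {E2H,TGE} = {1,1}: "InHost" in ARM ARM terms */
		return (hcr & HCR_E2H) && (hcr & HCR_TGE);
	}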
> >> index 61bdd8798f83..3948681426a0 100644
> >> --- a/arch/arm64/kvm/mmu.c
> >> +++ b/arch/arm64/kvm/mmu.c
> >> @@ -1695,6 +1695,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >>  					     memcache,
> >>  					     KVM_PGTABLE_WALK_HANDLE_FAULT |
> >>  					     KVM_PGTABLE_WALK_SHARED);
> >> +	if ((nested || kvm_is_l1_using_shadow_s2(vcpu)) && !ret) {
> >
> > I don't understand this condition. If nested is non-NULL, it's because
> > we're using a shadow S2. So why the additional condition?
>
> No, nested is set only for L2; for L1 it is not.
> To handle the L1 shadow S2 case, I have added this condition.

But there is *no shadow* for L1 at all. The only way to get a shadow
is to be outside of the EL2(&0) translation regime. EL2(&0) itself is
always backed by the canonical S2. By definition, L1 does not run with
an S2 it is in control of. No S2, no shadow.

[...]

> > What guarantees that the mapping you have for L1 has the same starting
> > address as the one you have for L2? L1 could have a 2MB mapping and L2
> > only 4kB *in the middle*.
>
> IIUC, when a page is mapped as 2MB in L1, it won't be mapped to L2,
> and since we iterate in steps of PAGE_SIZE, we should be hitting the
> L2's IPA in the lookup table, provided the L2 page falls in the unmap
> range.

But then how do you handle the reverse (4kB at L1, 2MB at L2)?
Without tracking of the *intersection*, this fails to be correctly
detected. This is TLB matching 101.

[...]

> >> +			while (start < end) {
> >> +				size = PAGE_SIZE;
> >> +				/*
> >> +				 * Get the Shadow IPA if the page is mapped
> >> +				 * to L1 and also mapped to any active L2.
> >> +				 */
> >
> > Why is L1 relevant here?
>
> We do map while L1 boots (early stage) in the shadow S2; at that
> moment, if the L1-mapped page is unmapped/migrated, we do need to
> unmap it from L1's S2 table also.

Sure. But you can also get a page that is mapped in L2 and not mapped
in the canonical S2, which is L1's. I feel more and more that you have
a certain misconception of how L1 gets its pages mapped.

>
> >
> >> +				ret = get_shadow_ipa(mmu, start, &shadow_ipa, &size);
> >> +				if (ret)
> >> +					kvm_unmap_stage2_range(mmu, shadow_ipa, size);
> >> +				start += size;
> >> +			}
> >> +		}
> >> +	}
> >> +}
> >> +
> >>  /* expects kvm->mmu_lock to be held */
> >>  void kvm_nested_s2_flush(struct kvm *kvm)
> >>  {
> >
> > There are a bunch of worrying issues with this patch. But more
> > importantly, this looks like a waste of effort until the core issues
> > that NV still has are solved, and I will not consider anything of the
> > sort until then.
>
> OK, thanks for letting us know. I will pause the work on V2 of this
> patch until then.
>
> >
> > I get the ugly feeling that you are trying to make it look as if it
> > was "production ready", which it won't be for another few years,
> > especially if the few interested people (such as you) are ignoring the
> > core issues in favour of marketing-driven features ("make it fast").
> >
>
> What are the core issues (please forgive me if you mentioned them
> already)? We will certainly prioritise them over this.

AT is a big one. Maintenance interrupts are more or less broken. I'm
slowly plugging PAuth, but there's no testing whatsoever (running
Linux doesn't count). Lack of SVE support is also definitely a
blocker.

Thanks,

	M.

--
Without deviation from the norm, progress is not possible.
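For illustration of the "intersection" point raised above: the problem
is classic range matching, where the L1 and L2 mappings may use
different block sizes and overlap only partially. A minimal,
hypothetical sketch of clamping an unmap range against one recorded
shadow mapping, so that only the overlap is ever invalidated, could
look like the following. The struct and helper names here are made up
for the sketch; only kvm_unmap_stage2_range() comes from the patch
under discussion:

	/*
	 * Hypothetical sketch, not from the series: clamp a
	 * canonical-IPA unmap range [start, end) against one recorded
	 * shadow mapping and unmap only the overlap, translated into
	 * shadow-IPA space. This handles both "2MB at L1, 4kB at L2"
	 * and the reverse, since only the intersection is invalidated.
	 */
	struct shadow_map_entry {
		phys_addr_t canonical_ipa;	/* base in L1's (canonical) S2 */
		phys_addr_t shadow_ipa;		/* base in the shadow S2 */
		size_t size;			/* size of the shadow mapping */
	};

	static void unmap_shadow_intersection(struct kvm_s2_mmu *mmu,
					      struct shadow_map_entry *e,
					      phys_addr_t start, phys_addr_t end)
	{
		phys_addr_t m_start = e->canonical_ipa;
		phys_addr_t m_end = e->canonical_ipa + e->size;
		phys_addr_t i_start = max(start, m_start);
		phys_addr_t i_end = min(end, m_end);

		if (i_start >= i_end)
			return;		/* no overlap with this entry */

		/* Translate the overlap into shadow-IPA space and unmap it */
		kvm_unmap_stage2_range(mmu, e->shadow_ipa + (i_start - m_start),
				       i_end - i_start);
	}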