On Wed, 31 Jan 2024 09:39:34 +0000,
Ganapatrao Kulkarni <gankulkarni@xxxxxxxxxxxxxxxxxxxxxx> wrote:
> 
> 
> Hi Marc,
> 
> On 25-01-2024 02:28 pm, Marc Zyngier wrote:
> > On Thu, 25 Jan 2024 08:14:32 +0000,
> > Ganapatrao Kulkarni <gankulkarni@xxxxxxxxxxxxxxxxxxxxxx> wrote:
> >>
> >>
> >> Hi Marc,
> >>
> >> On 23-01-2024 07:56 pm, Marc Zyngier wrote:
> >>> Hi Ganapatrao,
> >>>
> >>> On Tue, 23 Jan 2024 09:55:32 +0000,
> >>> Ganapatrao Kulkarni <gankulkarni@xxxxxxxxxxxxxxxxxxxxxx> wrote:
> >>>>
> >>>> Hi Marc,
> >>>>
> >>>>> +void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu)
> >>>>> +{
> >>>>> +	if (is_hyp_ctxt(vcpu)) {
> >>>>> +		vcpu->arch.hw_mmu = &vcpu->kvm->arch.mmu;
> >>>>> +	} else {
> >>>>> +		write_lock(&vcpu->kvm->mmu_lock);
> >>>>> +		vcpu->arch.hw_mmu = get_s2_mmu_nested(vcpu);
> >>>>> +		write_unlock(&vcpu->kvm->mmu_lock);
> >>>>> +	}
> >>>>
> >>>> Due to a race, an mmu table for a non-existent L2 is getting
> >>>> loaded on some of the vCPUs while booting L1 (noticed when
> >>>> booting L1 with a large number of vCPUs). This is happening
> >>>> because, at the early stage, E2H (hyp-context) is not set, so
> >>>> the trap on ERET of the L1 boot-strap code results in a context
> >>>> switch as if we were returning to L2 (guest enter), loading an
> >>>> uninitialized mmu table on those vCPUs and resulting in
> >>>> unrecoverable traps and aborts.
> >>>
> >>> I'm not sure I understand the problem you're describing here.
> >>>
> >>
> >> IIUC, when the S2 fault happens, the faulting vCPU gets the pages
> >> from the QEMU process, maps them in S2, and copies the code to
> >> the allocated memory. Meanwhile, the other vCPUs racing to come
> >> online switch over to the dummy S2, find the mapping, and return
> >> to L1; subsequent execution does not fault, but instead fetches
> >> from memory where no code exists yet (for some of them),
> >> generating a stage 1 instruction abort and jumping to the abort
> >> handler, where no code exists either, and so keeps aborting. This
> >> is happening on random vCPUs (no pattern).
> >
> > Why is that any different from the way we handle faults in the
> > non-nested case? If there is a case where we can map the PTE at S2
> > before the data is available, this is a generic bug that can
> > trigger irrespective of NV.
> >
> >>
> >>> What is the race exactly? Why isn't the shadow S2 good enough?
> >>> Not having HCR_EL2.VM set doesn't mean we can use the same S2,
> >>> as the TLBs are tagged by a different VMID, so staying on the
> >>> canonical S2 seems wrong.
> >>
> >> IMO, it is unnecessary to switch over on the first ERET while L1
> >> is booting and to repeat the faults and page allocations, which
> >> are anyway dummy once L1 switches to E2H.
> >
> > It is mandated by the architecture. EL1 is, by definition, a
> > different translation regime from EL2. So we *must* have a
> > different S2, because that defines the boundaries of TLB creation
> > and invalidation. The fact that these are the same pages is
> > totally irrelevant.
> >
> >> Let L1 always use its S2, which is created by L0. We should even
> >> consider skipping the entry created for L1 (the first entry in
> >> the array of S2-MMUs) and avoid unnecessary iteration/lookup
> >> while unmapping NestedVMs.
> >
> > I'm sorry, but this is just wrong. You are merging the EL1 and EL2
> > translation regimes, which is not acceptable.
> >
> >> I am anticipating that this unwanted switch-over won't happen
> >> when we have NV2-only support in V12?
> >
> > V11 is already NV2 only, so I really don't get what you mean here.
> > Everything stays the same, and there is nothing to change here.
> 
> I am still using V10, since V11 (and also V12/nv-6.9-sr-enforcement)
> has issues booting with QEMU.

Let's be clear: I have no interest in reports against a version that
is older than the current one. If you still use V10, then
congratulations, you are the maintainer of that version.

> Tried V11 with my local branch of QEMU, which is 7.2 based, and also
> with Eric's QEMU [1], which is rebased on 8.2. The issue is that
> QEMU crashes at the very beginning. Not sure about the cause; yet to
> debug.
> 
> [1] https://github.com/eauger/qemu/tree/v8.2-nv

I have already reported that QEMU was doing some horrible things
behind the kernel's back, and I don't think it is working correctly.

> > What you describe looks like a terrible bug somewhere on the
> > page-fault path that has the potential to impact non-NV, and I'd
> > like to focus on that.
> 
> I found the bug/issue and fixed it. The problem was quite random and
> was happening when booting L1 with a large number of cores (200 to
> 300+).
> 
> I have implemented (yet to be sent to the ML for review) a fix for
> the performance issue [2] caused by the unmapping of shadow tables:
> a lookup table that allows unmapping only the mapped shadow IPAs,
> instead of unmapping the complete shadow S2 of all active NestedVMs.

Again, this is irrelevant:

- you develop against an unmaintained version

- you waste time prematurely optimising code that is clearly
  advertised as throw-away

> This lookup table was not recording the mappings created for L1 when
> it is using the shadow S2-MMU (my bad, I missed that L1 hops between
> vEL2 and EL1 at the booting stage); hence, when there is a page
> migration, the unmap was not getting done for those pages, resulting
> in access to stale pages/memory by some of the vCPUs of L1.
> 
> I have modified the check performed while adding a shadow-IPA to PA
> mapping to the lookup table, so that it checks whether the page is
> getting mapped for a NestedVM or for L1 while it is using the shadow
> S2; see the sketch below.
> 
> [2] https://www.spinics.net/lists/kvm/msg326638.html
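> 
> To illustrate the idea, here is a minimal sketch (hypothetical and
> untested, with locking elided; the structure, field, and helper
> names below are made up for this mail and are not the actual names
> from my patch):
> 
> 	/*
> 	 * One entry per shadow-S2 mapping, so that teardown can walk
> 	 * only the IPAs that were actually mapped instead of doing a
> 	 * full-range unmap of every shadow S2.
> 	 */
> 	struct shadow_ipa_map {
> 		struct list_head	list;
> 		phys_addr_t		ipa;	/* shadow IPA mapped */
> 		u64			size;	/* size of the mapping */
> 	};
> 
> 	static void track_shadow_ipa(struct kvm_s2_mmu *mmu,
> 				     phys_addr_t ipa, u64 size)
> 	{
> 		struct shadow_ipa_map *map;
> 
> 		map = kmalloc(sizeof(*map), GFP_KERNEL);
> 		if (!map)
> 			return;
> 
> 		map->ipa = ipa;
> 		map->size = size;
> 		/* shadow_ipa_list is a hypothetical per-MMU list head */
> 		list_add(&map->list, &mmu->shadow_ipa_list);
> 	}
> 
> The fix is in the condition deciding when to record a mapping: it
> must trigger whenever the vCPU runs on a shadow S2, i.e. not only
> for a NestedVM, but also for L1 itself while it hops through EL1
> during boot:
> 
> 	/* Record the mapping whenever a shadow S2 is in use. */
> 	if (vcpu->arch.hw_mmu != &vcpu->kvm->arch.mmu)
> 		track_shadow_ipa(vcpu->arch.hw_mmu, ipa, size);
> 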
Do I read it correctly that I wasted hours trying to reproduce
something that only exists on an obsolete series together with
private patches?

	M.

-- 
Without deviation from the norm, progress is not possible.