On Thu, 25 Jan 2024 08:14:32 +0000,
Ganapatrao Kulkarni <gankulkarni@xxxxxxxxxxxxxxxxxxxxxx> wrote:
> 
> Hi Marc,
> 
> On 23-01-2024 07:56 pm, Marc Zyngier wrote:
> > Hi Ganapatrao,
> > 
> > On Tue, 23 Jan 2024 09:55:32 +0000,
> > Ganapatrao Kulkarni <gankulkarni@xxxxxxxxxxxxxxxxxxxxxx> wrote:
> >> 
> >> Hi Marc,
> >> 
> >>> +void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu)
> >>> +{
> >>> +	if (is_hyp_ctxt(vcpu)) {
> >>> +		vcpu->arch.hw_mmu = &vcpu->kvm->arch.mmu;
> >>> +	} else {
> >>> +		write_lock(&vcpu->kvm->mmu_lock);
> >>> +		vcpu->arch.hw_mmu = get_s2_mmu_nested(vcpu);
> >>> +		write_unlock(&vcpu->kvm->mmu_lock);
> >>> +	}
> >> 
> >> Due to a race, an MMU table belonging to a non-existent L2 gets
> >> loaded on some vCPUs while L1 is booting (noticed when booting L1
> >> with a large number of vCPUs). This happens because, at this early
> >> stage, E2H (hyp context) is not yet set, so the trap on the ERET of
> >> L1's boot-strap code performs a context switch as if it were
> >> returning to L2 (guest enter), loading an uninitialized MMU table
> >> on those vCPUs and ending in unrecoverable traps and aborts.
> > 
> > I'm not sure I understand the problem you're describing here.
> 
> IIUC, when the S2 fault happens, the faulting vCPU gets the pages
> from the qemu process, maps them at S2 and copies the code into the
> allocated memory. Meanwhile, the other vCPUs racing to come online
> find the mapping when they switch over to the dummy S2 and return to
> L1, and their subsequent execution does not fault. Instead they fetch
> from memory where no code exists yet (for some of them), generate a
> stage 1 instruction abort, jump to the abort handler where there is
> no code either, and keep aborting. This happens on random vCPUs (no
> pattern).

Why is that any different from the way we handle faults in the
non-nested case? If there is a case where we can map the PTE at S2
before the data is available, this is a generic bug that can trigger
irrespective of NV.

> > What is the race exactly? Why isn't the shadow S2 good enough? Not
> > having HCR_EL2.VM set doesn't mean we can use the same S2, as the
> > TLBs are tagged by a different VMID, so staying on the canonical S2
> > seems wrong.
> 
> IMO, it is unnecessary to switch over on the first ERET while L1 is
> booting and to repeat the faults and page allocation, which is dummy
> anyway once L1 switches to E2H.

It is mandated by the architecture. EL1 is, by definition, a different
translation regime from EL2. So we *must* have a different S2, because
that defines the boundaries of TLB creation and invalidation. The fact
that these are the same pages is totally irrelevant.

> Let L1 always use its own S2, which is created by L0. We should even
> consider avoiding the entry created for L1 in the array of S2 MMUs
> (the first entry in the array), and avoid the unnecessary
> iteration/lookup when unmapping NestedVMs.

I'm sorry, but this is just wrong. You are merging the EL1 and EL2
translation regimes, which is not acceptable.

> I am anticipating this unwanted switch-over won't happen when we
> have NV2-only support in V12?

V11 is already NV2 only, so I really don't get what you mean here.
Everything stays the same, and there is nothing to change here.

What you describe looks like a terrible bug somewhere on the
page-fault path that has the potential to impact non-NV, and I'd like
to focus on that.

I've been booting my L1 with a fairly large number of vcpus (32 vcpus
for 6 physical CPUs), and I don't see this.
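For reference, this is roughly the ordering I expect the fault path to
preserve. It is only a sketch of the idea, not the actual fault
handling code, and resolve_guest_pfn()/install_s2_mapping() are
made-up names standing in for the real helpers:

/*
 * Sketch of the ordering a stage-2 fault handler has to respect.
 * resolve_guest_pfn() and install_s2_mapping() are placeholders.
 */
static int handle_s2_fault_sketch(struct kvm_vcpu *vcpu, phys_addr_t ipa)
{
	struct kvm *kvm = vcpu->kvm;
	unsigned long mmu_seq;
	kvm_pfn_t pfn;
	int ret;

	/*
	 * Snapshot the invalidation sequence *before* resolving the
	 * pfn, so that a concurrent invalidation forces a retry below.
	 */
	mmu_seq = kvm->mmu_invalidate_seq;
	smp_rmb();

	/*
	 * Resolving the pfn goes through the userspace mapping, so any
	 * copy/CoW done by the VMM happens here: by the time we have a
	 * pfn, the data is in the page.
	 */
	pfn = resolve_guest_pfn(kvm, ipa >> PAGE_SHIFT);
	if (is_error_pfn(pfn))
		return -EFAULT;

	write_lock(&kvm->mmu_lock);

	/* The range was invalidated under our feet: start over. */
	if (mmu_invalidate_retry(kvm, mmu_seq)) {
		ret = -EAGAIN;
		goto out_unlock;
	}

	/*
	 * Only now does the PTE become visible to the other vCPUs, so
	 * anyone who finds the mapping also finds the page contents.
	 */
	ret = install_s2_mapping(vcpu->arch.hw_mmu, ipa, pfn);

out_unlock:
	write_unlock(&kvm->mmu_lock);
	return ret;
}

A vCPU that observes the mapping should, by construction, also observe
the data.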
Since you seem to have a way to trigger it on your HW, can you please
pinpoint the situation where we map the page without having the
corresponding data?

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.