Re: [PATCH v11 17/43] KVM: arm64: nv: Support multiple nested Stage-2 mmu structures

Hi Marc,

On 25-01-2024 02:28 pm, Marc Zyngier wrote:
> On Thu, 25 Jan 2024 08:14:32 +0000,
> Ganapatrao Kulkarni <gankulkarni@xxxxxxxxxxxxxxxxxxxxxx> wrote:


>> Hi Marc,
>>
>> On 23-01-2024 07:56 pm, Marc Zyngier wrote:
>>> Hi Ganapatrao,
>>>
>>> On Tue, 23 Jan 2024 09:55:32 +0000,
>>> Ganapatrao Kulkarni <gankulkarni@xxxxxxxxxxxxxxxxxxxxxx> wrote:

>>>> Hi Marc,

>>>>> +void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu)
>>>>> +{
>>>>> +	if (is_hyp_ctxt(vcpu)) {
>>>>> +		vcpu->arch.hw_mmu = &vcpu->kvm->arch.mmu;
>>>>> +	} else {
>>>>> +		write_lock(&vcpu->kvm->mmu_lock);
>>>>> +		vcpu->arch.hw_mmu = get_s2_mmu_nested(vcpu);
>>>>> +		write_unlock(&vcpu->kvm->mmu_lock);
>>>>> +	}

>>>> Due to a race, a non-existent L2's MMU table gets loaded on some
>>>> vCPUs while L1 is booting (noticed when booting L1 with a large
>>>> number of vCPUs). This happens because, at that early stage, E2H
>>>> (the hyp context) is not yet set, so the trap on ERET from L1's
>>>> boot-strap code results in a context switch as if it were returning
>>>> to L2 (a guest entry), and an uninitialised MMU table is loaded on
>>>> those vCPUs, resulting in unrecoverable traps and aborts.

>>> I'm not sure I understand the problem you're describing here.


>> IIUC, when the S2 fault happens, the faulting vCPU gets the pages from
>> the QEMU process, maps them at S2 and copies the code into the
>> allocated memory. Meanwhile, the other vCPUs racing to come online
>> switch over to the dummy S2, find the mapping already present and
>> return to L1; their subsequent execution does not fault and instead
>> fetches from memory where no code exists yet (for some of them),
>> generating a stage 1 instruction abort and jumping to an abort handler
>> where no code exists either, so they keep aborting. This happens on
>> random vCPUs (no pattern).

> Why is that any different from the way we handle faults in the
> non-nested case? If there is a case where we can map the PTE at S2
> before the data is available, this is a generic bug that can trigger
> irrespective of NV.
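
For context, this is the kind of ordering the generic arm64 fault path
already enforces: the MMU-notifier sequence count is sampled before the
page is resolved from userspace, and the stage-2 mapping is only
installed under mmu_lock if no invalidation ran in between. A rough,
simplified sketch of that ordering (this is not the actual
user_mem_abort() code, and stage2_map_page() is a made-up stand-in for
the real mapping call):

static int handle_s2_fault_sketch(struct kvm_vcpu *vcpu, phys_addr_t ipa)
{
	struct kvm *kvm = vcpu->kvm;
	unsigned long mmu_seq;
	kvm_pfn_t pfn;
	int ret;

	/* Sample the invalidation sequence before touching user memory. */
	mmu_seq = kvm->mmu_invalidate_seq;
	smp_rmb();

	/* May fault the page in from the VMM (e.g. QEMU) address space. */
	pfn = gfn_to_pfn(kvm, ipa >> PAGE_SHIFT);
	if (is_error_pfn(pfn))
		return -EFAULT;

	write_lock(&kvm->mmu_lock);

	/* An MMU-notifier invalidation raced with us: retry the fault. */
	if (mmu_invalidate_retry(kvm, mmu_seq)) {
		ret = -EAGAIN;
		goto out_unlock;
	}

	/* Only now does the PTE become visible to other vCPUs. */
	ret = stage2_map_page(vcpu->arch.hw_mmu, ipa, pfn); /* assumed helper */

out_unlock:
	write_unlock(&kvm->mmu_lock);
	kvm_release_pfn_clean(pfn);
	return ret;
}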


>>> What is the race exactly? Why isn't the shadow S2 good enough? Not
>>> having HCR_EL2.VM set doesn't mean we can use the same S2, as the TLBs
>>> are tagged by a different VMID, so staying on the canonical S2 seems
>>> wrong.

>> IMO, it is unnecessary to switch over on the first ERET while L1 is
>> booting and to repeat the faults and page allocation, which becomes
>> redundant anyway once L1 switches to E2H.

> It is mandated by the architecture. EL1 is, by definition, a different
> translation regime from EL2. So we *must* have a different S2, because
> that defines the boundaries of TLB creation and invalidation. The fact
> that these are the same pages is totally irrelevant.
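
To illustrate the point about TLB tagging: in the NV scheme each shadow
stage-2 gets its own kvm_s2_mmu (and hence its own host VMID), and
get_s2_mmu_nested() has to pick the structure matching what the L1
hypervisor has programmed rather than reuse the canonical S2. A
simplified sketch of such a lookup -- the field and helper names here
(nested_mmus, nested_mmus_size, tlb_vttbr, kvm_s2_mmu_valid()) are
loosely borrowed from the series and should be treated as assumptions,
not the exact v11 code:

static struct kvm_s2_mmu *find_nested_mmu_sketch(struct kvm_vcpu *vcpu)
{
	struct kvm *kvm = vcpu->kvm;
	u64 guest_vttbr = vcpu_read_sys_reg(vcpu, VTTBR_EL2);
	int i;

	/* Assumes the caller holds kvm->mmu_lock, as in the hunk above. */
	for (i = 0; i < kvm->arch.nested_mmus_size; i++) {
		struct kvm_s2_mmu *mmu = &kvm->arch.nested_mmus[i];

		/* Match on the VTTBR (baddr + VMID) the L1 guest programmed. */
		if (kvm_s2_mmu_valid(mmu) && mmu->tlb_vttbr == guest_vttbr)
			return mmu;
	}

	/* No match: a fresh shadow S2 would be allocated/recycled here. */
	return NULL;
}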

>> Let L1 always use its S2, which is created by L0. We should also
>> consider avoiding the entry created for L1 in the array of S2 MMUs
>> (the first entry in the array), and so avoid unnecessary
>> iteration/lookup when unmapping NestedVMs.

> I'm sorry, but this is just wrong. You are merging the EL1 and EL2
> translation regimes, which is not acceptable.

>> I am anticipating that this unwanted switch-over won't happen once we
>> have NV2-only support in V12?

> V11 is already NV2 only, so I really don't get what you mean here.
> Everything stays the same, and there is nothing to change here.


I am still using V10, since V11 (and also V12/nv-6.9-sr-enforcement)
has issues booting with QEMU. I tried V11 with my local branch of QEMU,
which is 7.2 based, and also with Eric's QEMU [1], which is rebased on
8.2. The issue is that QEMU crashes at the very beginning; I am not
sure about the cause and have yet to debug it.

[1] https://github.com/eauger/qemu/tree/v8.2-nv

> What you describe looks like a terrible bug somewhere on the
> page-fault path that has the potential to impact non-NV, and I'd like
> to focus on that.

I found the bug/issue and fixed it.
The problem was very random and showed up when trying to boot L1 with a
large number of cores (200 to 300+).

To fix the performance issue [2] caused by the unmapping of shadow
tables, I have implemented (yet to be sent to the ML for review) a
lookup table so that only the shadow IPAs that are actually mapped get
unmapped, instead of unmapping the complete shadow S2 of all active
NestedVMs.

This lookup table was not recording the mappings created for L1 while
it is using the shadow S2 MMU (my bad, I missed that L1 hops between
vEL2 and EL1 at the booting stage). Hence, when a page migration
happened, the unmap was not done for those pages, resulting in some of
the vCPUs of L1 accessing stale pages/memory.

I have modified the check used when adding a shadow-IPA to PA mapping
to the lookup table, so that it covers both the case where the page is
being mapped for a NestedVM and the case where it is mapped for L1
while L1 is using a shadow S2.

[2] https://www.spinics.net/lists/kvm/msg326638.html
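
For illustration only, here is a rough sketch of what such a shadow-IPA
lookup table could look like. This is not the actual (unposted) patch;
every name in it (struct shadow_ipa_ent, shadow_ipa_table_add(),
shadow_ipa_table_unmap(), kvm->arch.shadow_ipa_list and
unmap_shadow_s2_range()) is hypothetical:

struct shadow_ipa_ent {
	struct list_head link;
	struct kvm_s2_mmu *mmu;		/* shadow S2 holding the mapping */
	phys_addr_t shadow_ipa;		/* IPA as mapped in the shadow S2 */
	phys_addr_t canon_ipa;		/* canonical IPA backing it */
	size_t size;
};

/*
 * Record a shadow mapping at creation time, for *any* non-hyp context --
 * including L1 running at EL1 on a shadow S2, which was the missed case.
 */
static void shadow_ipa_table_add(struct kvm *kvm, struct kvm_s2_mmu *mmu,
				 phys_addr_t shadow_ipa,
				 phys_addr_t canon_ipa, size_t size)
{
	struct shadow_ipa_ent *ent = kzalloc(sizeof(*ent), GFP_KERNEL_ACCOUNT);

	if (!ent)
		return;

	ent->mmu = mmu;
	ent->shadow_ipa = shadow_ipa;
	ent->canon_ipa = canon_ipa;
	ent->size = size;

	write_lock(&kvm->mmu_lock);
	list_add(&ent->link, &kvm->arch.shadow_ipa_list);
	write_unlock(&kvm->mmu_lock);
}

/*
 * On invalidation of a canonical IPA range (e.g. page migration), tear
 * down only the recorded shadow mappings instead of the complete shadow
 * S2 of every active NestedVM.
 */
static void shadow_ipa_table_unmap(struct kvm *kvm, phys_addr_t canon_ipa,
				   size_t size)
{
	struct shadow_ipa_ent *ent, *tmp;

	write_lock(&kvm->mmu_lock);
	list_for_each_entry_safe(ent, tmp, &kvm->arch.shadow_ipa_list, link) {
		/* Skip entries that do not overlap the invalidated range. */
		if (ent->canon_ipa >= canon_ipa + size ||
		    ent->canon_ipa + ent->size <= canon_ipa)
			continue;

		unmap_shadow_s2_range(ent->mmu, ent->shadow_ipa, ent->size);
		list_del(&ent->link);
		kfree(ent);
	}
	write_unlock(&kvm->mmu_lock);
}

A real implementation would presumably key this per shadow MMU (or use
an xarray rather than a flat list), but the flow is the same: record at
map time -- including for L1 on a shadow S2 -- and walk the table at
invalidate time.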


> I've been booting my L1 with a fairly large number of vcpus (32 vcpus
> for 6 physical CPUs), and I don't see this.
>
> Since you seem to have a way to trigger it on your HW, can you please
> pinpoint the situation where we map the page without having the
> corresponding data?

> Thanks,
>
> 	M.


Thanks,
Ganapat



