Hi Marc,
On 23-01-2024 07:56 pm, Marc Zyngier wrote:
Hi Ganapatrao,
On Tue, 23 Jan 2024 09:55:32 +0000,
Ganapatrao Kulkarni <gankulkarni@xxxxxxxxxxxxxxxxxxxxxx> wrote:
Hi Marc,
+void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu)
+{
+	if (is_hyp_ctxt(vcpu)) {
+		vcpu->arch.hw_mmu = &vcpu->kvm->arch.mmu;
+	} else {
+		write_lock(&vcpu->kvm->mmu_lock);
+		vcpu->arch.hw_mmu = get_s2_mmu_nested(vcpu);
+		write_unlock(&vcpu->kvm->mmu_lock);
+	}
Due to a race, a non-existent L2's MMU table gets loaded for some of the
vCPUs while booting L1 (noticed when booting L1 with a large number of
vCPUs). This happens because, at this early stage, E2H (hyp context) is
not yet set, so the trap on the ERET of the L1 boot-strap code results
in a context switch as if we were returning to L2 (guest enter), loading
an uninitialized MMU table on those vCPUs and causing unrecoverable
traps and aborts.
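To make the failing decision concrete, here is a small userspace model
of the hw_mmu selection at vcpu_load time. The struct vcpu, in_vel2
flag, and is_hyp_ctxt() below are deliberate toy simplifications of the
kernel code, not the real definitions:

#include <stdbool.h>
#include <stdio.h>

#define HCR_VM	(1UL << 0)	/* HCR_EL2.VM: guest stage-2 enable */

/* toy stand-in for struct kvm_vcpu */
struct vcpu {
	unsigned long	hcr_el2;	/* L1's virtual HCR_EL2 */
	bool		in_vel2;	/* PSTATE says virtual EL2 */
};

/* crude model of is_hyp_ctxt(): only looks at the exception level */
static bool is_hyp_ctxt(const struct vcpu *v)
{
	return v->in_vel2;
}

int main(void)
{
	/* early L1 boot: ERET to EL1 trapped before HCR_EL2.VM is set */
	struct vcpu v = { .hcr_el2 = 0, .in_vel2 = false };

	/* current selection: anything that is not a hyp context gets a
	 * shadow S2, even though L1 never enabled its own stage-2 */
	if (is_hyp_ctxt(&v))
		puts("canonical S2 (L1 context)");
	else
		puts("shadow S2 (treated as a return to L2)");

	/* with the HCR_EL2.VM check proposed further down, this vCPU
	 * would stay on the canonical S2 instead */
	if (is_hyp_ctxt(&v) || !(v.hcr_el2 & HCR_VM))
		puts("canonical S2 (stage-2 disabled, stay on L1's S2)");

	return 0;
}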
I'm not sure I understand the problem you're describing here.
IIUC, when the S2 fault happens, the faulting vCPU gets the pages from
the qemu process, maps them in S2, and copies the code into the
allocated memory. Meanwhile, the other vCPUs racing to come online
switch over to the dummy S2, find the mapping, and return to L1; their
subsequent execution does not fault, but instead fetches from memory
where (for some of them) no code exists yet, generating a stage 1
instruction abort and jumping to the abort handler, where again no code
exists, so they keep aborting. This happens on random vCPUs (no
pattern).
What is the race exactly? Why isn't the shadow S2 good enough? Not
having HCR_EL2.VM set doesn't mean we can use the same S2, as the TLBs
are tagged by a different VMID, so staying on the canonical S2 seems
wrong.
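To illustrate the tagging point with a toy model (the four-entry TLB
and its helpers here are made up for the example, not the real hardware
or kernel structures): entries created under one VMID are invisible
under another, which is why each S2 needs its own MMU and VMID.

#include <stdio.h>

/* toy TLB: entries are tagged with the VMID that created them */
struct tlb_entry {
	unsigned int	vmid;
	unsigned long	ipa;
	unsigned long	pa;
};

static struct tlb_entry tlb[4];
static unsigned int nr_entries;

static void tlb_fill(unsigned int vmid, unsigned long ipa, unsigned long pa)
{
	tlb[nr_entries++] = (struct tlb_entry){ vmid, ipa, pa };
}

/* a lookup only hits entries tagged with the current VMID */
static int tlb_lookup(unsigned int vmid, unsigned long ipa, unsigned long *pa)
{
	for (unsigned int i = 0; i < nr_entries; i++) {
		if (tlb[i].vmid == vmid && tlb[i].ipa == ipa) {
			*pa = tlb[i].pa;
			return 1;
		}
	}
	return 0;
}

int main(void)
{
	unsigned long pa;

	tlb_fill(1, 0x80000, 0x40000);	/* filled under the canonical S2 */

	/* the same IPA under a shadow S2 (VMID 2) misses: the walk must
	 * go through that S2's own tables */
	printf("VMID 1: %s\n", tlb_lookup(1, 0x80000, &pa) ? "hit" : "miss");
	printf("VMID 2: %s\n", tlb_lookup(2, 0x80000, &pa) ? "hit" : "miss");

	return 0;
}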
IMO, it is unnecessary to switch over on the first ERET while L1 is
booting and to repeat the faults and page allocation, which become dummy
anyway once L1 switches to E2H.
Let L1 always use its own S2, the one created by L0. We should even
consider avoiding the entry created for L1 in the array of S2-MMUs (its
first entry), so as to avoid unnecessary iteration/lookup when unmapping
NestedVMs; see the sketch after this paragraph.
I am anticipating this unwanted switch-over won't happen when we have
NV2-only support in V12?
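Something along these lines is what I have in mind, as a standalone
model (kvm_s2_mmu and nested_mmus mirror this series' naming; the
valid flag, unmap_shadow_s2(), and everything else below are stubs
invented for the example):

#include <stdbool.h>
#include <stdio.h>

struct kvm_s2_mmu {
	bool		valid;	/* stands in for the real validity check */
	unsigned long	vttbr;
};

struct kvm_arch {
	struct kvm_s2_mmu	nested_mmus[4];
	unsigned int		nested_mmus_size;
};

static void unmap_shadow_s2(struct kvm_s2_mmu *mmu)
{
	printf("unmapping shadow S2 with VTTBR 0x%lx\n", mmu->vttbr);
}

int main(void)
{
	struct kvm_arch arch = {
		/* only real shadow S2s live here; L1's canonical S2 is
		 * kept separate and never needs this walk */
		.nested_mmus		= { { true, 0x1000 }, { false, 0 } },
		.nested_mmus_size	= 2,
	};

	for (unsigned int i = 0; i < arch.nested_mmus_size; i++) {
		struct kvm_s2_mmu *mmu = &arch.nested_mmus[i];

		if (!mmu->valid)
			continue;
		unmap_shadow_s2(mmu);
	}

	return 0;
}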
My expectations are that the L1 ERET from EL2 to EL1 is trapped, and
that we pick an empty S2 and start populating it. What fails in this
process?
Adding code to check that stage 2 is enabled (the diff below fixes the
issue) avoids the false ERET handling and continues with L1's MMU
context.
diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
index 340e2710cdda..1901dd19d770 100644
--- a/arch/arm64/kvm/nested.c
+++ b/arch/arm64/kvm/nested.c
@@ -759,7 +759,12 @@ void kvm_init_nested_s2_mmu(struct kvm_s2_mmu *mmu)
 void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu)
 {
-	if (is_hyp_ctxt(vcpu)) {
+	bool nested_stage2_enabled = vcpu_read_sys_reg(vcpu, HCR_EL2) & HCR_VM;
+
+	/* Load the L2 mmu only if nested_stage2_enabled, to avoid an
+	 * mmu load due to a false ERET trap.
+	 */
+	if (is_hyp_ctxt(vcpu) || !nested_stage2_enabled) {
 		vcpu->arch.hw_mmu = &vcpu->kvm->arch.mmu;
 	} else {
 		write_lock(&vcpu->kvm->mmu_lock);
As I said above, this doesn't look right.
Hoping we don't hit this when we move completely to the NV2-based
implementation and E2H is always set?
No, the same constraints apply. I don't see why this would change.
Thanks,
M.
Thanks,
Ganapat