Re: [PATCH v5 18/43] arm64: RME: Handle realm enter/exit

Suzuki K Poulose <suzuki.poulose@xxxxxxx> · Fri, 29 Nov 2024 13:45:25 +0000

Hi Steven

On 29/11/2024 12:18, Steven Price wrote:
Hi Suzuki,

Sorry for the very slow response to this. Coming back to this I'm having
doubts, see below.

On 17/10/2024 14:00, Suzuki K Poulose wrote:
On 04/10/2024 16:27, Steven Price wrote:
Entering a realm is done using a SMC call to the RMM. On exit the
exit-codes need to be handled slightly differently to the normal KVM
path so define our own functions for realm enter/exit and hook them
in if the guest is a realm guest.

Signed-off-by: Steven Price <steven.price@xxxxxxx>
...

diff --git a/arch/arm64/kvm/rme-exit.c b/arch/arm64/kvm/rme-exit.c
new file mode 100644
index 000000000000..e96ea308212c
--- /dev/null
+++ b/arch/arm64/kvm/rme-exit.c
...
+static int rec_exit_ripas_change(struct kvm_vcpu *vcpu)
+{
+    struct kvm *kvm = vcpu->kvm;
+    struct realm *realm = &kvm->arch.realm;
+    struct realm_rec *rec = &vcpu->arch.rec;
+    unsigned long base = rec->run->exit.ripas_base;
+    unsigned long top = rec->run->exit.ripas_top;
+    unsigned long ripas = rec->run->exit.ripas_value;
+    unsigned long top_ipa;
+    int ret;
+
+    if (!realm_is_addr_protected(realm, base) ||
+        !realm_is_addr_protected(realm, top - 1)) {
+        kvm_err("Invalid RIPAS_CHANGE for %#lx - %#lx, ripas: %#lx\n",
+            base, top, ripas);
+        return -EINVAL;
+    }
+
+    kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_page_cache,
+                   kvm_mmu_cache_min_pages(vcpu->arch.hw_mmu));

I think we also need to filter the request for RIPAS_RAM, by consulting
if the "range" is backed by a memslot or not. If they are not, we should
reject the request with a response flag set in run.enter.flags.

It's an interesting API question. At the moment there is no requirement
to have an active memslot to set the RIPAS - this is true both during
the setup by the VMM and at run time.

In theory a VMM can create/destroy memslots while the guest is running.
So absense of a memslot doesn't actually imply that the RIPAS change

Agreed. Whether an IPA range may be used as RAM is a decision that the
VMM must make. So, we could give the VMM a chance to respond to this
request before we (KVM) make the RTT changes.

should be rejected. Obviously with realms this is tricky because when
destroying a memslot that's in use KVM would rip those pages out from
the guest and it would require guest cooperation to restore those pages
(transition to RIPAS_EMPTY and back to RIPAS_RAM). But it's not
something that has been prohibited so far.

True, and it shouldn't be prohibited. If the Host wants to take away a
memslot it must be able to do that. But if it wants to do that in
good faith with the Realm, there must have been some communication
(e.g., virtio-mem ?) between the Host and the Realm and as long as the
Realm knows not to trust the contents on that region it could be 
recovered without a transition to EMPTY.

e.g. From RIPAS_DESTROYED => RIPAS_RAM with RSI_SET_IPA_STATE(... 
CHANGE_DESTROYED).



On the other hand this is a clear way for a (malicious/buggy) guest to
use a fair bit of RAM by transitioning to RIPAS_RAM (sparse) pages not
in a memslot and forcing KVM to allocate the RTT pages to delegate to
the RMM. But we do exit to the VMM, so this is solvable in the VMM (by
killing a misbehaving guest). The number of pages this would consume per
exit is also fairly small.

Correct. If the VMM has no intention to provide memory at a given IPA
range, KVM shouldn't report RSI_ACCEPT to the Realm and the Realm later
gets a stage2 fault that cannot be serviced by KVM.


So my instinct is that we shouldn't impose that requirement.

I think we may be able to fix this by letting the VMM ACCEPT or REJECT
a given RIPAS_RAM transition request. That way, KVM isn't playing by
the rules set by the VMM and whether the VMM wants to trick the Realm
or play by the rules is upto it.



Any thoughts?

As for EMPTY requests, if the guest wants to explicitly mark any range
as EMPTY, it doesn't matter, as long as it is within the protected IPA.
(even though they may be EMPTY in the first place).

+    write_lock(&kvm->mmu_lock);
+    ret = realm_set_ipa_state(vcpu, base, top, ripas, &top_ipa);
+    write_unlock(&kvm->mmu_lock);
+
+    WARN(ret && ret != -ENOMEM,
+         "Unable to satisfy RIPAS_CHANGE for %#lx - %#lx, ripas:
%#lx\n",
+         base, top, ripas);
+
+    /* Exit to VMM to complete the change */
+    kvm_prepare_memory_fault_exit(vcpu, base, top_ipa - base, false,
false,
+                      ripas == RMI_RAM);

Again this may only be need if the range is backed by a memslot ?
Otherwise the VMM has nothing to do.

Assuming the above, then the VMM would be the one to kill a misbehaving
guest, so would need a notification.

May be we could reverse the order of operations by delaying the 
realm_set_ipa_state() to occur on VMMs request from the memory_fault_exit.


Suzuki


Thanks,
Steve

+
+    return 0;
+}
+
+static void update_arch_timer_irq_lines(struct kvm_vcpu *vcpu)
+{
+    struct realm_rec *rec = &vcpu->arch.rec;
+
+    __vcpu_sys_reg(vcpu, CNTV_CTL_EL0) = rec->run->exit.cntv_ctl;
+    __vcpu_sys_reg(vcpu, CNTV_CVAL_EL0) = rec->run->exit.cntv_cval;
+    __vcpu_sys_reg(vcpu, CNTP_CTL_EL0) = rec->run->exit.cntp_ctl;
+    __vcpu_sys_reg(vcpu, CNTP_CVAL_EL0) = rec->run->exit.cntp_cval;
+
+    kvm_realm_timers_update(vcpu);
+}
+
+/*
+ * Return > 0 to return to guest, < 0 on error, 0 (and set
exit_reason) on
+ * proper exit to userspace.
+ */
+int handle_rec_exit(struct kvm_vcpu *vcpu, int rec_run_ret)
+{
+    struct realm_rec *rec = &vcpu->arch.rec;
+    u8 esr_ec = ESR_ELx_EC(rec->run->exit.esr);
+    unsigned long status, index;
+
+    status = RMI_RETURN_STATUS(rec_run_ret);
+    index = RMI_RETURN_INDEX(rec_run_ret);
+
+    /*
+     * If a PSCI_SYSTEM_OFF request raced with a vcpu executing, we
might
+     * see the following status code and index indicating an attempt
to run
+     * a REC when the RD state is SYSTEM_OFF.  In this case, we just
need to
+     * return to user space which can deal with the system event or
will try
+     * to run the KVM VCPU again, at which point we will no longer
attempt
+     * to enter the Realm because we will have a sleep request
pending on
+     * the VCPU as a result of KVM's PSCI handling.
+     */
+    if (status == RMI_ERROR_REALM && index == 1) {
+        vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
+        return 0;
+    }
+
+    if (rec_run_ret)
+        return -ENXIO;
+
+    vcpu->arch.fault.esr_el2 = rec->run->exit.esr;
+    vcpu->arch.fault.far_el2 = rec->run->exit.far;
+    vcpu->arch.fault.hpfar_el2 = rec->run->exit.hpfar;
+
+    update_arch_timer_irq_lines(vcpu);
+
+    /* Reset the emulation flags for the next run of the REC */
+    rec->run->enter.flags = 0;
+
+    switch (rec->run->exit.exit_reason) {
+    case RMI_EXIT_SYNC:
+        return rec_exit_handlers[esr_ec](vcpu);
+    case RMI_EXIT_IRQ:
+    case RMI_EXIT_FIQ:
+        return 1;
+    case RMI_EXIT_PSCI:
+        return rec_exit_psci(vcpu);
+    case RMI_EXIT_RIPAS_CHANGE:
+        return rec_exit_ripas_change(vcpu);
+    }
+
+    kvm_pr_unimpl("Unsupported exit reason: %u\n",
+              rec->run->exit.exit_reason);
+    vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
+    return 0;
+}
diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
index 1fa9991d708b..4c0751231810 100644
--- a/arch/arm64/kvm/rme.c
+++ b/arch/arm64/kvm/rme.c
@@ -899,6 +899,25 @@ void kvm_destroy_realm(struct kvm *kvm)
       kvm_free_stage2_pgd(&kvm->arch.mmu);
   }
   +int kvm_rec_enter(struct kvm_vcpu *vcpu)
+{
+    struct realm_rec *rec = &vcpu->arch.rec;
+
+    switch (rec->run->exit.exit_reason) {
+    case RMI_EXIT_HOST_CALL:
+    case RMI_EXIT_PSCI:
+        for (int i = 0; i < REC_RUN_GPRS; i++)
+            rec->run->enter.gprs[i] = vcpu_get_reg(vcpu, i);
+        break;
+    }

As mentioned in the patch following (MMIO emulation support), we may be
able to do this unconditionally for all REC entries, to cover ourselves
from missing out other cases. The RMM is in charge of taking the
appropriate action anyways to copy the results back.

Suzuki

+
+    if (kvm_realm_state(vcpu->kvm) != REALM_STATE_ACTIVE)
+        return -EINVAL;
+
+    return rmi_rec_enter(virt_to_phys(rec->rec_page),
+                 virt_to_phys(rec->run));
+}
+
   static void free_rec_aux(struct page **aux_pages,
                unsigned int num_aux)
   {