Re: [PATCH v5 18/43] arm64: RME: Handle realm enter/exit

Steven Price <steven.price@xxxxxxx> · Fri, 29 Nov 2024 14:55:36 +0000

On 29/11/2024 13:45, Suzuki K Poulose wrote:
> Hi Steven
> 
> On 29/11/2024 12:18, Steven Price wrote:
>> Hi Suzuki,
>>
>> Sorry for the very slow response to this. Coming back to this I'm having
>> doubts, see below.
>>
>> On 17/10/2024 14:00, Suzuki K Poulose wrote:
>>> On 04/10/2024 16:27, Steven Price wrote:
>>>> Entering a realm is done using a SMC call to the RMM. On exit the
>>>> exit-codes need to be handled slightly differently to the normal KVM
>>>> path so define our own functions for realm enter/exit and hook them
>>>> in if the guest is a realm guest.
>>>>
>>>> Signed-off-by: Steven Price <steven.price@xxxxxxx>
>> ...
>>>> diff --git a/arch/arm64/kvm/rme-exit.c b/arch/arm64/kvm/rme-exit.c
>>>> new file mode 100644
>>>> index 000000000000..e96ea308212c
>>>> --- /dev/null
>>>> +++ b/arch/arm64/kvm/rme-exit.c
>> ...
>>>> +static int rec_exit_ripas_change(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +    struct kvm *kvm = vcpu->kvm;
>>>> +    struct realm *realm = &kvm->arch.realm;
>>>> +    struct realm_rec *rec = &vcpu->arch.rec;
>>>> +    unsigned long base = rec->run->exit.ripas_base;
>>>> +    unsigned long top = rec->run->exit.ripas_top;
>>>> +    unsigned long ripas = rec->run->exit.ripas_value;
>>>> +    unsigned long top_ipa;
>>>> +    int ret;
>>>> +
>>>> +    if (!realm_is_addr_protected(realm, base) ||
>>>> +        !realm_is_addr_protected(realm, top - 1)) {
>>>> +        kvm_err("Invalid RIPAS_CHANGE for %#lx - %#lx, ripas: %#lx\n",
>>>> +            base, top, ripas);
>>>> +        return -EINVAL;
>>>> +    }
>>>> +
>>>> +    kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_page_cache,
>>>> +                   kvm_mmu_cache_min_pages(vcpu->arch.hw_mmu));
>>>
>>> I think we also need to filter the request for RIPAS_RAM, by consulting
>>> if the "range" is backed by a memslot or not. If they are not, we should
>>> reject the request with a response flag set in run.enter.flags.
>>
>> It's an interesting API question. At the moment there is no requirement
>> to have an active memslot to set the RIPAS - this is true both during
>> the setup by the VMM and at run time.
>>
>> In theory a VMM can create/destroy memslots while the guest is running.
>> So absence of a memslot doesn't actually imply that the RIPAS change
> 
> Agreed. Whether an IPA range may be used as RAM is a decision that the
> VMM must make. So, we could give the VMM a chance to respond to this
> request before we (KVM) make the RTT changes.
> 
>> should be rejected. Obviously with realms this is tricky because when
>> destroying a memslot that's in use KVM would rip those pages out from
>> the guest and it would require guest cooperation to restore those pages
>> (transition to RIPAS_EMPTY and back to RIPAS_RAM). But it's not
>> something that has been prohibited so far.
> 
> True, and it shouldn't be prohibited. If the Host wants to take away a
> memslot it must be able to do that. But if it wants to do that in
> good faith with the Realm, there must have been some communication
> (e.g., virtio-mem?) between the Host and the Realm and as long as the
> Realm knows not to trust the contents of that region it could be
> recovered without a transition to EMPTY.
> 
> e.g. From RIPAS_DESTROYED => RIPAS_RAM with RSI_SET_IPA_STATE(...
> CHANGE_DESTROYED).

Indeed - I always forget RSI_SET_IPA_STATE has two modes these days.

>>
>> On the other hand this is a clear way for a (malicious/buggy) guest to
>> use a fair bit of RAM by transitioning to RIPAS_RAM (sparse) pages not
>> in a memslot and forcing KVM to allocate the RTT pages to delegate to
>> the RMM. But we do exit to the VMM, so this is solvable in the VMM (by
>> killing a misbehaving guest). The number of pages this would consume per
>> exit is also fairly small.
> 
> Correct. If the VMM has no intention to provide memory at a given IPA
> range, KVM shouldn't report RSI_ACCEPT to the Realm, only for the Realm
> to later take a stage2 fault that cannot be serviced by KVM.
> 
>>
>> So my instinct is that we shouldn't impose that requirement.
> 
> I think we may be able to fix this by letting the VMM ACCEPT or REJECT
> a given RIPAS_RAM transition request. That way, KVM isn't playing by
> the rules set by the VMM and whether the VMM wants to trick the Realm
> or play by the rules is up to it.

Sounds good to me.
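
To check we mean the same thing, here's roughly the kind of policy I'd
expect a VMM to apply. It's only a userspace model, not kernel or VMM
code: struct vmm_range, vmm_backs_range(), vmm_ripas_response() and
MODEL_ENTER_FLAG_RIPAS_REJECT are all made up for illustration, and the
flag value is a placeholder rather than anything taken from the RMM
spec or this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: an IPA range the VMM is willing to back with memory. */
struct vmm_range {
	unsigned long base;
	unsigned long top;	/* exclusive */
};

/* Placeholder bit for illustration only, not the real flag encoding. */
#define MODEL_ENTER_FLAG_RIPAS_REJECT	(1UL << 0)

/* True if [base, top) lies entirely inside one of the VMM's ranges. */
static bool vmm_backs_range(const struct vmm_range *ranges, size_t nr,
			    unsigned long base, unsigned long top)
{
	for (size_t i = 0; i < nr; i++) {
		if (base >= ranges[i].base && top <= ranges[i].top)
			return true;
	}
	return false;
}

/*
 * VMM-side decision for a RIPAS_RAM request: 0 means accept, otherwise
 * the (placeholder) reject flag to hand back on the next entry.
 */
static unsigned long vmm_ripas_response(const struct vmm_range *ranges,
					size_t nr, unsigned long base,
					unsigned long top)
{
	assert(base < top);
	return vmm_backs_range(ranges, nr, base, top) ?
		0 : MODEL_ENTER_FLAG_RIPAS_REJECT;
}
```

The model rejects a request that straddles two ranges; a real VMM is
free to be smarter (or more permissive) than that, which is rather the
point of leaving the decision to it.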

>>
>> Any thoughts?
>>
>>> As for EMPTY requests, if the guest wants to explicitly mark any range
>>> as EMPTY, it doesn't matter, as long as it is within the protected IPA.
>>> (even though they may be EMPTY in the first place).
>>>
>>>> +    write_lock(&kvm->mmu_lock);
>>>> +    ret = realm_set_ipa_state(vcpu, base, top, ripas, &top_ipa);
>>>> +    write_unlock(&kvm->mmu_lock);
>>>> +
>>>> +    WARN(ret && ret != -ENOMEM,
>>>> +         "Unable to satisfy RIPAS_CHANGE for %#lx - %#lx, ripas: %#lx\n",
>>>> +         base, top, ripas);
>>>> +
>>>> +    /* Exit to VMM to complete the change */
>>>> +    kvm_prepare_memory_fault_exit(vcpu, base, top_ipa - base, false, false,
>>>> +                      ripas == RMI_RAM);
>>>
>>> Again this may only be needed if the range is backed by a memslot?
>>> Otherwise the VMM has nothing to do.
>>
>> Assuming the above, then the VMM would be the one to kill a misbehaving
>> guest, so would need a notification.
> 
> Maybe we could reverse the order of operations by delaying the
> realm_set_ipa_state() to occur on the VMM's request from the memory_fault_exit.

Ah, good point - moving the RIPAS state set to the entry path makes a
lot of sense. The only negative is that we push the loop handling
partial RIPAS changes into the KVM entry path - but I don't think that's
a major problem.
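
For the avoidance of doubt, the shape I have in mind is something like
the sketch below. It's an untested userspace model rather than a patch:
struct pending_ripas and the model_*() helpers are invented names,
model_set_ipa_state() is only a stand-in that mimics
realm_set_ipa_state() making partial progress via top_ipa, and the
vmm_accepts argument stands for whatever mechanism we end up with for
the VMM's decision.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PAGE_SIZE	4096UL

/* Hypothetical: a RIPAS change recorded at exit, applied at next entry. */
struct pending_ripas {
	unsigned long base;	/* next IPA still to be changed */
	unsigned long top;	/* exclusive end of the request */
	bool valid;		/* waiting for the VMM's decision */
};

/*
 * Stand-in for realm_set_ipa_state(): handles at most max_pages pages
 * per call and reports how far it got through *top_ipa.
 */
static int model_set_ipa_state(unsigned long base, unsigned long top,
			       unsigned long max_pages,
			       unsigned long *top_ipa)
{
	unsigned long end = base + max_pages * MODEL_PAGE_SIZE;

	*top_ipa = end < top ? end : top;
	return 0;
}

/* Exit path: only record the request, no RTT changes yet. */
static void model_exit_ripas_change(struct pending_ripas *p,
				    unsigned long base, unsigned long top)
{
	assert(base < top);
	p->base = base;
	p->top = top;
	p->valid = true;
}

/*
 * Entry path: if the VMM accepted, loop until the whole range is done.
 * Returns the number of model_set_ipa_state() calls that were needed.
 */
static int model_enter(struct pending_ripas *p, bool vmm_accepts,
		       unsigned long max_pages)
{
	int calls = 0;

	assert(max_pages > 0);
	if (!p->valid)
		return 0;

	while (vmm_accepts && p->base < p->top) {
		unsigned long top_ipa;

		if (model_set_ipa_state(p->base, p->top, max_pages, &top_ipa))
			break;
		p->base = top_ipa;
		calls++;
	}

	p->valid = false;
	return calls;
}
```

So a five page request where each call only manages two pages takes
three trips round the loop on entry, and a rejected request never
touches the RTTs at all.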

Thanks,
Steve

> 
> Suzuki
> 
>>
>> Thanks,
>> Steve
>>
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static void update_arch_timer_irq_lines(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +    struct realm_rec *rec = &vcpu->arch.rec;
>>>> +
>>>> +    __vcpu_sys_reg(vcpu, CNTV_CTL_EL0) = rec->run->exit.cntv_ctl;
>>>> +    __vcpu_sys_reg(vcpu, CNTV_CVAL_EL0) = rec->run->exit.cntv_cval;
>>>> +    __vcpu_sys_reg(vcpu, CNTP_CTL_EL0) = rec->run->exit.cntp_ctl;
>>>> +    __vcpu_sys_reg(vcpu, CNTP_CVAL_EL0) = rec->run->exit.cntp_cval;
>>>> +
>>>> +    kvm_realm_timers_update(vcpu);
>>>> +}
>>>> +
>>>> +/*
>>>> + * Return > 0 to return to guest, < 0 on error, 0 (and set exit_reason) on
>>>> + * proper exit to userspace.
>>>> + */
>>>> +int handle_rec_exit(struct kvm_vcpu *vcpu, int rec_run_ret)
>>>> +{
>>>> +    struct realm_rec *rec = &vcpu->arch.rec;
>>>> +    u8 esr_ec = ESR_ELx_EC(rec->run->exit.esr);
>>>> +    unsigned long status, index;
>>>> +
>>>> +    status = RMI_RETURN_STATUS(rec_run_ret);
>>>> +    index = RMI_RETURN_INDEX(rec_run_ret);
>>>> +
>>>> +    /*
>>>> +     * If a PSCI_SYSTEM_OFF request raced with a vcpu executing, we might
>>>> +     * see the following status code and index indicating an attempt to run
>>>> +     * a REC when the RD state is SYSTEM_OFF.  In this case, we just need to
>>>> +     * return to user space which can deal with the system event or will try
>>>> +     * to run the KVM VCPU again, at which point we will no longer attempt
>>>> +     * to enter the Realm because we will have a sleep request pending on
>>>> +     * the VCPU as a result of KVM's PSCI handling.
>>>> +     */
>>>> +    if (status == RMI_ERROR_REALM && index == 1) {
>>>> +        vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
>>>> +        return 0;
>>>> +    }
>>>> +
>>>> +    if (rec_run_ret)
>>>> +        return -ENXIO;
>>>> +
>>>> +    vcpu->arch.fault.esr_el2 = rec->run->exit.esr;
>>>> +    vcpu->arch.fault.far_el2 = rec->run->exit.far;
>>>> +    vcpu->arch.fault.hpfar_el2 = rec->run->exit.hpfar;
>>>> +
>>>> +    update_arch_timer_irq_lines(vcpu);
>>>> +
>>>> +    /* Reset the emulation flags for the next run of the REC */
>>>> +    rec->run->enter.flags = 0;
>>>> +
>>>> +    switch (rec->run->exit.exit_reason) {
>>>> +    case RMI_EXIT_SYNC:
>>>> +        return rec_exit_handlers[esr_ec](vcpu);
>>>> +    case RMI_EXIT_IRQ:
>>>> +    case RMI_EXIT_FIQ:
>>>> +        return 1;
>>>> +    case RMI_EXIT_PSCI:
>>>> +        return rec_exit_psci(vcpu);
>>>> +    case RMI_EXIT_RIPAS_CHANGE:
>>>> +        return rec_exit_ripas_change(vcpu);
>>>> +    }
>>>> +
>>>> +    kvm_pr_unimpl("Unsupported exit reason: %u\n",
>>>> +              rec->run->exit.exit_reason);
>>>> +    vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
>>>> +    return 0;
>>>> +}
>>>> diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
>>>> index 1fa9991d708b..4c0751231810 100644
>>>> --- a/arch/arm64/kvm/rme.c
>>>> +++ b/arch/arm64/kvm/rme.c
>>>> @@ -899,6 +899,25 @@ void kvm_destroy_realm(struct kvm *kvm)
>>>>        kvm_free_stage2_pgd(&kvm->arch.mmu);
>>>>    }
>>>> +int kvm_rec_enter(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +    struct realm_rec *rec = &vcpu->arch.rec;
>>>> +
>>>> +    switch (rec->run->exit.exit_reason) {
>>>> +    case RMI_EXIT_HOST_CALL:
>>>> +    case RMI_EXIT_PSCI:
>>>> +        for (int i = 0; i < REC_RUN_GPRS; i++)
>>>> +            rec->run->enter.gprs[i] = vcpu_get_reg(vcpu, i);
>>>> +        break;
>>>> +    }
>>>
>>> As mentioned in the patch following (MMIO emulation support), we may be
>>> able to do this unconditionally for all REC entries, to cover ourselves
>>> from missing out other cases. The RMM is in charge of taking the
>>> appropriate action anyways to copy the results back.
>>>
>>> Suzuki
>>>
>>>> +
>>>> +    if (kvm_realm_state(vcpu->kvm) != REALM_STATE_ACTIVE)
>>>> +        return -EINVAL;
>>>> +
>>>> +    return rmi_rec_enter(virt_to_phys(rec->rec_page),
>>>> +                 virt_to_phys(rec->run));
>>>> +}
>>>> +
>>>>    static void free_rec_aux(struct page **aux_pages,
>>>>                 unsigned int num_aux)
>>>>    {
>>
>