Re: [PATCH v5 18/20] RISC-V: KVM: Add SBI v0.1 support

Anup Patel <anup@xxxxxxxxxxxxxx> · Fri, 23 Aug 2019 17:58:57 +0530

On Fri, Aug 23, 2019 at 5:50 PM Alexander Graf <graf@xxxxxxxxxx> wrote:
>
>
>
> On 23.08.19 14:00, Anup Patel wrote:
> > On Fri, Aug 23, 2019 at 5:09 PM Graf (AWS), Alexander <graf@xxxxxxxxxx> wrote:
> >>
> >>
> >>
> >>> Am 23.08.2019 um 13:18 schrieb Anup Patel <anup@xxxxxxxxxxxxxx>:
> >>>
> >>>> On Fri, Aug 23, 2019 at 1:34 PM Alexander Graf <graf@xxxxxxxxxx> wrote:
> >>>>
> >>>>> On 22.08.19 10:46, Anup Patel wrote:
> >>>>> From: Atish Patra <atish.patra@xxxxxxx>
> >>>>>
> >>>>> The KVM host kernel running in HS-mode needs to handle SBI calls coming
> >>>>> from guest kernel running in VS-mode.
> >>>>>
> >>>>> This patch adds SBI v0.1 support in KVM RISC-V. All the SBI calls are
> >>>>> implemented correctly except remote tlb flushes. For remote TLB flushes,
> >>>>> we are doing full TLB flush and this will be optimized in future.
> >>>>>
> >>>>> Signed-off-by: Atish Patra <atish.patra@xxxxxxx>
> >>>>> Signed-off-by: Anup Patel <anup.patel@xxxxxxx>
> >>>>> Acked-by: Paolo Bonzini <pbonzini@xxxxxxxxxx>
> >>>>> Reviewed-by: Paolo Bonzini <pbonzini@xxxxxxxxxx>
> >>>>> ---
> >>>>>   arch/riscv/include/asm/kvm_host.h |   2 +
> >>>>>   arch/riscv/kvm/Makefile           |   2 +-
> >>>>>   arch/riscv/kvm/vcpu_exit.c        |   3 +
> >>>>>   arch/riscv/kvm/vcpu_sbi.c         | 119 ++++++++++++++++++++++++++++++
> >>>>>   4 files changed, 125 insertions(+), 1 deletion(-)
> >>>>>   create mode 100644 arch/riscv/kvm/vcpu_sbi.c
> >>>>>
> >>>>> diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h
> >>>>> index 2af3a179c08e..0b1eceaef59f 100644
> >>>>> --- a/arch/riscv/include/asm/kvm_host.h
> >>>>> +++ b/arch/riscv/include/asm/kvm_host.h
> >>>>> @@ -241,4 +241,6 @@ bool kvm_riscv_vcpu_has_interrupt(struct kvm_vcpu *vcpu);
> >>>>>   void kvm_riscv_vcpu_power_off(struct kvm_vcpu *vcpu);
> >>>>>   void kvm_riscv_vcpu_power_on(struct kvm_vcpu *vcpu);
> >>>>>
> >>>>> +int kvm_riscv_vcpu_sbi_ecall(struct kvm_vcpu *vcpu);
> >>>>> +
> >>>>>   #endif /* __RISCV_KVM_HOST_H__ */
> >>>>> diff --git a/arch/riscv/kvm/Makefile b/arch/riscv/kvm/Makefile
> >>>>> index 3e0c7558320d..b56dc1650d2c 100644
> >>>>> --- a/arch/riscv/kvm/Makefile
> >>>>> +++ b/arch/riscv/kvm/Makefile
> >>>>> @@ -9,6 +9,6 @@ ccflags-y := -Ivirt/kvm -Iarch/riscv/kvm
> >>>>>   kvm-objs := $(common-objs-y)
> >>>>>
> >>>>>   kvm-objs += main.o vm.o vmid.o tlb.o mmu.o
> >>>>> -kvm-objs += vcpu.o vcpu_exit.o vcpu_switch.o vcpu_timer.o
> >>>>> +kvm-objs += vcpu.o vcpu_exit.o vcpu_switch.o vcpu_timer.o vcpu_sbi.o
> >>>>>
> >>>>>   obj-$(CONFIG_KVM)   += kvm.o
> >>>>> diff --git a/arch/riscv/kvm/vcpu_exit.c b/arch/riscv/kvm/vcpu_exit.c
> >>>>> index fbc04fe335ad..87b83fcf9a14 100644
> >>>>> --- a/arch/riscv/kvm/vcpu_exit.c
> >>>>> +++ b/arch/riscv/kvm/vcpu_exit.c
> >>>>> @@ -534,6 +534,9 @@ int kvm_riscv_vcpu_exit(struct kvm_vcpu *vcpu, struct kvm_run *run,
> >>>>>                   (vcpu->arch.guest_context.hstatus & HSTATUS_STL))
> >>>>>                       ret = stage2_page_fault(vcpu, run, scause, stval);
> >>>>>               break;
> >>>>> +     case EXC_SUPERVISOR_SYSCALL:
> >>>>> +             if (vcpu->arch.guest_context.hstatus & HSTATUS_SPV)
> >>>>> +                     ret = kvm_riscv_vcpu_sbi_ecall(vcpu);
> >>>>>       default:
> >>>>>               break;
> >>>>>       };
> >>>>> diff --git a/arch/riscv/kvm/vcpu_sbi.c b/arch/riscv/kvm/vcpu_sbi.c
> >>>>> new file mode 100644
> >>>>> index 000000000000..5793202eb514
> >>>>> --- /dev/null
> >>>>> +++ b/arch/riscv/kvm/vcpu_sbi.c
> >>>>> @@ -0,0 +1,119 @@
> >>>>> +// SPDX-License-Identifier: GPL-2.0
> >>>>> +/**
> >>>>> + * Copyright (c) 2019 Western Digital Corporation or its affiliates.
> >>>>> + *
> >>>>> + * Authors:
> >>>>> + *     Atish Patra <atish.patra@xxxxxxx>
> >>>>> + */
> >>>>> +
> >>>>> +#include <linux/errno.h>
> >>>>> +#include <linux/err.h>
> >>>>> +#include <linux/kvm_host.h>
> >>>>> +#include <asm/csr.h>
> >>>>> +#include <asm/kvm_vcpu_timer.h>
> >>>>> +
> >>>>> +#define SBI_VERSION_MAJOR                    0
> >>>>> +#define SBI_VERSION_MINOR                    1
> >>>>> +
> >>>>> +/* TODO: Handle traps due to unpriv load and redirect it back to VS-mode */
> >>>>
> >>>> Ugh, another one of those? Can't you just figure out a way to recover
> >>>> from the page fault? Also, you want to combine this with the instruction
> >>>> load logic, so that we have a single place that guest address space
> >>>> reads go through.
> >>>
> >>> Walking Guest page table would be more expensive compared to implementing
> >>> a trap handling mechanism.
> >>>
> >>> We will be adding trap handling mechanism for reading instruction and reading
> >>> load.
> >>>
> >>> Both these operations are different in following ways:
> >>> 1. RISC-V instructions are variable length. We get to know exact instruction
> >>>     length only after reading first 16bits
> >>> 2. We need to set VSSTATUS.MXR bit when reading instruction for
> >>>     execute-only Guest pages.
> >>
> >> Yup, sounds like you could solve that with a trivial if() based on "read instruction" or not, no? If you want to, feel free to provide short versions that do only read ins/data, but I would really like to see the whole "data reads become guest reads" magic to be funneled through a single function (in C, can be inline unrolled in asm of course)
> >>
> >>>
> >>>>
> >>>>> +static unsigned long kvm_sbi_unpriv_load(const unsigned long *addr,
> >>>>> +                                      struct kvm_vcpu *vcpu)
> >>>>> +{
> >>>>> +     unsigned long flags, val;
> >>>>> +     unsigned long __hstatus, __sstatus;
> >>>>> +
> >>>>> +     local_irq_save(flags);
> >>>>> +     __hstatus = csr_read(CSR_HSTATUS);
> >>>>> +     __sstatus = csr_read(CSR_SSTATUS);
> >>>>> +     csr_write(CSR_HSTATUS, vcpu->arch.guest_context.hstatus | HSTATUS_SPRV);
> >>>>> +     csr_write(CSR_SSTATUS, vcpu->arch.guest_context.sstatus);
> >>>>> +     val = *addr;
> >>>>> +     csr_write(CSR_HSTATUS, __hstatus);
> >>>>> +     csr_write(CSR_SSTATUS, __sstatus);
> >>>>> +     local_irq_restore(flags);
> >>>>> +
> >>>>> +     return val;
> >>>>> +}
> >>>>> +
> >>>>> +static void kvm_sbi_system_shutdown(struct kvm_vcpu *vcpu, u32 type)
> >>>>> +{
> >>>>> +     int i;
> >>>>> +     struct kvm_vcpu *tmp;
> >>>>> +
> >>>>> +     kvm_for_each_vcpu(i, tmp, vcpu->kvm)
> >>>>> +             tmp->arch.power_off = true;
> >>>>> +     kvm_make_all_cpus_request(vcpu->kvm, KVM_REQ_SLEEP);
> >>>>> +
> >>>>> +     memset(&vcpu->run->system_event, 0, sizeof(vcpu->run->system_event));
> >>>>> +     vcpu->run->system_event.type = type;
> >>>>> +     vcpu->run->exit_reason = KVM_EXIT_SYSTEM_EVENT;
> >>>>> +}
> >>>>> +
> >>>>> +int kvm_riscv_vcpu_sbi_ecall(struct kvm_vcpu *vcpu)
> >>>>> +{
> >>>>> +     int ret = 1;
> >>>>> +     u64 next_cycle;
> >>>>> +     int vcpuid;
> >>>>> +     struct kvm_vcpu *remote_vcpu;
> >>>>> +     ulong dhart_mask;
> >>>>> +     struct kvm_cpu_context *cp = &vcpu->arch.guest_context;
> >>>>> +
> >>>>> +     if (!cp)
> >>>>> +             return -EINVAL;
> >>>>> +     switch (cp->a7) {
> >>>>> +     case SBI_SET_TIMER:
> >>>>> +#if __riscv_xlen == 32
> >>>>> +             next_cycle = ((u64)cp->a1 << 32) | (u64)cp->a0;
> >>>>> +#else
> >>>>> +             next_cycle = (u64)cp->a0;
> >>>>> +#endif
> >>>>> +             kvm_riscv_vcpu_timer_next_event(vcpu, next_cycle);
> >>>>
> >>>> Ah, this is where the timer set happens. I still don't understand how
> >>>> this takes the frequency bit into account?
> >>>
> >>> Explained it in PATCH17 comments.
> >>>
> >>>>
> >>>>> +             break;
> >>>>> +     case SBI_CONSOLE_PUTCHAR:
> >>>>> +             /* Not implemented */
> >>>>> +             cp->a0 = -ENOTSUPP;
> >>>>> +             break;
> >>>>> +     case SBI_CONSOLE_GETCHAR:
> >>>>> +             /* Not implemented */
> >>>>> +             cp->a0 = -ENOTSUPP;
> >>>>> +             break;
> >>>>
> >>>> These two should be covered by the default case.
> >>>
> >>> Sure, I will update.
> >>>
> >>>>
> >>>>> +     case SBI_CLEAR_IPI:
> >>>>> +             kvm_riscv_vcpu_unset_interrupt(vcpu, IRQ_S_SOFT);
> >>>>> +             break;
> >>>>> +     case SBI_SEND_IPI:
> >>>>> +             dhart_mask = kvm_sbi_unpriv_load((unsigned long *)cp->a0, vcpu);
> >>>>> +             for_each_set_bit(vcpuid, &dhart_mask, BITS_PER_LONG) {
> >>>>> +                     remote_vcpu = kvm_get_vcpu_by_id(vcpu->kvm, vcpuid);
> >>>>> +                     kvm_riscv_vcpu_set_interrupt(remote_vcpu, IRQ_S_SOFT);
> >>>>> +             }
> >>>>> +             break;
> >>>>> +     case SBI_SHUTDOWN:
> >>>>> +             kvm_sbi_system_shutdown(vcpu, KVM_SYSTEM_EVENT_SHUTDOWN);
> >>>>> +             ret = 0;
> >>>>> +             break;
> >>>>> +     case SBI_REMOTE_FENCE_I:
> >>>>> +             sbi_remote_fence_i(NULL);
> >>>>> +             break;
> >>>>> +     /*
> >>>>> +      * TODO: There should be a way to call remote hfence.bvma.
> >>>>> +      * Preferred method is now a SBI call. Until then, just flush
> >>>>> +      * all tlbs.
> >>>>> +      */
> >>>>> +     case SBI_REMOTE_SFENCE_VMA:
> >>>>> +             /*TODO: Parse vma range.*/
> >>>>> +             sbi_remote_sfence_vma(NULL, 0, 0);
> >>>>> +             break;
> >>>>> +     case SBI_REMOTE_SFENCE_VMA_ASID:
> >>>>> +             /*TODO: Parse vma range for given ASID */
> >>>>> +             sbi_remote_sfence_vma(NULL, 0, 0);
> >>>>> +             break;
> >>>>> +     default:
> >>>>> +             cp->a0 = ENOTSUPP;
> >>>>> +             break;
> >>>>
> >>>> Please just send unsupported SBI events into user space.
> >>>
> >>> For unsupported SBI calls, we should be returning error to the
> >>> Guest Linux so that do something about it. This is in accordance
> >>> with the SBI spec.
> >>
> >> That's up to user space (QEMU / kvmtool) to decide. If user space wants to implement the  console functions (like we do on s390), it should have the chance to do so.
> >
> > The SBI_CONSOLE_PUTCHAR and SBI_CONSOLE_GETCHAR are
> > for debugging only. These calls are deprecated in SBI v0.2 onwards
> > because we now have earlycon for early prints in Linux RISC-V.
> >
> > The RISC-V Guest will generally have it's own MMIO based UART
> > which will be the default console.
> >
> > Due to these reasons, we have not implemented these SBI calls.
>
> I'm not saying we should implement them. I'm saying we should leave a
> policy decision like that up to user space. By terminating the SBI in
> kernel space, you can not quickly debug something going wrong.
>
> > If we still want user-space to implement this then we will require
> > separate exit reasons and we are trying to avoid adding RISC-V
> > specific exit reasons/ioctls in KVM user-space ABI.
>
> Why?
>
> I had so many occasions where I would have loved to have user space
> exits for MSR access, SPR access, hypercalls, etc etc. It really makes
> life so much easier when you can quickly hack something up in user space
> rather than modify the kernel.
>
> > The absence of SBI_CONSOLE_PUTCHAR/GETCHAR certainly
> > does not block anyone in debugging Guest Linux because we have
> > earlycon support in Linux RISC-V.
>
> I'm not hung on on the console. What I'm trying to express is a general
> sentiment that terminating extensible hypervisor <-> guest interfaces in
> kvm is not a great idea. Some times we can't get around it (like on page
> tables), but some times we do. And this is a case where we could.
>
> At the end of the day this is your call though :).

I am not sure about user-space CSRs but having ability to route
unsupported SBI calls to user-space can be very useful.

For this series, we will continue to return error to Guest Linux for
unsupported SBI calls.

We will add unsupported SBI routing to user-space in next series.

Regards,
Anup

>
>
> Alex