Re: [RFC PATCH v1] KVM: arm64: Introduce KVM_CAP_ARM_SIGBUS_ON_SEA

Jiaqi Yan <jiaqiyan@xxxxxxxxxx> · Fri, 8 Nov 2024 13:18:50 -0800

On Sun, Nov 3, 2024 at 9:02 PM Jiaqi Yan <jiaqiyan@xxxxxxxxxx> wrote:
>
> Hi Marc, thanks for your quick response!
>
> On Fri, Nov 1, 2024 at 6:54 AM Marc Zyngier <maz@xxxxxxxxxx> wrote:
> >
> > On Thu, 31 Oct 2024 21:21:04 +0000,
> > Jiaqi Yan <jiaqiyan@xxxxxxxxxx> wrote:
> > >
> > > Currently KVM handles SEA in guest by injecting async SError into
> > > guest directly, bypassing VMM, usually results in guest kernel panic.
> > >
> > > One major situation of guest SEA is when vCPU consumes uncorrectable
> > > memory error on the physical memory. Although SError and guest kernel
> > > panic effectively stops the propagation of corrupted memory, it is not
> > > easy for VMM and guest to recover from memory error in a more graceful
> > > manner.
> > >
> > > Alternatively KVM can send a SIGBUS BUS_OBJERR to VMM/vCPU, just like
> > > how core kernel signals SIGBUS BUS_OBJERR to the poison consuming
> > > thread.
> >
> > Can you elaborate on why the delivery of a signal is preferable to
> > simply exiting back to userspace with a description of the error?
> > Signals are usually not generated by KVM, and are a pretty twisted way
> > to generate an exit.
>
> A couple of reasons. First, my intuition is that KVM and core kernel
> (do_sea) should have aligned behavior when APEI failed to claim the
> SEA. Second, if we only talk about SEA caused by memory poison
> consumption, both arm64 and x86 KVM already send SIGBUS to VMM/vCPU
> thread (kvm_send_hwpoison_signal) to signal hardware memory failure,
> although the situation is slightly different here, where we have a
> hardware event, versus a HWPoison flag check or VM_FAULT_HWPOISON
> returned. But from VMM/vCPU's perspective, hardware event or software
> level VM_FAULT_HWPOISON, it would be nice if it can react to just the
> same event, the SIGBUS signal.
>
> And there is another reason around your comment on arm64_notify_die.
>
> By "exiting back to userspace with a description of the error", are
> you suggesting KVM_EXIT_MEMORY_FAULT? If so, we may need to add a new
> flag to tell VMM the error is hardware memory poison, which could be
> KVM_MEMORY_EXIT_FLAG_USERFAULT[1] if we don't want a specific one (but
> I think a specific flag for hwpoison is probably clearer).
>
> >
> > > In addition to the benifit that KVM's handling for SEA becomes aligned
> > > with core kernel behavior
> > > - The blast radius in VM can be limited to only the consuming thread
> > >   in guest, instead of entire guest kernel, unless the consumption is
> > >   from guest kernel.
> > > - VMM now has the chance to do its duties to stop the VM from repeatedly
> > >   consuming corrupted data. For example, VMM can unmap the guest page
> > >   from stage-2 table to intercept forseen memory poison consumption,
> >
> > Not quite. The VMM doesn't manage stage-2. It can remove the page from
> > the VMA if it has it mapped, but that's it. The kernel deals with S2.
>
> I should probably not mention the implementation, "unmap from S2".
> What is needed for preventing repeated consuming memory poison is
> simply preventing guest access to certain memory pages. There is a
> work in progress KVM API [1] by my colleague James.
>
> [1] https://lpc.events/event/18/contributions/1757/attachments/1442/3073/LPC_%20KVM%20Userfault.pdf
>
> >
> > Which brings me to the next subject: when the kernel unmaps the page
> > at S2, it is likely to use CMOs. Can these CMOs create RAS error
> > themselves?
>
> I assume CMO here means writing dirty cache lines to memory. Writing
> something new to a poisoned cacheline usually won't cause RAS error.
> Notifying memory poison usually is delayed to a memory load
> transaction.
>
> >
> > >   and for every consumption injects SEA to EL1 with synthetic memory
> > >   error CPER.
> >
> > Why do we need to involve ACPI here? I would expect the production of
> > an architected error record instead. Or at least be given the option.
>
> Sorry, I was just mentioning a specific VMM's implementation. There
> are definitely multiple options (Machine Check MSRs vs CPER for error
> description data, SEA vs SDEI vs SPI for notification mechanisms) for
> how VMM involves the guest to handle memory error. My preference is:
> VMM populates CPER in guest HEST when VMM instructs KVM to inject
> i/dabt to the guest.
>
> And a word about "be given the option": I think when VMM receives
> SIGBUS with si_addr=faulted/poisoned HVA, it's got all these options,
> like using the si_addr to construct CPER with poisoned guest physical
> address, or mci_address MSR.
>
> >
> > > Introduce a new KVM ARM capability KVM_CAP_ARM_SIGBUS_ON_SEA. VMM
> > > can opt in this new capability if it prefers SIGBUS than SError
> > > injection during VM init. Now SEA handling in KVM works as follows:
> > > 1. Delegate to APEI/GHES to see if SEA can be claimed by them.
> > > 2. If APEI failed to claim the SEA and KVM_CAP_ARM_SIGBUS_ON_SEA is
> > >    enabled for the VM, and the SEA is NOT about translation table,
> > >    send SIGBUS BUS_OBJERR signal with host virtual address.
> >
> > And what if it is? S1 PTs are backed by userspace memory, like
> > anything else. I don't think we should have a different treatment of
> > those, because the HW wouldn't treat them differently either.
>
> You are talking about ESR_ELx_FSC_SEA_TTW(1), or
> ESR_ELx_FSC_SEA_TTW(0), right? I think you are right, S1 is no
> difference.
>
> But I think we want to make an exception for SEA about S2 PTs.
>
> >
> > > 3. Otherwise directly inject async SError to guest.
> > >
> > > Tested on a machine running Siryn AmpereOne processor. A in-house VMM
> > > that opts in KVM_CAP_ARM_SIGBUS_ON_SEA started a VM. A dummy application
> > > in VM allocated some memory buffer. The test used EINJ to inject an
> > > uncorrectable recoverable memory error at a page in the allocated memory
> > > buffer. The dummy application then consumed the memory error. Some hack
> > > was done to make core kernel's memory_failure triggered by poison
> > > generation to fail, so KVM had to deal with the SEA guest abort due to
> > > poison consumption. vCPU thread in VMM received SIGBUS BUS_OBJERR with
> > > valid host virtual address of the poisoned page. VMM then injected a SEA
> > > into guest using KVM_SET_VCPU_EVENTS with ext_dabt_pending=1. At last
> > > the dummy application in guest was killed by SIGBUS BUS_OBJERR, while the
> > > guest survived and continued to run.
> > >
> > > Signed-off-by: Jiaqi Yan <jiaqiyan@xxxxxxxxxx>
> > > ---
> > >  arch/arm64/include/asm/kvm_host.h |  2 +
> > >  arch/arm64/include/asm/kvm_ras.h  | 20 ++++----
> > >  arch/arm64/kvm/Makefile           |  2 +-
> > >  arch/arm64/kvm/arm.c              |  5 ++
> > >  arch/arm64/kvm/kvm_ras.c          | 77 +++++++++++++++++++++++++++++++
> > >  arch/arm64/kvm/mmu.c              |  8 +---
> > >  include/uapi/linux/kvm.h          |  1 +
> > >  7 files changed, 98 insertions(+), 17 deletions(-)
> > >  create mode 100644 arch/arm64/kvm/kvm_ras.c
> > >
> > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > > index bf64fed9820ea..eb37a2489411a 100644
> > > --- a/arch/arm64/include/asm/kvm_host.h
> > > +++ b/arch/arm64/include/asm/kvm_host.h
> > > @@ -334,6 +334,8 @@ struct kvm_arch {
> > >       /* Fine-Grained UNDEF initialised */
> > >  #define KVM_ARCH_FLAG_FGU_INITIALIZED                        8
> > >       unsigned long flags;
> > > +     /* Instead of injecting SError into guest, SIGBUS VMM */
> > > +#define KVM_ARCH_FLAG_SIGBUS_ON_SEA                  9
> >
> > nit: why do you put this definition out of sequence (below 'flags')?
>
> Ah, I will move it on top of flags.
>
> >
> > >
> > >       /* VM-wide vCPU feature set */
> > >       DECLARE_BITMAP(vcpu_features, KVM_VCPU_MAX_FEATURES);
> > > diff --git a/arch/arm64/include/asm/kvm_ras.h b/arch/arm64/include/asm/kvm_ras.h
> > > index 87e10d9a635b5..4bb7a424e3f6c 100644
> > > --- a/arch/arm64/include/asm/kvm_ras.h
> > > +++ b/arch/arm64/include/asm/kvm_ras.h
> > > @@ -11,15 +11,17 @@
> > >  #include <asm/acpi.h>
> > >
> > >  /*
> > > - * Was this synchronous external abort a RAS notification?
> > > - * Returns '0' for errors handled by some RAS subsystem, or -ENOENT.
> > > + * Handle synchronous external abort (SEA) in the following order:
> > > + * 1. Delegate to APEI/GHES to see if SEA can be claimed by them. If so, we
> > > + *    are all done.
> > > + * 2. If userspace opts in KVM_CAP_ARM_SIGBUS_ON_SEA, and if the SEA is NOT
> > > + *    about translation table, send SIGBUS
> > > + *    - si_code is BUS_OBJERR.
> > > + *    - si_addr will be 0 when accurate HVA is unavailable.
> > > + * 3. Otherwise, directly inject an async SError to guest.
> > > + *
> > > + * Note this applies to both ESR_ELx_EC_IABT_* and ESR_ELx_EC_DABT_*.
> > >   */
> > > -static inline int kvm_handle_guest_sea(phys_addr_t addr, u64 esr)
> > > -{
> > > -     /* apei_claim_sea(NULL) expects to mask interrupts itself */
> > > -     lockdep_assert_irqs_enabled();
> > > -
> > > -     return apei_claim_sea(NULL);
> > > -}
> > > +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu);
> > >
> > >  #endif /* __ARM64_KVM_RAS_H__ */
> > > diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
> > > index 3cf7adb2b5038..c4a3a6d4870e6 100644
> > > --- a/arch/arm64/kvm/Makefile
> > > +++ b/arch/arm64/kvm/Makefile
> > > @@ -23,7 +23,7 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
> > >        vgic/vgic-v3.o vgic/vgic-v4.o \
> > >        vgic/vgic-mmio.o vgic/vgic-mmio-v2.o \
> > >        vgic/vgic-mmio-v3.o vgic/vgic-kvm-device.o \
> > > -      vgic/vgic-its.o vgic/vgic-debug.o
> > > +      vgic/vgic-its.o vgic/vgic-debug.o kvm_ras.o
> > >
> > >  kvm-$(CONFIG_HW_PERF_EVENTS)  += pmu-emul.o pmu.o
> > >  kvm-$(CONFIG_ARM64_PTR_AUTH)  += pauth.o
> > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > > index 48cafb65d6acf..bb97ad678dbec 100644
> > > --- a/arch/arm64/kvm/arm.c
> > > +++ b/arch/arm64/kvm/arm.c
> > > @@ -151,6 +151,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> > >               }
> > >               mutex_unlock(&kvm->slots_lock);
> > >               break;
> > > +     case KVM_CAP_ARM_SIGBUS_ON_SEA:
> > > +             r = 0;
> > > +             set_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA, &kvm->arch.flags);
> >
> > Shouldn't this be somehow gated on the VM being RAS aware?
>
> Do you mean a CAP that VMM can tell KVM the VM guest has RAS ability?
> I don't know if there is one for arm64. On x86 there is
> KVM_X86_SETUP_MCE. KVM_CAP_ARM_INJECT_EXT_DABT maybe a revelant one
> but I don't think it is exactly the one for "RAS ability".

Hi Marc,

There is also a kernel config called ARM64_RAS_EXTN but I believe it
is for the host CPU, not about VM's RAS.

But taking Oliver's opinion into account, I think we want to remove
KVM_CAP_ARM_SIGBUS_ON_SEA. Wonder what your thoughts are.

Thanks,
Jiaqi

>
> >
> > > +             break;
> > >       default:
> > >               break;
> > >       }
> > > @@ -339,6 +343,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> > >       case KVM_CAP_ARM_SYSTEM_SUSPEND:
> > >       case KVM_CAP_IRQFD_RESAMPLE:
> > >       case KVM_CAP_COUNTER_OFFSET:
> > > +     case KVM_CAP_ARM_SIGBUS_ON_SEA:
> > >               r = 1;
> > >               break;
> > >       case KVM_CAP_SET_GUEST_DEBUG2:
> > > diff --git a/arch/arm64/kvm/kvm_ras.c b/arch/arm64/kvm/kvm_ras.c
> > > new file mode 100644
> > > index 0000000000000..3225462bcbcda
> > > --- /dev/null
> > > +++ b/arch/arm64/kvm/kvm_ras.c
> > > @@ -0,0 +1,77 @@
> > > +// SPDX-License-Identifier: GPL-2.0-only
> > > +
> > > +#include <linux/bitops.h>
> > > +#include <linux/kvm_host.h>
> > > +
> > > +#include <asm/kvm_emulate.h>
> > > +#include <asm/kvm_ras.h>
> > > +#include <asm/system_misc.h>
> > > +
> > > +/*
> > > + * For synchrnous external instruction or data abort, not on translation
> > > + * table walk or hardware update of translation table, is FAR_EL2 valid?
> > > + */
> > > +static inline bool kvm_vcpu_sea_far_valid(const struct kvm_vcpu *vcpu)
> > > +{
> > > +     return !(vcpu->arch.fault.esr_el2 & ESR_ELx_FnV);
> > > +}
> > > +
> > > +/*
> > > + * Was this synchronous external abort a RAS notification?
> > > + * Returns '0' for errors handled by some RAS subsystem, or -ENOENT.
> > > + */
> > > +static int kvm_delegate_guest_sea(phys_addr_t addr, u64 esr)
> > > +{
> > > +     /* apei_claim_sea(NULL) expects to mask interrupts itself */
> > > +     lockdep_assert_irqs_enabled();
> > > +     return apei_claim_sea(NULL);
> > > +}
> > > +
> > > +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
> > > +{
> > > +     bool sigbus_on_sea;
> > > +     int idx;
> > > +     u64 vcpu_esr = kvm_vcpu_get_esr(vcpu);
> > > +     u8 fsc = kvm_vcpu_trap_get_fault(vcpu);
> > > +     phys_addr_t fault_ipa = kvm_vcpu_get_fault_ipa(vcpu);
> > > +     gfn_t gfn = fault_ipa >> PAGE_SHIFT;
> > > +     /* When FnV is set, send 0 as si_addr like what do_sea() does. */
> > > +     unsigned long hva = 0UL;
> > > +
> > > +     /*
> > > +      * For RAS the host kernel may handle this abort.
> > > +      * There is no need to SIGBUS VMM, or pass the error into the guest.
> > > +      */
> > > +     if (kvm_delegate_guest_sea(fault_ipa, vcpu_esr) == 0)
> > > +             return;
> > > +
> > > +     sigbus_on_sea = test_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA,
> > > +                              &(vcpu->kvm->arch.flags));
> > > +
> > > +     /*
> > > +      * In addition to userspace opt-in, SIGBUS only makes sense if the
> > > +      * abort is NOT about translation table walk and NOT about hardware
> > > +      * update of translation table.
> > > +      */
> > > +     sigbus_on_sea &= (fsc == ESR_ELx_FSC_EXTABT || fsc == ESR_ELx_FSC_SECC);
> > > +
> > > +     /* Pass the error directly into the guest. */
> > > +     if (!sigbus_on_sea) {
> > > +             kvm_inject_vabt(vcpu);
> > > +             return;
> > > +     }
> > > +
> > > +     if (kvm_vcpu_sea_far_valid(vcpu)) {
> > > +             idx = srcu_read_lock(&vcpu->kvm->srcu);
> > > +             hva = gfn_to_hva(vcpu->kvm, gfn);
> > > +             srcu_read_unlock(&vcpu->kvm->srcu, idx);
> > > +     }
> > > +
> > > +     /*
> > > +      * Send a SIGBUS BUS_OBJERR to vCPU thread (the userspace thread that
> > > +      * runs KVM_RUN) or VMM, which aligns with what host kernel do_sea()
> > > +      * does if apei_claim_sea() fails.
> > > +      */
> > > +     arm64_notify_die("synchronous external abort",
> > > +                      current_pt_regs(), SIGBUS, BUS_OBJERR, hva, vcpu_esr);
> >
> > This is the point where I really think we should simply trigger an
> > exit with all that syndrome information stashed in kvm_run, like any
> > other event requiring userspace help.
>
> Ah, there is another reason SIGBUS is better than kvm exit: "It is a
> programming error to set ext_dabt_pending after an exit which was not
> either KVM_EXIT_MMIO or KVM_EXIT_ARM_NISV", from
> Documentation/virt/kvm/api.rst. So if VMM is allowed to inject data
> abort to guest, at least current documentation doesn't suggest kvm
> exit is feasible.
>
> >
> > Also: where is the documentation?
>
> Once I get more positive feedback and send out PATCH instead of RFC, I
> can add to Documentation/virt/kvm/api.rst.
>
> >
> > Thanks,
> >
> >         M.
> >
> > --
> > Without deviation from the norm, progress is not possible.