Re: [RFC PATCH v1] KVM: arm64: Introduce KVM_CAP_ARM_SIGBUS_ON_SEA

Jiaqi Yan <jiaqiyan@xxxxxxxxxx> · Tue, 19 Nov 2024 15:57:46 -0800

Thanks for your comments, Marc.

While continuing the discussion here, I think it may make sense I sent
out a V2 with 2 major updates:
- add documentation for new SIGBUS feature
- remove KVM_CAP_ARM_SIGBUS_ON_SEA

On Tue, Nov 12, 2024 at 12:51 AM Marc Zyngier <maz@xxxxxxxxxx> wrote:
>
> On Mon, 04 Nov 2024 05:02:03 +0000,
> Jiaqi Yan <jiaqiyan@xxxxxxxxxx> wrote:
> >
> > Hi Marc, thanks for your quick response!
> >
> > On Fri, Nov 1, 2024 at 6:54 AM Marc Zyngier <maz@xxxxxxxxxx> wrote:
> > >
> > > On Thu, 31 Oct 2024 21:21:04 +0000,
> > > Jiaqi Yan <jiaqiyan@xxxxxxxxxx> wrote:
> > > >
> > > > Currently KVM handles SEA in guest by injecting async SError into
> > > > guest directly, bypassing VMM, usually results in guest kernel panic.
> > > >
> > > > One major situation of guest SEA is when vCPU consumes uncorrectable
> > > > memory error on the physical memory. Although SError and guest kernel
> > > > panic effectively stops the propagation of corrupted memory, it is not
> > > > easy for VMM and guest to recover from memory error in a more graceful
> > > > manner.
> > > >
> > > > Alternatively KVM can send a SIGBUS BUS_OBJERR to VMM/vCPU, just like
> > > > how core kernel signals SIGBUS BUS_OBJERR to the poison consuming
> > > > thread.
> > >
> > > Can you elaborate on why the delivery of a signal is preferable to
> > > simply exiting back to userspace with a description of the error?
> > > Signals are usually not generated by KVM, and are a pretty twisted way
> > > to generate an exit.
> >
> > A couple of reasons. First, my intuition is that KVM and core kernel
> > (do_sea) should have aligned behavior when APEI failed to claim the
> > SEA. Second, if we only talk about SEA caused by memory poison
> > consumption, both arm64 and x86 KVM already send SIGBUS to VMM/vCPU
> > thread (kvm_send_hwpoison_signal) to signal hardware memory failure,
> > although the situation is slightly different here, where we have a
> > hardware event, versus a HWPoison flag check or VM_FAULT_HWPOISON
> > returned. But from VMM/vCPU's perspective, hardware event or software
> > level VM_FAULT_HWPOISON, it would be nice if it can react to just the
> > same event, the SIGBUS signal.
> >
> > And there is another reason around your comment on arm64_notify_die.
> >
> > By "exiting back to userspace with a description of the error", are
> > you suggesting KVM_EXIT_MEMORY_FAULT? If so, we may need to add a new
> > flag to tell VMM the error is hardware memory poison, which could be
> > KVM_MEMORY_EXIT_FLAG_USERFAULT[1] if we don't want a specific one (but
> > I think a specific flag for hwpoison is probably clearer).
>
> KVM_MEMORY_EXIT_FLAG_USERFAULT is not upstream, and it isn't clear to
> me if or when it will make it there. But the idea is indeed to treat a
> RAS error as an exit reason.

Let me try again on preferring SIGBUS to KVM_EXIT. At least for memory error.

Let's say we should really use some KVM_EXIT to describe RAS error
(e.g. KVM_EXIT_MEMORY_FAULT for memory error). Does that mean we need
to change these existing SIGBUSes [1,2] in KVM into KVM_EXIT? Doesn't
it break VMM [3,4] that expects SIGBUS for memory errors?

The difference between [1,2] and the scenario dealt by this patch is
just whether APEI/GHES/memory_failure succeeded in recovering a memory
error or not. Is the idea that the uAPI will be different depending on
successful recovery or not?

[1] https://elixir.bootlin.com/linux/v6.12-rc7/source/arch/arm64/kvm/mmu.c#L1576
[2] https://elixir.bootlin.com/linux/v6.12-rc7/source/arch/x86/kvm/mmu/mmu.c#L3301
[3] https://github.com/qemu/qemu/blob/master/target/i386/kvm/kvm.c#L738
[4] https://github.com/qemu/qemu/blob/master/target/arm/kvm.c#L2365

>
> >
> > >
> > > > In addition to the benifit that KVM's handling for SEA becomes aligned
> > > > with core kernel behavior
> > > > - The blast radius in VM can be limited to only the consuming thread
> > > >   in guest, instead of entire guest kernel, unless the consumption is
> > > >   from guest kernel.
> > > > - VMM now has the chance to do its duties to stop the VM from repeatedly
> > > >   consuming corrupted data. For example, VMM can unmap the guest page
> > > >   from stage-2 table to intercept forseen memory poison consumption,
> > >
> > > Not quite. The VMM doesn't manage stage-2. It can remove the page from
> > > the VMA if it has it mapped, but that's it. The kernel deals with S2.
> >
> > I should probably not mention the implementation, "unmap from S2".
> > What is needed for preventing repeated consuming memory poison is
> > simply preventing guest access to certain memory pages. There is a
> > work in progress KVM API [1] by my colleague James.
> >
> > [1] https://lpc.events/event/18/contributions/1757/attachments/1442/3073/LPC_%20KVM%20Userfault.pdf
> >
> > >
> > > Which brings me to the next subject: when the kernel unmaps the page
> > > at S2, it is likely to use CMOs. Can these CMOs create RAS error
> > > themselves?
> >
> > I assume CMO here means writing dirty cache lines to memory. Writing
> > something new to a poisoned cacheline usually won't cause RAS error.
> > Notifying memory poison usually is delayed to a memory load
> > transaction.
>
> I don't see anything of the sort in the spec. At least R_VGXBJ calls
> out that:
>
> <quote>
> An error is propagated by the PE by one or more of the following
> occurring that would not have been permitted to occur had the fault
> not been activated:
>
> * Consumption of the corrupt value by any instruction, propagating the
> error to the target(s) of the instruction.  This includes:
>
>   - A store of a corrupt value.
> [...]
> </quote>
>
> So while implementations /may/ only act and deliver an error on a
> load, that's only an implementation detail.

Thanks for finding out from the spec, lesson learned. Back to your
original question and assume error will be propagated by the PE, is
your concern that "unmap from S2" will cause a SEA on TTW or hardware
update of translation table level 2 (i.e. DFSC=0b010110), which
renders the system into a worse fault state?

>
> >
> > >
> > > >   and for every consumption injects SEA to EL1 with synthetic memory
> > > >   error CPER.
> > >
> > > Why do we need to involve ACPI here? I would expect the production of
> > > an architected error record instead. Or at least be given the option.
> >
> > Sorry, I was just mentioning a specific VMM's implementation. There
> > are definitely multiple options (Machine Check MSRs vs CPER for error
> > description data, SEA vs SDEI vs SPI for notification mechanisms) for
> > how VMM involves the guest to handle memory error. My preference is:
> > VMM populates CPER in guest HEST when VMM instructs KVM to inject
> > i/dabt to the guest.
> >
> > And a word about "be given the option": I think when VMM receives
> > SIGBUS with si_addr=faulted/poisoned HVA, it's got all these options,
> > like using the si_addr to construct CPER with poisoned guest physical
> > address, or mci_address MSR.
> >
> > >
> > > > Introduce a new KVM ARM capability KVM_CAP_ARM_SIGBUS_ON_SEA. VMM
> > > > can opt in this new capability if it prefers SIGBUS than SError
> > > > injection during VM init. Now SEA handling in KVM works as follows:
> > > > 1. Delegate to APEI/GHES to see if SEA can be claimed by them.
> > > > 2. If APEI failed to claim the SEA and KVM_CAP_ARM_SIGBUS_ON_SEA is
> > > >    enabled for the VM, and the SEA is NOT about translation table,
> > > >    send SIGBUS BUS_OBJERR signal with host virtual address.
> > >
> > > And what if it is? S1 PTs are backed by userspace memory, like
> > > anything else. I don't think we should have a different treatment of
> > > those, because the HW wouldn't treat them differently either.
> >
> > You are talking about ESR_ELx_FSC_SEA_TTW(1), or
> > ESR_ELx_FSC_SEA_TTW(0), right? I think you are right, S1 is no
> > difference.
> >
> > But I think we want to make an exception for SEA about S2 PTs.
>
> S2 PTs are owned by the hypervisor, not by userspace, so I'm fine not
> reporting those through this mechanism.

Ack, will update in the v2.

>
> >
> > >
> > > > 3. Otherwise directly inject async SError to guest.
> > > >
> > > > Tested on a machine running Siryn AmpereOne processor. A in-house VMM
> > > > that opts in KVM_CAP_ARM_SIGBUS_ON_SEA started a VM. A dummy application
> > > > in VM allocated some memory buffer. The test used EINJ to inject an
> > > > uncorrectable recoverable memory error at a page in the allocated memory
> > > > buffer. The dummy application then consumed the memory error. Some hack
> > > > was done to make core kernel's memory_failure triggered by poison
> > > > generation to fail, so KVM had to deal with the SEA guest abort due to
> > > > poison consumption. vCPU thread in VMM received SIGBUS BUS_OBJERR with
> > > > valid host virtual address of the poisoned page. VMM then injected a SEA
> > > > into guest using KVM_SET_VCPU_EVENTS with ext_dabt_pending=1. At last
> > > > the dummy application in guest was killed by SIGBUS BUS_OBJERR, while the
> > > > guest survived and continued to run.
> > > >
> > > > Signed-off-by: Jiaqi Yan <jiaqiyan@xxxxxxxxxx>
> > > > ---
> > > >  arch/arm64/include/asm/kvm_host.h |  2 +
> > > >  arch/arm64/include/asm/kvm_ras.h  | 20 ++++----
> > > >  arch/arm64/kvm/Makefile           |  2 +-
> > > >  arch/arm64/kvm/arm.c              |  5 ++
> > > >  arch/arm64/kvm/kvm_ras.c          | 77 +++++++++++++++++++++++++++++++
> > > >  arch/arm64/kvm/mmu.c              |  8 +---
> > > >  include/uapi/linux/kvm.h          |  1 +
> > > >  7 files changed, 98 insertions(+), 17 deletions(-)
> > > >  create mode 100644 arch/arm64/kvm/kvm_ras.c
> > > >
> > > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > > > index bf64fed9820ea..eb37a2489411a 100644
> > > > --- a/arch/arm64/include/asm/kvm_host.h
> > > > +++ b/arch/arm64/include/asm/kvm_host.h
> > > > @@ -334,6 +334,8 @@ struct kvm_arch {
> > > >       /* Fine-Grained UNDEF initialised */
> > > >  #define KVM_ARCH_FLAG_FGU_INITIALIZED                        8
> > > >       unsigned long flags;
> > > > +     /* Instead of injecting SError into guest, SIGBUS VMM */
> > > > +#define KVM_ARCH_FLAG_SIGBUS_ON_SEA                  9
> > >
> > > nit: why do you put this definition out of sequence (below 'flags')?
> >
> > Ah, I will move it on top of flags.
> >
> > >
> > > >
> > > >       /* VM-wide vCPU feature set */
> > > >       DECLARE_BITMAP(vcpu_features, KVM_VCPU_MAX_FEATURES);
> > > > diff --git a/arch/arm64/include/asm/kvm_ras.h b/arch/arm64/include/asm/kvm_ras.h
> > > > index 87e10d9a635b5..4bb7a424e3f6c 100644
> > > > --- a/arch/arm64/include/asm/kvm_ras.h
> > > > +++ b/arch/arm64/include/asm/kvm_ras.h
> > > > @@ -11,15 +11,17 @@
> > > >  #include <asm/acpi.h>
> > > >
> > > >  /*
> > > > - * Was this synchronous external abort a RAS notification?
> > > > - * Returns '0' for errors handled by some RAS subsystem, or -ENOENT.
> > > > + * Handle synchronous external abort (SEA) in the following order:
> > > > + * 1. Delegate to APEI/GHES to see if SEA can be claimed by them. If so, we
> > > > + *    are all done.
> > > > + * 2. If userspace opts in KVM_CAP_ARM_SIGBUS_ON_SEA, and if the SEA is NOT
> > > > + *    about translation table, send SIGBUS
> > > > + *    - si_code is BUS_OBJERR.
> > > > + *    - si_addr will be 0 when accurate HVA is unavailable.
> > > > + * 3. Otherwise, directly inject an async SError to guest.
> > > > + *
> > > > + * Note this applies to both ESR_ELx_EC_IABT_* and ESR_ELx_EC_DABT_*.
> > > >   */
> > > > -static inline int kvm_handle_guest_sea(phys_addr_t addr, u64 esr)
> > > > -{
> > > > -     /* apei_claim_sea(NULL) expects to mask interrupts itself */
> > > > -     lockdep_assert_irqs_enabled();
> > > > -
> > > > -     return apei_claim_sea(NULL);
> > > > -}
> > > > +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu);
> > > >
> > > >  #endif /* __ARM64_KVM_RAS_H__ */
> > > > diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
> > > > index 3cf7adb2b5038..c4a3a6d4870e6 100644
> > > > --- a/arch/arm64/kvm/Makefile
> > > > +++ b/arch/arm64/kvm/Makefile
> > > > @@ -23,7 +23,7 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
> > > >        vgic/vgic-v3.o vgic/vgic-v4.o \
> > > >        vgic/vgic-mmio.o vgic/vgic-mmio-v2.o \
> > > >        vgic/vgic-mmio-v3.o vgic/vgic-kvm-device.o \
> > > > -      vgic/vgic-its.o vgic/vgic-debug.o
> > > > +      vgic/vgic-its.o vgic/vgic-debug.o kvm_ras.o
> > > >
> > > >  kvm-$(CONFIG_HW_PERF_EVENTS)  += pmu-emul.o pmu.o
> > > >  kvm-$(CONFIG_ARM64_PTR_AUTH)  += pauth.o
> > > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > > > index 48cafb65d6acf..bb97ad678dbec 100644
> > > > --- a/arch/arm64/kvm/arm.c
> > > > +++ b/arch/arm64/kvm/arm.c
> > > > @@ -151,6 +151,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> > > >               }
> > > >               mutex_unlock(&kvm->slots_lock);
> > > >               break;
> > > > +     case KVM_CAP_ARM_SIGBUS_ON_SEA:
> > > > +             r = 0;
> > > > +             set_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA, &kvm->arch.flags);
> > >
> > > Shouldn't this be somehow gated on the VM being RAS aware?
> >
> > Do you mean a CAP that VMM can tell KVM the VM guest has RAS ability?
> > I don't know if there is one for arm64. On x86 there is
> > KVM_X86_SETUP_MCE. KVM_CAP_ARM_INJECT_EXT_DABT maybe a revelant one
> > but I don't think it is exactly the one for "RAS ability".
>
> Having though about this a bit more, I now think this is independent
> of the guest supporting RAS. This really is about the VMM asking to be
> made aware of RAS errors affecting the guest, and it is the signalling
> back to the guest that needs to be gated by ID_AA64PFR0_EL1.RAS.

Just to make sure I fully catch you. I think ID_AA64PFR0_EL1.RAS
translates to ARM64_HAS_RAS_EXTN in the kernel. If VMM signals RAS
error back to the guest with SEA, are you suggesting
__kvm_arm_vcpu_set_events should check
cpus_have_final_cap(ARM64_HAS_RAS_EXTN) before it
kvm_inject_dabt(vcpu)?

If so, how could __kvm_arm_vcpu_set_events know if the error is about
RAS (e.g. memory error) vs about accessing memory not in a memslot
(i.e. KVM_EXIT_ARM_NISV)? I guess KVM needs to look at ESR_EL2 again
(e.g. kvm_vcpu_abt_issea vs kvm_vcpu_dabt_isvalid)?

>
> >
> > >
> > > > +             break;
> > > >       default:
> > > >               break;
> > > >       }
> > > > @@ -339,6 +343,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> > > >       case KVM_CAP_ARM_SYSTEM_SUSPEND:
> > > >       case KVM_CAP_IRQFD_RESAMPLE:
> > > >       case KVM_CAP_COUNTER_OFFSET:
> > > > +     case KVM_CAP_ARM_SIGBUS_ON_SEA:
> > > >               r = 1;
> > > >               break;
> > > >       case KVM_CAP_SET_GUEST_DEBUG2:
> > > > diff --git a/arch/arm64/kvm/kvm_ras.c b/arch/arm64/kvm/kvm_ras.c
> > > > new file mode 100644
> > > > index 0000000000000..3225462bcbcda
> > > > --- /dev/null
> > > > +++ b/arch/arm64/kvm/kvm_ras.c
> > > > @@ -0,0 +1,77 @@
> > > > +// SPDX-License-Identifier: GPL-2.0-only
> > > > +
> > > > +#include <linux/bitops.h>
> > > > +#include <linux/kvm_host.h>
> > > > +
> > > > +#include <asm/kvm_emulate.h>
> > > > +#include <asm/kvm_ras.h>
> > > > +#include <asm/system_misc.h>
> > > > +
> > > > +/*
> > > > + * For synchrnous external instruction or data abort, not on translation
> > > > + * table walk or hardware update of translation table, is FAR_EL2 valid?
> > > > + */
> > > > +static inline bool kvm_vcpu_sea_far_valid(const struct kvm_vcpu *vcpu)
> > > > +{
> > > > +     return !(vcpu->arch.fault.esr_el2 & ESR_ELx_FnV);
> > > > +}
> > > > +
> > > > +/*
> > > > + * Was this synchronous external abort a RAS notification?
> > > > + * Returns '0' for errors handled by some RAS subsystem, or -ENOENT.
> > > > + */
> > > > +static int kvm_delegate_guest_sea(phys_addr_t addr, u64 esr)
> > > > +{
> > > > +     /* apei_claim_sea(NULL) expects to mask interrupts itself */
> > > > +     lockdep_assert_irqs_enabled();
> > > > +     return apei_claim_sea(NULL);
> > > > +}
> > > > +
> > > > +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
> > > > +{
> > > > +     bool sigbus_on_sea;
> > > > +     int idx;
> > > > +     u64 vcpu_esr = kvm_vcpu_get_esr(vcpu);
> > > > +     u8 fsc = kvm_vcpu_trap_get_fault(vcpu);
> > > > +     phys_addr_t fault_ipa = kvm_vcpu_get_fault_ipa(vcpu);
> > > > +     gfn_t gfn = fault_ipa >> PAGE_SHIFT;
> > > > +     /* When FnV is set, send 0 as si_addr like what do_sea() does. */
> > > > +     unsigned long hva = 0UL;
> > > > +
> > > > +     /*
> > > > +      * For RAS the host kernel may handle this abort.
> > > > +      * There is no need to SIGBUS VMM, or pass the error into the guest.
> > > > +      */
> > > > +     if (kvm_delegate_guest_sea(fault_ipa, vcpu_esr) == 0)
> > > > +             return;
> > > > +
> > > > +     sigbus_on_sea = test_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA,
> > > > +                              &(vcpu->kvm->arch.flags));
> > > > +
> > > > +     /*
> > > > +      * In addition to userspace opt-in, SIGBUS only makes sense if the
> > > > +      * abort is NOT about translation table walk and NOT about hardware
> > > > +      * update of translation table.
> > > > +      */
> > > > +     sigbus_on_sea &= (fsc == ESR_ELx_FSC_EXTABT || fsc == ESR_ELx_FSC_SECC);
> > > > +
> > > > +     /* Pass the error directly into the guest. */
> > > > +     if (!sigbus_on_sea) {
> > > > +             kvm_inject_vabt(vcpu);
> > > > +             return;
> > > > +     }
> > > > +
> > > > +     if (kvm_vcpu_sea_far_valid(vcpu)) {
> > > > +             idx = srcu_read_lock(&vcpu->kvm->srcu);
> > > > +             hva = gfn_to_hva(vcpu->kvm, gfn);
> > > > +             srcu_read_unlock(&vcpu->kvm->srcu, idx);
> > > > +     }
> > > > +
> > > > +     /*
> > > > +      * Send a SIGBUS BUS_OBJERR to vCPU thread (the userspace thread that
> > > > +      * runs KVM_RUN) or VMM, which aligns with what host kernel do_sea()
> > > > +      * does if apei_claim_sea() fails.
> > > > +      */
> > > > +     arm64_notify_die("synchronous external abort",
> > > > +                      current_pt_regs(), SIGBUS, BUS_OBJERR, hva, vcpu_esr);
> > >
> > > This is the point where I really think we should simply trigger an
> > > exit with all that syndrome information stashed in kvm_run, like any
> > > other event requiring userspace help.
> >
> > Ah, there is another reason SIGBUS is better than kvm exit: "It is a
> > programming error to set ext_dabt_pending after an exit which was not
> > either KVM_EXIT_MMIO or KVM_EXIT_ARM_NISV", from
> > Documentation/virt/kvm/api.rst. So if VMM is allowed to inject data
> > abort to guest, at least current documentation doesn't suggest kvm
> > exit is feasible.
>
> I really can't make the link between the two. If you take this to the
> letter, then you can't inject an external abort on the back of a
> signal handler either.

Sorry for the confusion, I think my understanding of the doc and what
I said were both wrong. It seems to me that what the doc is suggesting
is, the use case of ext_dabt_pending (injecting SEA into guest) for
now is only to enable VMM to handle KVM_EXIT_MMIO and
KVM_EXIT_ARM_NISV.

With the new SIGBUS for memory error, is it fair to update the doc to
say another use case of ext_dabt_pending is to enable VMM to signal
guest of RAS error?

>
> Also, you're obviously changing the UAPI, so everything is fair game
> provided that you don't break backward compatibility.
>
> >
> > >
> > > Also: where is the documentation?
> >
> > Once I get more positive feedback and send out PATCH instead of RFC, I
> > can add to Documentation/virt/kvm/api.rst.
>
> Sorry, but that's a key point for reviewing a UAPI change.

Let me include the doc for the SIGBUS feature in v2.

>
>         M.
>
> --
> Without deviation from the norm, progress is not possible.