On Thu, 31 Oct 2024 21:21:04 +0000, Jiaqi Yan <jiaqiyan@xxxxxxxxxx> wrote: > > Currently KVM handles SEA in guest by injecting async SError into > guest directly, bypassing VMM, usually results in guest kernel panic. > > One major situation of guest SEA is when vCPU consumes uncorrectable > memory error on the physical memory. Although SError and guest kernel > panic effectively stops the propagation of corrupted memory, it is not > easy for VMM and guest to recover from memory error in a more graceful > manner. > > Alternatively KVM can send a SIGBUS BUS_OBJERR to VMM/vCPU, just like > how core kernel signals SIGBUS BUS_OBJERR to the poison consuming > thread. Can you elaborate on why the delivery of a signal is preferable to simply exiting back to userspace with a description of the error? Signals are usually not generated by KVM, and are a pretty twisted way to generate an exit. > In addition to the benifit that KVM's handling for SEA becomes aligned > with core kernel behavior > - The blast radius in VM can be limited to only the consuming thread > in guest, instead of entire guest kernel, unless the consumption is > from guest kernel. > - VMM now has the chance to do its duties to stop the VM from repeatedly > consuming corrupted data. For example, VMM can unmap the guest page > from stage-2 table to intercept forseen memory poison consumption, Not quite. The VMM doesn't manage stage-2. It can remove the page from the VMA if it has it mapped, but that's it. The kernel deals with S2. Which brings me to the next subject: when the kernel unmaps the page at S2, it is likely to use CMOs. Can these CMOs create RAS error themselves? > and for every consumption injects SEA to EL1 with synthetic memory > error CPER. Why do we need to involve ACPI here? I would expect the production of an architected error record instead. Or at least be given the option. > Introduce a new KVM ARM capability KVM_CAP_ARM_SIGBUS_ON_SEA. VMM > can opt in this new capability if it prefers SIGBUS than SError > injection during VM init. Now SEA handling in KVM works as follows: > 1. Delegate to APEI/GHES to see if SEA can be claimed by them. > 2. If APEI failed to claim the SEA and KVM_CAP_ARM_SIGBUS_ON_SEA is > enabled for the VM, and the SEA is NOT about translation table, > send SIGBUS BUS_OBJERR signal with host virtual address. And what if it is? S1 PTs are backed by userspace memory, like anything else. I don't think we should have a different treatment of those, because the HW wouldn't treat them differently either. > 3. Otherwise directly inject async SError to guest. > > Tested on a machine running Siryn AmpereOne processor. A in-house VMM > that opts in KVM_CAP_ARM_SIGBUS_ON_SEA started a VM. A dummy application > in VM allocated some memory buffer. The test used EINJ to inject an > uncorrectable recoverable memory error at a page in the allocated memory > buffer. The dummy application then consumed the memory error. Some hack > was done to make core kernel's memory_failure triggered by poison > generation to fail, so KVM had to deal with the SEA guest abort due to > poison consumption. vCPU thread in VMM received SIGBUS BUS_OBJERR with > valid host virtual address of the poisoned page. VMM then injected a SEA > into guest using KVM_SET_VCPU_EVENTS with ext_dabt_pending=1. At last > the dummy application in guest was killed by SIGBUS BUS_OBJERR, while the > guest survived and continued to run. > > Signed-off-by: Jiaqi Yan <jiaqiyan@xxxxxxxxxx> > --- > arch/arm64/include/asm/kvm_host.h | 2 + > arch/arm64/include/asm/kvm_ras.h | 20 ++++---- > arch/arm64/kvm/Makefile | 2 +- > arch/arm64/kvm/arm.c | 5 ++ > arch/arm64/kvm/kvm_ras.c | 77 +++++++++++++++++++++++++++++++ > arch/arm64/kvm/mmu.c | 8 +--- > include/uapi/linux/kvm.h | 1 + > 7 files changed, 98 insertions(+), 17 deletions(-) > create mode 100644 arch/arm64/kvm/kvm_ras.c > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h > index bf64fed9820ea..eb37a2489411a 100644 > --- a/arch/arm64/include/asm/kvm_host.h > +++ b/arch/arm64/include/asm/kvm_host.h > @@ -334,6 +334,8 @@ struct kvm_arch { > /* Fine-Grained UNDEF initialised */ > #define KVM_ARCH_FLAG_FGU_INITIALIZED 8 > unsigned long flags; > + /* Instead of injecting SError into guest, SIGBUS VMM */ > +#define KVM_ARCH_FLAG_SIGBUS_ON_SEA 9 nit: why do you put this definition out of sequence (below 'flags')? > > /* VM-wide vCPU feature set */ > DECLARE_BITMAP(vcpu_features, KVM_VCPU_MAX_FEATURES); > diff --git a/arch/arm64/include/asm/kvm_ras.h b/arch/arm64/include/asm/kvm_ras.h > index 87e10d9a635b5..4bb7a424e3f6c 100644 > --- a/arch/arm64/include/asm/kvm_ras.h > +++ b/arch/arm64/include/asm/kvm_ras.h > @@ -11,15 +11,17 @@ > #include <asm/acpi.h> > > /* > - * Was this synchronous external abort a RAS notification? > - * Returns '0' for errors handled by some RAS subsystem, or -ENOENT. > + * Handle synchronous external abort (SEA) in the following order: > + * 1. Delegate to APEI/GHES to see if SEA can be claimed by them. If so, we > + * are all done. > + * 2. If userspace opts in KVM_CAP_ARM_SIGBUS_ON_SEA, and if the SEA is NOT > + * about translation table, send SIGBUS > + * - si_code is BUS_OBJERR. > + * - si_addr will be 0 when accurate HVA is unavailable. > + * 3. Otherwise, directly inject an async SError to guest. > + * > + * Note this applies to both ESR_ELx_EC_IABT_* and ESR_ELx_EC_DABT_*. > */ > -static inline int kvm_handle_guest_sea(phys_addr_t addr, u64 esr) > -{ > - /* apei_claim_sea(NULL) expects to mask interrupts itself */ > - lockdep_assert_irqs_enabled(); > - > - return apei_claim_sea(NULL); > -} > +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu); > > #endif /* __ARM64_KVM_RAS_H__ */ > diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile > index 3cf7adb2b5038..c4a3a6d4870e6 100644 > --- a/arch/arm64/kvm/Makefile > +++ b/arch/arm64/kvm/Makefile > @@ -23,7 +23,7 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \ > vgic/vgic-v3.o vgic/vgic-v4.o \ > vgic/vgic-mmio.o vgic/vgic-mmio-v2.o \ > vgic/vgic-mmio-v3.o vgic/vgic-kvm-device.o \ > - vgic/vgic-its.o vgic/vgic-debug.o > + vgic/vgic-its.o vgic/vgic-debug.o kvm_ras.o > > kvm-$(CONFIG_HW_PERF_EVENTS) += pmu-emul.o pmu.o > kvm-$(CONFIG_ARM64_PTR_AUTH) += pauth.o > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c > index 48cafb65d6acf..bb97ad678dbec 100644 > --- a/arch/arm64/kvm/arm.c > +++ b/arch/arm64/kvm/arm.c > @@ -151,6 +151,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm, > } > mutex_unlock(&kvm->slots_lock); > break; > + case KVM_CAP_ARM_SIGBUS_ON_SEA: > + r = 0; > + set_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA, &kvm->arch.flags); Shouldn't this be somehow gated on the VM being RAS aware? > + break; > default: > break; > } > @@ -339,6 +343,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) > case KVM_CAP_ARM_SYSTEM_SUSPEND: > case KVM_CAP_IRQFD_RESAMPLE: > case KVM_CAP_COUNTER_OFFSET: > + case KVM_CAP_ARM_SIGBUS_ON_SEA: > r = 1; > break; > case KVM_CAP_SET_GUEST_DEBUG2: > diff --git a/arch/arm64/kvm/kvm_ras.c b/arch/arm64/kvm/kvm_ras.c > new file mode 100644 > index 0000000000000..3225462bcbcda > --- /dev/null > +++ b/arch/arm64/kvm/kvm_ras.c > @@ -0,0 +1,77 @@ > +// SPDX-License-Identifier: GPL-2.0-only > + > +#include <linux/bitops.h> > +#include <linux/kvm_host.h> > + > +#include <asm/kvm_emulate.h> > +#include <asm/kvm_ras.h> > +#include <asm/system_misc.h> > + > +/* > + * For synchrnous external instruction or data abort, not on translation > + * table walk or hardware update of translation table, is FAR_EL2 valid? > + */ > +static inline bool kvm_vcpu_sea_far_valid(const struct kvm_vcpu *vcpu) > +{ > + return !(vcpu->arch.fault.esr_el2 & ESR_ELx_FnV); > +} > + > +/* > + * Was this synchronous external abort a RAS notification? > + * Returns '0' for errors handled by some RAS subsystem, or -ENOENT. > + */ > +static int kvm_delegate_guest_sea(phys_addr_t addr, u64 esr) > +{ > + /* apei_claim_sea(NULL) expects to mask interrupts itself */ > + lockdep_assert_irqs_enabled(); > + return apei_claim_sea(NULL); > +} > + > +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu) > +{ > + bool sigbus_on_sea; > + int idx; > + u64 vcpu_esr = kvm_vcpu_get_esr(vcpu); > + u8 fsc = kvm_vcpu_trap_get_fault(vcpu); > + phys_addr_t fault_ipa = kvm_vcpu_get_fault_ipa(vcpu); > + gfn_t gfn = fault_ipa >> PAGE_SHIFT; > + /* When FnV is set, send 0 as si_addr like what do_sea() does. */ > + unsigned long hva = 0UL; > + > + /* > + * For RAS the host kernel may handle this abort. > + * There is no need to SIGBUS VMM, or pass the error into the guest. > + */ > + if (kvm_delegate_guest_sea(fault_ipa, vcpu_esr) == 0) > + return; > + > + sigbus_on_sea = test_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA, > + &(vcpu->kvm->arch.flags)); > + > + /* > + * In addition to userspace opt-in, SIGBUS only makes sense if the > + * abort is NOT about translation table walk and NOT about hardware > + * update of translation table. > + */ > + sigbus_on_sea &= (fsc == ESR_ELx_FSC_EXTABT || fsc == ESR_ELx_FSC_SECC); > + > + /* Pass the error directly into the guest. */ > + if (!sigbus_on_sea) { > + kvm_inject_vabt(vcpu); > + return; > + } > + > + if (kvm_vcpu_sea_far_valid(vcpu)) { > + idx = srcu_read_lock(&vcpu->kvm->srcu); > + hva = gfn_to_hva(vcpu->kvm, gfn); > + srcu_read_unlock(&vcpu->kvm->srcu, idx); > + } > + > + /* > + * Send a SIGBUS BUS_OBJERR to vCPU thread (the userspace thread that > + * runs KVM_RUN) or VMM, which aligns with what host kernel do_sea() > + * does if apei_claim_sea() fails. > + */ > + arm64_notify_die("synchronous external abort", > + current_pt_regs(), SIGBUS, BUS_OBJERR, hva, vcpu_esr); This is the point where I really think we should simply trigger an exit with all that syndrome information stashed in kvm_run, like any other event requiring userspace help. Also: where is the documentation? Thanks, M. -- Without deviation from the norm, progress is not possible.