Re: [RFC PATCH v1] KVM: arm64: Introduce KVM_CAP_ARM_SIGBUS_ON_SEA

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, 31 Oct 2024 21:21:04 +0000,
Jiaqi Yan <jiaqiyan@xxxxxxxxxx> wrote:
> 
> Currently KVM handles SEA in guest by injecting async SError into
> guest directly, bypassing VMM, usually results in guest kernel panic.
> 
> One major situation of guest SEA is when vCPU consumes uncorrectable
> memory error on the physical memory. Although SError and guest kernel
> panic effectively stops the propagation of corrupted memory, it is not
> easy for VMM and guest to recover from memory error in a more graceful
> manner.
> 
> Alternatively KVM can send a SIGBUS BUS_OBJERR to VMM/vCPU, just like
> how core kernel signals SIGBUS BUS_OBJERR to the poison consuming
> thread.

Can you elaborate on why the delivery of a signal is preferable to
simply exiting back to userspace with a description of the error?
Signals are usually not generated by KVM, and are a pretty twisted way
to generate an exit.

> In addition to the benifit that KVM's handling for SEA becomes aligned
> with core kernel behavior
> - The blast radius in VM can be limited to only the consuming thread
>   in guest, instead of entire guest kernel, unless the consumption is
>   from guest kernel.
> - VMM now has the chance to do its duties to stop the VM from repeatedly
>   consuming corrupted data. For example, VMM can unmap the guest page
>   from stage-2 table to intercept forseen memory poison consumption,

Not quite. The VMM doesn't manage stage-2. It can remove the page from
the VMA if it has it mapped, but that's it. The kernel deals with S2.

Which brings me to the next subject: when the kernel unmaps the page
at S2, it is likely to use CMOs. Can these CMOs create RAS error
themselves?

>   and for every consumption injects SEA to EL1 with synthetic memory
>   error CPER.

Why do we need to involve ACPI here? I would expect the production of
an architected error record instead. Or at least be given the option.

> Introduce a new KVM ARM capability KVM_CAP_ARM_SIGBUS_ON_SEA. VMM
> can opt in this new capability if it prefers SIGBUS than SError
> injection during VM init. Now SEA handling in KVM works as follows:
> 1. Delegate to APEI/GHES to see if SEA can be claimed by them.
> 2. If APEI failed to claim the SEA and KVM_CAP_ARM_SIGBUS_ON_SEA is
>    enabled for the VM, and the SEA is NOT about translation table,
>    send SIGBUS BUS_OBJERR signal with host virtual address.

And what if it is? S1 PTs are backed by userspace memory, like
anything else. I don't think we should have a different treatment of
those, because the HW wouldn't treat them differently either.

> 3. Otherwise directly inject async SError to guest.
>
> Tested on a machine running Siryn AmpereOne processor. A in-house VMM
> that opts in KVM_CAP_ARM_SIGBUS_ON_SEA started a VM. A dummy application
> in VM allocated some memory buffer. The test used EINJ to inject an
> uncorrectable recoverable memory error at a page in the allocated memory
> buffer. The dummy application then consumed the memory error. Some hack
> was done to make core kernel's memory_failure triggered by poison
> generation to fail, so KVM had to deal with the SEA guest abort due to
> poison consumption. vCPU thread in VMM received SIGBUS BUS_OBJERR with
> valid host virtual address of the poisoned page. VMM then injected a SEA
> into guest using KVM_SET_VCPU_EVENTS with ext_dabt_pending=1. At last
> the dummy application in guest was killed by SIGBUS BUS_OBJERR, while the
> guest survived and continued to run.
> 
> Signed-off-by: Jiaqi Yan <jiaqiyan@xxxxxxxxxx>
> ---
>  arch/arm64/include/asm/kvm_host.h |  2 +
>  arch/arm64/include/asm/kvm_ras.h  | 20 ++++----
>  arch/arm64/kvm/Makefile           |  2 +-
>  arch/arm64/kvm/arm.c              |  5 ++
>  arch/arm64/kvm/kvm_ras.c          | 77 +++++++++++++++++++++++++++++++
>  arch/arm64/kvm/mmu.c              |  8 +---
>  include/uapi/linux/kvm.h          |  1 +
>  7 files changed, 98 insertions(+), 17 deletions(-)
>  create mode 100644 arch/arm64/kvm/kvm_ras.c
> 
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index bf64fed9820ea..eb37a2489411a 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -334,6 +334,8 @@ struct kvm_arch {
>  	/* Fine-Grained UNDEF initialised */
>  #define KVM_ARCH_FLAG_FGU_INITIALIZED			8
>  	unsigned long flags;
> +	/* Instead of injecting SError into guest, SIGBUS VMM */
> +#define KVM_ARCH_FLAG_SIGBUS_ON_SEA			9

nit: why do you put this definition out of sequence (below 'flags')?

>  
>  	/* VM-wide vCPU feature set */
>  	DECLARE_BITMAP(vcpu_features, KVM_VCPU_MAX_FEATURES);
> diff --git a/arch/arm64/include/asm/kvm_ras.h b/arch/arm64/include/asm/kvm_ras.h
> index 87e10d9a635b5..4bb7a424e3f6c 100644
> --- a/arch/arm64/include/asm/kvm_ras.h
> +++ b/arch/arm64/include/asm/kvm_ras.h
> @@ -11,15 +11,17 @@
>  #include <asm/acpi.h>
>  
>  /*
> - * Was this synchronous external abort a RAS notification?
> - * Returns '0' for errors handled by some RAS subsystem, or -ENOENT.
> + * Handle synchronous external abort (SEA) in the following order:
> + * 1. Delegate to APEI/GHES to see if SEA can be claimed by them. If so, we
> + *    are all done.
> + * 2. If userspace opts in KVM_CAP_ARM_SIGBUS_ON_SEA, and if the SEA is NOT
> + *    about translation table, send SIGBUS
> + *    - si_code is BUS_OBJERR.
> + *    - si_addr will be 0 when accurate HVA is unavailable.
> + * 3. Otherwise, directly inject an async SError to guest.
> + *
> + * Note this applies to both ESR_ELx_EC_IABT_* and ESR_ELx_EC_DABT_*.
>   */
> -static inline int kvm_handle_guest_sea(phys_addr_t addr, u64 esr)
> -{
> -	/* apei_claim_sea(NULL) expects to mask interrupts itself */
> -	lockdep_assert_irqs_enabled();
> -
> -	return apei_claim_sea(NULL);
> -}
> +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu);
>  
>  #endif /* __ARM64_KVM_RAS_H__ */
> diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
> index 3cf7adb2b5038..c4a3a6d4870e6 100644
> --- a/arch/arm64/kvm/Makefile
> +++ b/arch/arm64/kvm/Makefile
> @@ -23,7 +23,7 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
>  	 vgic/vgic-v3.o vgic/vgic-v4.o \
>  	 vgic/vgic-mmio.o vgic/vgic-mmio-v2.o \
>  	 vgic/vgic-mmio-v3.o vgic/vgic-kvm-device.o \
> -	 vgic/vgic-its.o vgic/vgic-debug.o
> +	 vgic/vgic-its.o vgic/vgic-debug.o kvm_ras.o
>  
>  kvm-$(CONFIG_HW_PERF_EVENTS)  += pmu-emul.o pmu.o
>  kvm-$(CONFIG_ARM64_PTR_AUTH)  += pauth.o
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 48cafb65d6acf..bb97ad678dbec 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -151,6 +151,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>  		}
>  		mutex_unlock(&kvm->slots_lock);
>  		break;
> +	case KVM_CAP_ARM_SIGBUS_ON_SEA:
> +		r = 0;
> +		set_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA, &kvm->arch.flags);

Shouldn't this be somehow gated on the VM being RAS aware?

> +		break;
>  	default:
>  		break;
>  	}
> @@ -339,6 +343,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  	case KVM_CAP_ARM_SYSTEM_SUSPEND:
>  	case KVM_CAP_IRQFD_RESAMPLE:
>  	case KVM_CAP_COUNTER_OFFSET:
> +	case KVM_CAP_ARM_SIGBUS_ON_SEA:
>  		r = 1;
>  		break;
>  	case KVM_CAP_SET_GUEST_DEBUG2:
> diff --git a/arch/arm64/kvm/kvm_ras.c b/arch/arm64/kvm/kvm_ras.c
> new file mode 100644
> index 0000000000000..3225462bcbcda
> --- /dev/null
> +++ b/arch/arm64/kvm/kvm_ras.c
> @@ -0,0 +1,77 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +#include <linux/bitops.h>
> +#include <linux/kvm_host.h>
> +
> +#include <asm/kvm_emulate.h>
> +#include <asm/kvm_ras.h>
> +#include <asm/system_misc.h>
> +
> +/*
> + * For synchrnous external instruction or data abort, not on translation
> + * table walk or hardware update of translation table, is FAR_EL2 valid?
> + */
> +static inline bool kvm_vcpu_sea_far_valid(const struct kvm_vcpu *vcpu)
> +{
> +	return !(vcpu->arch.fault.esr_el2 & ESR_ELx_FnV);
> +}
> +
> +/*
> + * Was this synchronous external abort a RAS notification?
> + * Returns '0' for errors handled by some RAS subsystem, or -ENOENT.
> + */
> +static int kvm_delegate_guest_sea(phys_addr_t addr, u64 esr)
> +{
> +	/* apei_claim_sea(NULL) expects to mask interrupts itself */
> +	lockdep_assert_irqs_enabled();
> +	return apei_claim_sea(NULL);
> +}
> +
> +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
> +{
> +	bool sigbus_on_sea;
> +	int idx;
> +	u64 vcpu_esr = kvm_vcpu_get_esr(vcpu);
> +	u8 fsc = kvm_vcpu_trap_get_fault(vcpu);
> +	phys_addr_t fault_ipa = kvm_vcpu_get_fault_ipa(vcpu);
> +	gfn_t gfn = fault_ipa >> PAGE_SHIFT;
> +	/* When FnV is set, send 0 as si_addr like what do_sea() does. */
> +	unsigned long hva = 0UL;
> +
> +	/*
> +	 * For RAS the host kernel may handle this abort.
> +	 * There is no need to SIGBUS VMM, or pass the error into the guest.
> +	 */
> +	if (kvm_delegate_guest_sea(fault_ipa, vcpu_esr) == 0)
> +		return;
> +
> +	sigbus_on_sea = test_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA,
> +				 &(vcpu->kvm->arch.flags));
> +
> +	/*
> +	 * In addition to userspace opt-in, SIGBUS only makes sense if the
> +	 * abort is NOT about translation table walk and NOT about hardware
> +	 * update of translation table.
> +	 */
> +	sigbus_on_sea &= (fsc == ESR_ELx_FSC_EXTABT || fsc == ESR_ELx_FSC_SECC);
> +
> +	/* Pass the error directly into the guest. */
> +	if (!sigbus_on_sea) {
> +		kvm_inject_vabt(vcpu);
> +		return;
> +	}
> +
> +	if (kvm_vcpu_sea_far_valid(vcpu)) {
> +		idx = srcu_read_lock(&vcpu->kvm->srcu);
> +		hva = gfn_to_hva(vcpu->kvm, gfn);
> +		srcu_read_unlock(&vcpu->kvm->srcu, idx);
> +	}
> +
> +	/*
> +	 * Send a SIGBUS BUS_OBJERR to vCPU thread (the userspace thread that
> +	 * runs KVM_RUN) or VMM, which aligns with what host kernel do_sea()
> +	 * does if apei_claim_sea() fails.
> +	 */
> +	arm64_notify_die("synchronous external abort",
> +			 current_pt_regs(), SIGBUS, BUS_OBJERR, hva, vcpu_esr);

This is the point where I really think we should simply trigger an
exit with all that syndrome information stashed in kvm_run, like any
other event requiring userspace help.

Also: where is the documentation?

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.




[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux