Hi Jiaqi, Thank you for sending this out. On Thu, Oct 31, 2024 at 09:21:04PM +0000, Jiaqi Yan wrote: > Currently KVM handles SEA in guest by injecting async SError into > guest directly, bypassing VMM, usually results in guest kernel panic. > > One major situation of guest SEA is when vCPU consumes uncorrectable > memory error on the physical memory. Although SError and guest kernel > panic effectively stops the propagation of corrupted memory, it is not > easy for VMM and guest to recover from memory error in a more graceful > manner. > > Alternatively KVM can send a SIGBUS BUS_OBJERR to VMM/vCPU, just like > how core kernel signals SIGBUS BUS_OBJERR to the poison consuming > thread. > In addition to the benifit that KVM's handling for SEA becomes aligned > with core kernel behavior > - The blast radius in VM can be limited to only the consuming thread > in guest, instead of entire guest kernel, unless the consumption is > from guest kernel. > - VMM now has the chance to do its duties to stop the VM from repeatedly > consuming corrupted data. For example, VMM can unmap the guest page > from stage-2 table to intercept forseen memory poison consumption, > and for every consumption injects SEA to EL1 with synthetic memory > error CPER. > > Introduce a new KVM ARM capability KVM_CAP_ARM_SIGBUS_ON_SEA. VMM > can opt in this new capability if it prefers SIGBUS than SError > injection during VM init. Now SEA handling in KVM works as follows: I'm somewhat tempted to force the new behavior on userspace unconditionally. Working back from an unexpected SError in the VM to the KVM SEA handler is a bit of a mess, and can be annoying if the operator can't access console logs of the VM. As it stands today, UAPI expectations around SEAs are platform dependent. If APEI claims the SEA and decides to offline a page, the user will get a SIGBUS. So sending a SIGBUS for the case that firmware _doesn't_ claim the SEA seems like a good move from a consistency PoV. But it is a decently-sized change to do without explicit buy-in from userspace so let's see what others think. > 1. Delegate to APEI/GHES to see if SEA can be claimed by them. > 2. If APEI failed to claim the SEA and KVM_CAP_ARM_SIGBUS_ON_SEA is > enabled for the VM, and the SEA is NOT about translation table, > send SIGBUS BUS_OBJERR signal with host virtual address. > 3. Otherwise directly inject async SError to guest. The other reason I'm a bit lukewarm on user buy in is the UAPI suffers from the same issue we do today: it depends on the platform. If the SEA is claimed by APEI/GHES then the cap does nothing. > +static int kvm_delegate_guest_sea(phys_addr_t addr, u64 esr) > +{ > + /* apei_claim_sea(NULL) expects to mask interrupts itself */ > + lockdep_assert_irqs_enabled(); > + return apei_claim_sea(NULL); > +} Consider dropping parameters from this since they're unused. > +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu) > +{ > + bool sigbus_on_sea; > + int idx; > + u64 vcpu_esr = kvm_vcpu_get_esr(vcpu); > + u8 fsc = kvm_vcpu_trap_get_fault(vcpu); > + phys_addr_t fault_ipa = kvm_vcpu_get_fault_ipa(vcpu); > + gfn_t gfn = fault_ipa >> PAGE_SHIFT; > + /* When FnV is set, send 0 as si_addr like what do_sea() does. */ > + unsigned long hva = 0UL; > + > + /* > + * For RAS the host kernel may handle this abort. > + * There is no need to SIGBUS VMM, or pass the error into the guest. > + */ > + if (kvm_delegate_guest_sea(fault_ipa, vcpu_esr) == 0) > + return; > + > + sigbus_on_sea = test_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA, > + &(vcpu->kvm->arch.flags)); > + > + /* > + * In addition to userspace opt-in, SIGBUS only makes sense if the > + * abort is NOT about translation table walk and NOT about hardware > + * update of translation table. > + */ > + sigbus_on_sea &= (fsc == ESR_ELx_FSC_EXTABT || fsc == ESR_ELx_FSC_SECC); Is this because we potentially can't determine a valid HVA for the fault? Maybe these should go out to userspace still with si_addr = 0. -- Thanks, Oliver