Hi Oliver, Sorry for getting back to you late... On Thu, Oct 31, 2024 at 2:46 PM Oliver Upton <oliver.upton@xxxxxxxxx> wrote: > > Hi Jiaqi, > > Thank you for sending this out. > > On Thu, Oct 31, 2024 at 09:21:04PM +0000, Jiaqi Yan wrote: > > Currently KVM handles SEA in guest by injecting async SError into > > guest directly, bypassing VMM, usually results in guest kernel panic. > > > > One major situation of guest SEA is when vCPU consumes uncorrectable > > memory error on the physical memory. Although SError and guest kernel > > panic effectively stops the propagation of corrupted memory, it is not > > easy for VMM and guest to recover from memory error in a more graceful > > manner. > > > > Alternatively KVM can send a SIGBUS BUS_OBJERR to VMM/vCPU, just like > > how core kernel signals SIGBUS BUS_OBJERR to the poison consuming > > thread. > > In addition to the benifit that KVM's handling for SEA becomes aligned > > with core kernel behavior > > - The blast radius in VM can be limited to only the consuming thread > > in guest, instead of entire guest kernel, unless the consumption is > > from guest kernel. > > - VMM now has the chance to do its duties to stop the VM from repeatedly > > consuming corrupted data. For example, VMM can unmap the guest page > > from stage-2 table to intercept forseen memory poison consumption, > > and for every consumption injects SEA to EL1 with synthetic memory > > error CPER. > > > > Introduce a new KVM ARM capability KVM_CAP_ARM_SIGBUS_ON_SEA. VMM > > can opt in this new capability if it prefers SIGBUS than SError > > injection during VM init. Now SEA handling in KVM works as follows: > > I'm somewhat tempted to force the new behavior on userspace > unconditionally. Working back from an unexpected SError in the VM to the > KVM SEA handler is a bit of a mess, and can be annoying if the operator > can't access console logs of the VM. Ack, I also think involving VMM is preferable than injecting SError directly to guest. > > As it stands today, UAPI expectations around SEAs are platform > dependent. If APEI claims the SEA and decides to offline a page, the > user will get a SIGBUS. > > So sending a SIGBUS for the case that firmware _doesn't_ claim the SEA > seems like a good move from a consistency PoV. But it is a decently-sized > change to do without explicit buy-in from userspace so let's see what > others think. Sounds good, I will wait for a couple of more days, and if the opt-in part isn't necessary to anyone else, I will remove it from the next revision. > > > 1. Delegate to APEI/GHES to see if SEA can be claimed by them. > > 2. If APEI failed to claim the SEA and KVM_CAP_ARM_SIGBUS_ON_SEA is > > enabled for the VM, and the SEA is NOT about translation table, > > send SIGBUS BUS_OBJERR signal with host virtual address. > > 3. Otherwise directly inject async SError to guest. > > The other reason I'm a bit lukewarm on user buy in is the UAPI suffers > from the same issue we do today: it depends on the platform. If the SEA > is claimed by APEI/GHES then the cap does nothing. Good point, yeah, the path of KVM handling SEA and sending SIGBUS should be treated as fallback code to APEI/GHES. > > > +static int kvm_delegate_guest_sea(phys_addr_t addr, u64 esr) > > +{ > > + /* apei_claim_sea(NULL) expects to mask interrupts itself */ > > + lockdep_assert_irqs_enabled(); > > + return apei_claim_sea(NULL); > > +} > > Consider dropping parameters from this since they're unused. Ack, will do in the next revision. > > > +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu) > > +{ > > + bool sigbus_on_sea; > > + int idx; > > + u64 vcpu_esr = kvm_vcpu_get_esr(vcpu); > > + u8 fsc = kvm_vcpu_trap_get_fault(vcpu); > > + phys_addr_t fault_ipa = kvm_vcpu_get_fault_ipa(vcpu); > > + gfn_t gfn = fault_ipa >> PAGE_SHIFT; > > + /* When FnV is set, send 0 as si_addr like what do_sea() does. */ > > + unsigned long hva = 0UL; > > + > > + /* > > + * For RAS the host kernel may handle this abort. > > + * There is no need to SIGBUS VMM, or pass the error into the guest. > > + */ > > + if (kvm_delegate_guest_sea(fault_ipa, vcpu_esr) == 0) > > + return; > > + > > + sigbus_on_sea = test_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA, > > + &(vcpu->kvm->arch.flags)); > > + > > + /* > > + * In addition to userspace opt-in, SIGBUS only makes sense if the > > + * abort is NOT about translation table walk and NOT about hardware > > + * update of translation table. > > + */ > > + sigbus_on_sea &= (fsc == ESR_ELx_FSC_EXTABT || fsc == ESR_ELx_FSC_SECC); > > Is this because we potentially can't determine a valid HVA for the > fault? Maybe these should go out to userspace still with si_addr = 0. No, it is not related to the availability of a valid HVA. In this patch, as long as it decides to sigbus_on_sea, si_addr can be 0 if a valid HVA isn't available when !kvm_vcpu_sea_far_valid (OR when HPFAR_EL2 cannot be translated from a valid FAR_EL2, which I think requires some improvement). The code here wants to limit SIGBUS _BUS_OBJERR_ to only SEA and parity+ECC error, similar to the code here (for SEA) https://elixir.bootlin.com/linux/v6.11.6/source/arch/arm64/mm/fault.c#L771 and here (for synchronous parity or ECC error) https://elixir.bootlin.com/linux/v6.11.6/source/arch/arm64/mm/fault.c#L779. > > -- > Thanks, > Oliver