Re: [RFC PATCH v1] KVM: arm64: Introduce KVM_CAP_ARM_SIGBUS_ON_SEA

Jiaqi Yan <jiaqiyan@xxxxxxxxxx> · Thu, 7 Nov 2024 20:54:43 -0800

Hi Oliver,

Sorry for getting back to you late...

On Thu, Oct 31, 2024 at 2:46 PM Oliver Upton <oliver.upton@xxxxxxxxx> wrote:
>
> Hi Jiaqi,
>
> Thank you for sending this out.
>
> On Thu, Oct 31, 2024 at 09:21:04PM +0000, Jiaqi Yan wrote:
> > Currently KVM handles SEA in guest by injecting async SError into
> > guest directly, bypassing VMM, usually results in guest kernel panic.
> >
> > One major situation of guest SEA is when vCPU consumes uncorrectable
> > memory error on the physical memory. Although SError and guest kernel
> > panic effectively stops the propagation of corrupted memory, it is not
> > easy for VMM and guest to recover from memory error in a more graceful
> > manner.
> >
> > Alternatively KVM can send a SIGBUS BUS_OBJERR to VMM/vCPU, just like
> > how core kernel signals SIGBUS BUS_OBJERR to the poison consuming
> > thread.
> > In addition to the benifit that KVM's handling for SEA becomes aligned
> > with core kernel behavior
> > - The blast radius in VM can be limited to only the consuming thread
> >   in guest, instead of entire guest kernel, unless the consumption is
> >   from guest kernel.
> > - VMM now has the chance to do its duties to stop the VM from repeatedly
> >   consuming corrupted data. For example, VMM can unmap the guest page
> >   from stage-2 table to intercept forseen memory poison consumption,
> >   and for every consumption injects SEA to EL1 with synthetic memory
> >   error CPER.
> >
> > Introduce a new KVM ARM capability KVM_CAP_ARM_SIGBUS_ON_SEA. VMM
> > can opt in this new capability if it prefers SIGBUS than SError
> > injection during VM init. Now SEA handling in KVM works as follows:
>
> I'm somewhat tempted to force the new behavior on userspace
> unconditionally. Working back from an unexpected SError in the VM to the
> KVM SEA handler is a bit of a mess, and can be annoying if the operator
> can't access console logs of the VM.

Ack, I also think involving VMM is preferable than injecting SError
directly to guest.

>
> As it stands today, UAPI expectations around SEAs are platform
> dependent. If APEI claims the SEA and decides to offline a page, the
> user will get a SIGBUS.
>
> So sending a SIGBUS for the case that firmware _doesn't_ claim the SEA
> seems like a good move from a consistency PoV. But it is a decently-sized
> change to do without explicit buy-in from userspace so let's see what
> others think.

Sounds good, I will wait for a couple of more days, and if the opt-in
part isn't necessary to anyone else, I will remove it from the next
revision.

>
> > 1. Delegate to APEI/GHES to see if SEA can be claimed by them.
> > 2. If APEI failed to claim the SEA and KVM_CAP_ARM_SIGBUS_ON_SEA is
> >    enabled for the VM, and the SEA is NOT about translation table,
> >    send SIGBUS BUS_OBJERR signal with host virtual address.
> > 3. Otherwise directly inject async SError to guest.
>
> The other reason I'm a bit lukewarm on user buy in is the UAPI suffers
> from the same issue we do today: it depends on the platform. If the SEA
> is claimed by APEI/GHES then the cap does nothing.

Good point, yeah, the path of KVM handling SEA and sending SIGBUS
should be treated as fallback code to APEI/GHES.

>
> > +static int kvm_delegate_guest_sea(phys_addr_t addr, u64 esr)
> > +{
> > +     /* apei_claim_sea(NULL) expects to mask interrupts itself */
> > +     lockdep_assert_irqs_enabled();
> > +     return apei_claim_sea(NULL);
> > +}
>
> Consider dropping parameters from this since they're unused.

Ack, will do in the next revision.

>
> > +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
> > +{
> > +     bool sigbus_on_sea;
> > +     int idx;
> > +     u64 vcpu_esr = kvm_vcpu_get_esr(vcpu);
> > +     u8 fsc = kvm_vcpu_trap_get_fault(vcpu);
> > +     phys_addr_t fault_ipa = kvm_vcpu_get_fault_ipa(vcpu);
> > +     gfn_t gfn = fault_ipa >> PAGE_SHIFT;
> > +     /* When FnV is set, send 0 as si_addr like what do_sea() does. */
> > +     unsigned long hva = 0UL;
> > +
> > +     /*
> > +      * For RAS the host kernel may handle this abort.
> > +      * There is no need to SIGBUS VMM, or pass the error into the guest.
> > +      */
> > +     if (kvm_delegate_guest_sea(fault_ipa, vcpu_esr) == 0)
> > +             return;
> > +
> > +     sigbus_on_sea = test_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA,
> > +                              &(vcpu->kvm->arch.flags));
> > +
> > +     /*
> > +      * In addition to userspace opt-in, SIGBUS only makes sense if the
> > +      * abort is NOT about translation table walk and NOT about hardware
> > +      * update of translation table.
> > +      */
> > +     sigbus_on_sea &= (fsc == ESR_ELx_FSC_EXTABT || fsc == ESR_ELx_FSC_SECC);
>
> Is this because we potentially can't determine a valid HVA for the
> fault? Maybe these should go out to userspace still with si_addr = 0.

No, it is not related to the availability of a valid HVA. In this
patch, as long as it decides to sigbus_on_sea, si_addr can be 0 if a
valid HVA isn't available when !kvm_vcpu_sea_far_valid (OR when
HPFAR_EL2 cannot be translated from a valid FAR_EL2, which I think
requires some improvement).

The code here wants to limit SIGBUS _BUS_OBJERR_ to only SEA and
parity+ECC error, similar to the code here (for SEA)
https://elixir.bootlin.com/linux/v6.11.6/source/arch/arm64/mm/fault.c#L771
and here (for synchronous parity or ECC error)
https://elixir.bootlin.com/linux/v6.11.6/source/arch/arm64/mm/fault.c#L779.

>
> --
> Thanks,
> Oliver