Re: [PATCH v4 20/21] KVM: arm64: Take any host SError before entering the guest

Christoffer Dall <cdall@xxxxxxxxxx> · Wed, 1 Nov 2017 05:55:50 +0100

On Tue, Oct 31, 2017 at 11:43:42AM +0000, James Morse wrote:
> Hi Christoffer,
> 
> On 31/10/17 06:23, Christoffer Dall wrote:
> > On Thu, Oct 19, 2017 at 03:58:06PM +0100, James Morse wrote:
> >> On VHE systems KVM masks SError before switching the VBAR value. Any
> >> host RAS error that the CPU knew about before world-switch may become
> >> pending as an SError during world-switch, and only be taken once we enter
> >> the guest.
> >>
> >> Until KVM can take RAS SErrors during world switch, add an ESB to
> >> force any RAS errors to be synchronised and taken on the host before
> >> we enter world switch.
> >>
> >> RAS errors that become pending during world switch are still taken
> >> once we enter the guest.
> 
> >> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> >> index cf5d78ba14b5..5dc6f2877762 100644
> >> --- a/arch/arm64/include/asm/kvm_host.h
> >> +++ b/arch/arm64/include/asm/kvm_host.h
> >> @@ -392,6 +392,7 @@ static inline void __cpu_init_stage2(void)
> >>  
> >>  static inline void kvm_arm_vhe_guest_enter(void)
> >>  {
> >> +	esb();
> 
> > I don't fully appreciate what the point of this is?
> > 
> > As I understand it, our fundamental goal here is to try to distinguish
> > between errors happening on the host or in the guest.
> 
> Not just host/guest, but also those we can and can't handle.
> 
> KVM can't currently take an SError during world switch, so a RAS error that the
> CPU was hoping to defer may spread from the host into KVM's
> no-SError:world-switch code. If this happens it will (almost certainly) have to
> be re-classified as uncontainable.
> 
> There is also a firmware-first angle here: NOTIFY_SEI can't be delivered if the
> normal world has SError masked, so any error that spreads past this point
> becomes a reboot-by-firmware instead of an OS notification and almost-helpful
> error message.
> 
> 
> > If that's correct, then why don't we do it at the last possible moment
> > when we still have a scratch register left, in the world switch code
> > itself, and in that case abort the guest entry and report back a "host
> > SError" return code?
> 
> We have IESB to run the error-barrier as we enter the guest. This would make any
> host error pending as an SError, and we would exit the guest immediately. But if
> there was a RAS error during world switch, by this point it's likely to be
> classified as uncontainable.
> 
> This esb() is trying to keep this window of code as small as possible, so that
> it only covers errors that occur during world switch.
> 
> With your vcpu load/save this window becomes a lot smaller, it may be possible
> to get a VHE-host's arch-code SError handler to take errors from EL2, in which
> case this barrier can disappear.
> (note to self: guest may still own the debug hardware)
> 

OK, thanks for your detailed explanation.  I hadn't considered that the
classification of a RAS error as containable vs. uncontainable
depends on where we take the exception.
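
A toy model of that point, as a sketch only: this is not kernel code, and
every name in it (the phases, the outcomes, classify()) is made up for
illustration. It assumes, as described above, that the esb() only
synchronises errors the CPU already knows about before world switch, and
that anything else stays pending until SError is unmasked on guest entry.

```c
/* Hypothetical phase in which the CPU first learns about a RAS error. */
enum phase { PHASE_HOST, PHASE_WORLD_SWITCH };

/* Hypothetical label for where the resulting SError is finally taken. */
enum outcome { TAKEN_ON_HOST, TAKEN_ON_GUEST_ENTRY };

/*
 * 'esb_before_switch' models the esb() added to kvm_arm_vhe_guest_enter().
 * SError is masked during world switch in this model, so an error that is
 * not synchronised on the host can only be taken once we enter the guest.
 */
enum outcome classify(enum phase raised, int esb_before_switch)
{
	/* The barrier forces already-known host errors to be taken on the host. */
	if (raised == PHASE_HOST && esb_before_switch)
		return TAKEN_ON_HOST;

	/* Everything else has crossed (or arisen in) the no-SError window. */
	return TAKEN_ON_GUEST_ENTRY;
}

/* Human-readable name for an outcome, for printing from a test harness. */
const char *outcome_name(enum outcome o)
{
	return o == TAKEN_ON_HOST ? "taken on host" : "taken on guest entry";
}
```

In this model only the (PHASE_HOST, esb) case ends up on the host; errors
that become pending during world switch are still taken on guest entry
either way, which matches the last paragraph of the commit message.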

Acked-by: Christoffer Dall <christoffer.dall@xxxxxxxxxx>
