On Thu, Nov 02, 2017 at 12:18:20PM +0000, James Morse wrote: > Hi Christoffer, > > On 01/11/17 04:55, Christoffer Dall wrote: > > On Tue, Oct 31, 2017 at 11:43:42AM +0000, James Morse wrote: > >> On 31/10/17 06:23, Christoffer Dall wrote: > >>> On Thu, Oct 19, 2017 at 03:58:06PM +0100, James Morse wrote: > >>>> On VHE systems KVM masks SError before switching the VBAR value. Any > >>>> host RAS error that the CPU knew about before world-switch may become > >>>> pending as an SError during world-switch, and only be taken once we enter > >>>> the guest. > >>>> > >>>> Until KVM can take RAS SErrors during world switch, add an ESB to > >>>> force any RAS errors to be synchronised and taken on the host before > >>>> we enter world switch. > >>>> > >>>> RAS errors that become pending during world switch are still taken > >>>> once we enter the guest. > >> > >>>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h > >>>> index cf5d78ba14b5..5dc6f2877762 100644 > >>>> --- a/arch/arm64/include/asm/kvm_host.h > >>>> +++ b/arch/arm64/include/asm/kvm_host.h > >>>> @@ -392,6 +392,7 @@ static inline void __cpu_init_stage2(void) > >>>> > >>>> static inline void kvm_arm_vhe_guest_enter(void) > >>>> { > >>>> + esb(); > >> > >>> I don't fully appreciate what the point of this is? > >>> > >>> As I understand it, our fundamental goal here is to try to distinguish > >>> between errors happening on the host or in the guest. > >> > >> Not just host/guest, but also those we can and can't handle. > >> > >> KVM can't currently take an SError during world switch, so a RAS error that the > >> CPU was hoping to defer may spread from the host into KVM's > >> no-SError:world-switch code. If this happens it will (almost certainly) have to > >> be re-classified as uncontainable. > >> > >> There is also a firmware-first angle here: NOTIFY_SEI can't be delivered if the > >> normal world has SError masked, so any error that spreads past this point > >> becomes a reboot-by-firmware instead of an OS notification and almost-helpful > >> error message. > >> > >> > >>> If that's correct, then why don't we do it at the last possible moment > >>> when we still have a scratch register left, in the world switch code > >>> itself, and in the case abort the guest entry and report back a "host > >>> SError" return code. > >> > >> We have IESB to run the error-barrier as we enter the guest. This would make any > >> host error pending as an SError, and we would exit the guest immediately. But if > >> there was an RAS error during world switch, by this point its likely to be > >> classified as uncontainable. > >> > >> This esb() is trying to keep this window of code as small as possible, to just > >> errors that occur during world switch. > >> > >> With your vcpu load/save this window becomes a lot smaller, it may be possible > >> to get a VHE-host's arch-code SError handler to take errors from EL2, in which > >> case this barrier can disappear. > >> (note to self: guest may still own the debug hardware) > >> > > > > ok, thanks for your detailed explanation. I didn't consider that the > > classification of a RAS error as containable vs. non-containable > > depended on where we take the exception. > > Will makes the point over on patch 11 that until we have different handling for > these different classifications of error, there isn't much point doing this now. > (i.e. we treat an error generated here, or when we enter the guest in the same way). > > I was trying to keep my eye on what we need for kernel-first support, so we > don't have to change the code twice, we just expand the error handling to do better. I figured as much... > > I'll drop this patch for now, it will come back if/when we get kernel-first > support for RAS. Either way is fine from my point of view. > > > What about firmware-first? Firmware can always take these errors when the normal > world is running. Dropping the barrier means its up to the CPU when any error > gets reported, if firmware has to use NOTIFY_SEI it will have to do a reboot if > the error occurs during world-switch (as SError is masked). If an error spreads > over this boundary, that's just tough-luck, the kernel would have panic'd anyway. > > Does a non-secure esb() cause the error to be delivered to firmware on the secure side if anything is pending? I'm not sure I fully understand the interaction between issuing an esb() in non-secure and firmware handling a RAS error; I thought there would be none, and that this was only for kernel-first ? Thanks, -Christoffer _______________________________________________ kvmarm mailing list kvmarm@xxxxxxxxxxxxxxxxxxxxx https://lists.cs.columbia.edu/mailman/listinfo/kvmarm