On Fri, Feb 26, 2021 at 06:35:42PM +0000, Marc Zyngier wrote:
> On 2021-02-26 18:12, Will Deacon wrote:
> > Commit 7db21530479f ("KVM: arm64: Restore hyp when panicking in guest
> > context") tracks the currently running vCPU, clearing the pointer to
> > NULL on exit from a guest.
> >
> > Unfortunately, the use of 'set_loaded_vcpu' clobbers x1 to point at the
> > kvm_hyp_ctxt instead of the vCPU context, causing the subsequent RAS
> > code to go off into the weeds when it saves the DISR assuming that the
> > CPU context is embedded in a struct vCPU.
> >
> > Leave x1 alone and use x3 as a temporary register instead when clearing
> > the vCPU on the guest exit path.
> >
> > Cc: Marc Zyngier <maz@xxxxxxxxxx>
> > Cc: Andrew Scull <ascull@xxxxxxxxxx>
> > Cc: <stable@xxxxxxxxxxxxxxx>
> > Fixes: 7db21530479f ("KVM: arm64: Restore hyp when panicking in guest
> > context")
> > Suggested-by: Quentin Perret <qperret@xxxxxxxxxx>
> > Signed-off-by: Will Deacon <will@xxxxxxxxxx>
> > ---
> >
> > This was pretty awful to debug!
> >
> >  arch/arm64/kvm/hyp/entry.S | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/arch/arm64/kvm/hyp/entry.S b/arch/arm64/kvm/hyp/entry.S
> > index b0afad7a99c6..0c66a1d408fd 100644
> > --- a/arch/arm64/kvm/hyp/entry.S
> > +++ b/arch/arm64/kvm/hyp/entry.S
> > @@ -146,7 +146,7 @@ SYM_INNER_LABEL(__guest_exit, SYM_L_GLOBAL)
> >  	// Now restore the hyp regs
> >  	restore_callee_saved_regs x2
> >
> > -	set_loaded_vcpu xzr, x1, x2
> > +	set_loaded_vcpu xzr, x2, x3
> >
> >  alternative_if ARM64_HAS_RAS_EXTN
> >  	// If we have the RAS extensions we can consume a pending error
>
> Grmbl... How comes we have never seen that for the past 5 months,
> including on CPUs that implement RAS?

I think it's probably a combination of (a) not having a massive testing
community (b) not having tools that would scream about this (e.g. I don't
think you could detect this with KASAN) and (c) the nature of the
corruption being mostly benign in practice.
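
For anyone following along, the clobber from the quoted commit message can
be modelled in plain C. This is only a toy sketch: the struct layouts, the
field names and the helpers save_disr() and vtcr_after_save() are all
hypothetical and are not the kernel's real definitions. It assumes the RAS
path stores the DISR at a fixed offset from the context pointer held in x1,
and that some unrelated state (called vtcr here) happens to sit at that same
offset after the hyp context.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified layouts: NOT the kernel's real structures. */
struct cpu_ctxt {
	uint64_t regs[4];
};

struct vcpu {
	struct cpu_ctxt ctxt;	/* guest context embedded at the start of the vCPU */
	uint64_t disr;		/* where the RAS path saves the deferred error status */
};

struct hyp_percpu {
	struct cpu_ctxt hyp_ctxt;	/* hyp context, not embedded in any vCPU */
	uint64_t vtcr;			/* unrelated state that happens to follow it */
};

/* Distance from the embedded context to 'disr', as the RAS path assumes. */
#define DISR_OFFSET (offsetof(struct vcpu, disr) - offsetof(struct vcpu, ctxt))

/* Models the RAS save: trusts that 'ctxt' is the context inside a struct vcpu. */
static void save_disr(unsigned char *ctxt, uint64_t disr)
{
	memcpy(ctxt + DISR_OFFSET, &disr, sizeof(disr));
}

/* Returns the vtcr value after the save, depending on what x1 points at. */
static uint64_t vtcr_after_save(int x1_clobbered)
{
	struct vcpu vcpu = { .disr = 0 };
	struct hyp_percpu hyp = { .vtcr = 0x1234 };
	unsigned char *x1 = x1_clobbered ? (unsigned char *)&hyp
					 : (unsigned char *)&vcpu;

	/* The toy layout deliberately places vtcr where disr would be. */
	assert(offsetof(struct hyp_percpu, vtcr) == DISR_OFFSET);

	save_disr(x1, 0xdead);
	return hyp.vtcr;
}
```

In this model vtcr_after_save(0) returns 0x1234 (the save lands in the vCPU
as intended), while vtcr_after_save(1) returns 0xdead: the store silently
overwrites whatever sits behind the hyp context, with nothing to flag it at
the time of the write.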

We found it in pKVM development because it landed on the vtcr we were
restoring when coming out of suspend, which then meant the page-table code
went wonky on the next stage-2 fault because it got the wrong start level
and kept returning -EAGAIN because it thought a table was a leaf. So even
then, the failure mode is horribly subtle.

Will