+ Wei-Lin Chang, who spotted something similar 3 weeks ago that I
didn't manage to investigate in time.

On Sun, 26 Jan 2025 15:25:39 +0000,
Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx> wrote:
> 
> 
> Hi Marc,
> 
> Thank you for these patches. We (myself and Dmytro Terletskyi) are
> trying to use this series to bring up Xen on the Amazon Graviton 4
> platform. Graviton 4 is built on Neoverse V2 cores and does **not**
> support FEAT_ECV. It looks like we have found an issue with this
> particular patch on this particular setup.
> 
> Marc Zyngier <maz@xxxxxxxxxx> writes:
> 
> > Emulating the timers with FEAT_NV2 is a bit odd, as the timers
> > can be reconfigured behind our back without the hypervisor even
> > noticing. In the VHE case, that's an actual regression in the
> > architecture...
> >
> > Co-developed-by: Christoffer Dall <christoffer.dall@xxxxxxx>
> > Signed-off-by: Christoffer Dall <christoffer.dall@xxxxxxx>
> > Signed-off-by: Marc Zyngier <maz@xxxxxxxxxx>
> > ---
> >  arch/arm64/kvm/arch_timer.c  | 44 ++++++++++++++++++++++++++++++++++++
> >  arch/arm64/kvm/arm.c         |  3 +++
> >  include/kvm/arm_arch_timer.h |  1 +
> >  3 files changed, 48 insertions(+)
> >
> > diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
> > index 1215df5904185..ee5f732fbbece 100644
> > --- a/arch/arm64/kvm/arch_timer.c
> > +++ b/arch/arm64/kvm/arch_timer.c
> > @@ -905,6 +905,50 @@ void kvm_timer_vcpu_put(struct kvm_vcpu *vcpu)
> >  	kvm_timer_blocking(vcpu);
> >  }
> >
> > +void kvm_timer_sync_nested(struct kvm_vcpu *vcpu)
> > +{
> > +	/*
> > +	 * When NV2 is on, guest hypervisors have their EL1 timer register
> > +	 * accesses redirected to the VNCR page. Any guest action taken on
> > +	 * the timer is postponed until the next exit, leading to a very
> > +	 * poor quality of emulation.
> > +	 */
> > +	if (!is_hyp_ctxt(vcpu))
> > +		return;
> > +
> > +	if (!vcpu_el2_e2h_is_set(vcpu)) {
> > +		/*
> > +		 * A non-VHE guest hypervisor doesn't have any direct access
> > +		 * to its timers: the EL2 registers trap (and the HW is
> > +		 * fully emulated), while the EL0 registers access memory
> > +		 * despite the access being notionally direct. Boo.
> > +		 *
> > +		 * We update the hardware timer registers with the
> > +		 * latest value written by the guest to the VNCR page
> > +		 * and let the hardware take care of the rest.
> > +		 */
> > +		write_sysreg_el0(__vcpu_sys_reg(vcpu, CNTV_CTL_EL0), SYS_CNTV_CTL);
> > +		write_sysreg_el0(__vcpu_sys_reg(vcpu, CNTV_CVAL_EL0), SYS_CNTV_CVAL);
> > +		write_sysreg_el0(__vcpu_sys_reg(vcpu, CNTP_CTL_EL0), SYS_CNTP_CTL);
> > +		write_sysreg_el0(__vcpu_sys_reg(vcpu, CNTP_CVAL_EL0), SYS_CNTP_CVAL);
> 
> Here you are overwriting the trapped/emulated state of the EL2 vtimer
> with the EL0 vtimer, which renders all writes to the EL2 timer
> registers useless.
> 
> This is the behavior we observed:
> 
> 1. Xen writes to CNTHP_CVAL_EL2, which is trapped and handled in
>    kvm_arm_timer_write_sysreg().
> 
> 2. timer_set_cval() updates __vcpu_sys_reg(vcpu, CNTHP_CVAL_EL2).
> 
> 3. timer_restore_state() updates the real CNTP_CVAL_EL0 with the value
>    from __vcpu_sys_reg(vcpu, CNTHP_CVAL_EL2).
> 
> (so far so good)
> 
> 4. kvm_timer_sync_nested() is called, and it updates the real
>    CNTP_CVAL_EL0 with __vcpu_sys_reg(vcpu, CNTP_CVAL_EL0), overwriting
>    the value we got from Xen.
> 
> The same holds for the other hypervisor timer registers, of course.
> 
> I am wondering: what is the correct fix for this issue?
> 
> Also, we are observing issues with timers in Dom0 which seem related
> to this, but we haven't pinpointed the exact problem yet.
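To restate the sequence you describe in steps 1-4 as a minimal,
self-contained model (the functions and variables below are simplified,
hypothetical stand-ins for the KVM helpers you name, not the actual
kernel code):

/* Userspace model of the overwrite described above -- hypothetical names */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Shadow (VNCR/emulated) copies of the guest timer registers */
static uint64_t shadow_cnthp_cval_el2;	/* written via the EL2 trap path */
static uint64_t shadow_cntp_cval_el0;	/* the guest's direct EL0 view   */

/* The single physical comparator backing both views */
static uint64_t hw_cntp_cval_el0;

/* Steps 1+2: the trapped CNTHP_CVAL_EL2 write lands in the shadow state */
static void trap_write_cnthp_cval_el2(uint64_t val)
{
	shadow_cnthp_cval_el2 = val;
}

/* Step 3: timer_restore_state() loads the EL2 value into the hardware */
static void model_restore_state(void)
{
	hw_cntp_cval_el0 = shadow_cnthp_cval_el2;
}

/* Step 4: kvm_timer_sync_nested() clobbers it with the stale EL0 copy */
static void model_sync_nested(void)
{
	hw_cntp_cval_el0 = shadow_cntp_cval_el0;
}

int main(void)
{
	shadow_cntp_cval_el0 = 100;	/* old EL0 timer programming */
	trap_write_cnthp_cval_el2(500);	/* Xen arms its EL2 timer    */
	model_restore_state();
	model_sync_nested();

	/* Prints 100: the value Xen wrote (500) has been lost */
	printf("hw CNTP_CVAL_EL0 = %" PRIu64 "\n", hw_cntp_cval_el0);
	return 0;
}

Built as a normal userspace program, this prints 100 rather than 500,
mirroring how the value Xen programmed through the CNTHP_CVAL_EL2 trap
is clobbered on exit by the stale EL0 shadow copy.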
Thanks for the great debug above, much appreciated. As Wei-Lin pointed
out in their email[1], there is a copious amount of nonsense here.

This is due to leftovers from the mix of NV+NV2 that KVM was initially
trying to handle before switching to NV2 only. The whole VHE vs nVHE
split makes no sense at all, and both should have the same behaviour.
The only difference is around what gets trapped, and what doesn't.

Finally, this crap is masking a subtle bug in timer_emulate(), where we
return too early when updating the IRQ state, hence failing to publish
the interrupt state.

Could you please give the hack below a go with your setup and report
whether it solves this particular issue?

diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
index 0e29958e20187..56f4905cdb859 100644
--- a/arch/arm64/kvm/arch_timer.c
+++ b/arch/arm64/kvm/arch_timer.c
@@ -471,10 +471,8 @@ static void timer_emulate(struct arch_timer_context *ctx)
 
 	trace_kvm_timer_emulate(ctx, should_fire);
 
-	if (should_fire != ctx->irq.level) {
+	if (should_fire != ctx->irq.level)
 		kvm_timer_update_irq(ctx->vcpu, should_fire, ctx);
-		return;
-	}
 
 	kvm_timer_update_status(ctx, should_fire);
 
@@ -976,31 +974,21 @@ void kvm_timer_sync_nested(struct kvm_vcpu *vcpu)
 	 * which allows trapping of the timer registers even with NV2.
 	 * Still, this is still worse than FEAT_NV on its own. Meh.
 	 */
-	if (!vcpu_el2_e2h_is_set(vcpu)) {
-		if (cpus_have_final_cap(ARM64_HAS_ECV))
-			return;
-
-		/*
-		 * A non-VHE guest hypervisor doesn't have any direct access
-		 * to its timers: the EL2 registers trap (and the HW is
-		 * fully emulated), while the EL0 registers access memory
-		 * despite the access being notionally direct. Boo.
-		 *
-		 * We update the hardware timer registers with the
-		 * latest value written by the guest to the VNCR page
-		 * and let the hardware take care of the rest.
-		 */
-		write_sysreg_el0(__vcpu_sys_reg(vcpu, CNTV_CTL_EL0), SYS_CNTV_CTL);
-		write_sysreg_el0(__vcpu_sys_reg(vcpu, CNTV_CVAL_EL0), SYS_CNTV_CVAL);
-		write_sysreg_el0(__vcpu_sys_reg(vcpu, CNTP_CTL_EL0), SYS_CNTP_CTL);
-		write_sysreg_el0(__vcpu_sys_reg(vcpu, CNTP_CVAL_EL0), SYS_CNTP_CVAL);
-	} else {
+	if (!cpus_have_final_cap(ARM64_HAS_ECV)) {
 		/*
 		 * For a VHE guest hypervisor, the EL2 state is directly
-		 * stored in the host EL1 timers, while the emulated EL0
+		 * stored in the host EL1 timers, while the emulated EL1
 		 * state is stored in the VNCR page. The latter could have
 		 * been updated behind our back, and we must reset the
 		 * emulation of the timers.
+		 *
+		 * A non-VHE guest hypervisor doesn't have any direct access
+		 * to its timers: the EL2 registers trap despite being
+		 * notionally direct (we use the EL1 HW, as for VHE), while
+		 * the EL1 registers access memory.
+		 *
+		 * In both cases, process the emulated timers on each guest
+		 * exit. Boo.
 		 */
 		struct timer_map map;
 		get_timer_map(vcpu, &map);

Thanks,

	M.

[1] https://lore.kernel.org/r/fqiqfjzwpgbzdtouu2pwqlu7llhnf5lmy4hzv5vo6ph4v3vyls@jdcfy3fjjc5k

-- 
Without deviation from the norm, progress is not possible.