Re: [PATCH v2 13/17] KVM: arm64: nv: Add SW walker for AT S1 emulation

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Alex,

On Fri, 09 Aug 2024 13:43:33 +0100,
Alexandru Elisei <alexandru.elisei@xxxxxxx> wrote:
> 
> Hi Marc,
> 
> Finally managed to go through the patch. Bunch of nitpicks from me (can be
> safely ignored), and some corner cases where KVM deviates from the spec.

Thanks for taking the time to go through this mess.

[...]

> > +		break;
> > +	default:
> > +		return (vcpu_el2_e2h_is_set(vcpu) &&
> > +			vcpu_el2_tge_is_set(vcpu)) ? TR_EL20 : TR_EL10;
> 
> This also looks correct to me. Following the pseudocode was not trivial, so I'm
> leaving this here in case I made a mistake somewhere.
> 
> For the S1E0* variants: AArch64.AT(el_in=EL0) => TranslationRegime(el=EL0) =>
> ELIsInHost(el=EL0); ELIsInHost(el=EL0) is true if {E2H,TGE} = {1,1}, and in this
> case TranslationRegime(el=EL0) returns Regime_EL20, otherwise Regime_EL10.
> 
> For the S1E1* variants: AArch64.AT(el_in=EL1), where:
> 
> - if ELIsInHost(el=EL0) is true, then 'el' is changed to EL2, where
>   ELIsInHost(el=EL0) is true if {E2H,TGE} = {1,1}. In this case,
>   TranslationRegime(el=EL2) will return Regime_EL20.
> 
> - if ELIsInHost(el=EL0) is false, then 'el' remains EL1, and
>   TranslationRegime(el=EL1) returns Regime_EL10.

Yup. Somehow, the C version is far easier to understand!

> 
> > +	}
> > +}
> > +
> > +static int setup_s1_walk(struct kvm_vcpu *vcpu, u32 op, struct s1_walk_info *wi,
> > +			 struct s1_walk_result *wr, u64 va)
> > +{
> > +	u64 sctlr, tcr, tg, ps, ia_bits, ttbr;
> > +	unsigned int stride, x;
> > +	bool va55, tbi, lva, as_el0;
> > +
> > +	wi->regime = compute_translation_regime(vcpu, op);
> > +	as_el0 = (op == OP_AT_S1E0R || op == OP_AT_S1E0W);
> > +
> > +	va55 = va & BIT(55);
> > +
> > +	if (wi->regime == TR_EL2 && va55)
> > +		goto addrsz;
> > +
> > +	wi->s2 = wi->regime == TR_EL10 && (__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_VM);
> 
> According to AArch64.NSS2TTWParams(), stage 2 is enabled if HCR_EL2.VM or
> HCR_EL2.DC.
> 
> From the description of HCR_EL2.DC, when the bit is set:
> 
> 'The PE behaves as if the value of the HCR_EL2.VM field is 1 for all purposes
> other than returning the value of a direct read of HCR_EL2.'

Yup, good point. And as you noticed further down, the HCR_EL2.DC
handling is currently a mess.

[...]

> > +	/* Let's put the MMU disabled case aside immediately */
> > +	if (!(sctlr & SCTLR_ELx_M) ||
> > +	    (__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_DC)) {
> 
> I think the condition doesn't match AArch64.S1Enabled():
> 
> - if regime is Regime_EL20 or Regime_EL2, then S1 is disabled if and only if
>   SCTLR_EL2.M is not set; it doesn't matter if HCR_EL2.DC is set, because "[..]
>   this field has no effect on the EL2, EL2&0, and EL3 translation regimes."
>   (HCR_EL2.DC bit field description).
> 
> - if regime is Regime_EL10, then S1 is disabled if SCTLR_EL1.M == 0 or
>   HCR_EL2.TGE = 0 or the effective value of HCR_EL2.DC* is 0.
> 
> From the description of HCR_EL2.TGE, when the bit is set:
> 
> 'If EL1 is using AArch64, the SCTLR_EL1.M field is treated as being 0 for all
> purposes other than returning the result of a direct read of SCTLR_EL1.'
> 
> From the description of HCR_EL2.DC, when the bit is set:
> 
> 'When EL1 is using AArch64, the PE behaves as if the value of the SCTLR_EL1.M
> field is 0 for all purposes other than returning the value of a direct read of
> SCTLR_EL1.'
> 
> *The description of HCR_EL2.DC states:
> 
> 'When the Effective value of HCR_EL2.{E2H,TGE} is {1,1}, this field behaves as
> 0 for all purposes other than a direct read of the value of this field.'
> 
> But if {E2H,TGE} = {1,1} then the regime is Regime_EL20, which ignores the DC
> bit.  If we're looking at the DC bit, then that means that the regime is EL10,
> ({E2H,TGE} != {1,1} in compute_translation_regime()) and the effective value of
> HCR_EL2.DC is the same as the DC bit.

Yup. That's what I've stashed on top:

@@ -136,12 +137,22 @@ static int setup_s1_walk(struct kvm_vcpu *vcpu, u32 op, struct s1_walk_info *wi,
 	va = (u64)sign_extend64(va, 55);
 
 	/* Let's put the MMU disabled case aside immediately */
-	if (!(sctlr & SCTLR_ELx_M) ||
-	    (__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_DC)) {
+	switch (wi->regime) {
+	case TR_EL10:
+		if (__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_DC)
+			wr->level = S1_MMU_DISABLED;
+		fallthrough;
+	case TR_EL2:
+	case TR_EL20:
+		if (!(sctlr & SCTLR_ELx_M))
+			wr->level = S1_MMU_DISABLED;
+		break;
+	}
+
+	if (wr->level == S1_MMU_DISABLED) {
 		if (va >= BIT(kvm_get_pa_bits(vcpu->kvm)))
 			goto addrsz;
 
-		wr->level = S1_MMU_DISABLED;
 		wr->pa = va;
 		return 0;
 	}

[...]

> > +	/* R_PLCGL, R_YXNYW */
> > +	if ((!kvm_has_feat_enum(vcpu->kvm, ID_AA64MMFR2_EL1, ST, 48_47) &&
> > +	     wi->txsz > 39) ||
> > +	    (wi->pgshift == 16 && wi->txsz > 47) || wi->txsz > 48)
> > +		goto transfault_l0;
> 
> It's probably just me, but I find this hard to parse. There are three cases when
> the condition (!FEAT_TTST && TxSZ > 39) evaluates to false. But the other two
> conditions make sense to check only if !FEAT_TTST is false and wi->txsz > 39 is
> true.
> 
> I find this easier to read:
> 
> 	if (!kvm_has_feat_enum(vcpu->kvm, ID_AA64MMFR2_EL1, ST, 48_47)) {
> 		if (wi->txsz > 39)
> 			goto transfault_l0;
> 	} else {
> 		if (wi->txsz > 48 || (wi->pgshift == 16 && wi->txsz > 47))
> 			goto transfault_l0;
> 	}
> 
> What do you think?

Sure, happy to rewrite it like this if it helps.

[...]

> > +	/* R_YYVYV, I_THCZK */
> > +	if ((!va55 && va > GENMASK(ia_bits - 1, 0)) ||
> > +	    (va55 && va < GENMASK(63, ia_bits)))
> > +		goto transfault_l0;
> > +
> > +	/* R_ZFSYQ */
> 
> This is rather pedantic, but I think that should be I_ZFSYQ.

Well, these references are supposed to help. If they are incorrect,
they serve no purpose. Thanks for spotting this.

[...]

> > +		if (!(desc & 1) || ((desc & 3) == 1 && level == 3))
> > +			goto transfault;
> 
> The check for block mapping at level 3 is replicated below, when the level of
> the block is checked for correctnes.
> 
> Also, would you consider rewriting the valid descriptor check to
> (desc & BIT(0)), to match the block level check?
> 
> > +
> > +		/* We found a leaf, handle that */
> > +		if ((desc & 3) == 1 || level == 3)
> > +			break;
> 
> Here we know that (desc & 1), do you think rewriting the check to !(desc &
> BIT(1)), again matching the block level check, would be a good idea?

Other that the BIT() stuff, I'm not completely clear what you are
asking for. Are you objecting to the fact that the code is slightly
redundant? If so, I may be able to clean things up for you:

@@ -309,13 +323,19 @@ static int walk_s1(struct kvm_vcpu *vcpu, struct s1_walk_info *wi,
 		else
 			desc = le64_to_cpu((__force __le64)desc);
 
-		if (!(desc & 1) || ((desc & 3) == 1 && level == 3))
+		/* Invalid descriptor */
+		if (!(desc & BIT(0)))
 			goto transfault;
 
-		/* We found a leaf, handle that */
-		if ((desc & 3) == 1 || level == 3)
+		/* Block mapping, check validity down the line */
+		if (!(desc & BIT(1)))
+			break;
+
+		/* Page mapping */
+		if (level == 3)
 			break;
 
+		/* Table handling */
 		if (!wi->hpd) {
 			wr->APTable  |= FIELD_GET(S1_TABLE_AP, desc);
 			wr->UXNTable |= FIELD_GET(PMD_TABLE_UXN, desc);

Do you like this one better? Each bit only gets tested once.

[...]

> > +
> > +	if (check_output_size(desc & GENMASK(47, va_bottom), wi))
> > +		goto addrsz;
> > +
> > +	wr->failed = false;
> > +	wr->level = level;
> > +	wr->desc = desc;
> > +	wr->pa = desc & GENMASK(47, va_bottom);
> > +	if (va_bottom > 12)
> > +		wr->pa |= va & GENMASK_ULL(va_bottom - 1, 12);
> 
> I'm looking at StageOA(), and I don't think this matches what happens when the
> contiguous bit is set and the contiguous OA isn't aligned as per Table D8-98.
> Yes, I know, that's something super niche and unlikely to happen in practice.
> 
> Let's assume 4K pages, level = 3 (so va_bottom = 12), the first page in the
> contiguous OA range is 0x23_000 (so not aligned to 64K), and the input address
> that maps to the first page from the contiguous OA range is 0xf0_000 (aligned to
> 64K).
> 
> According to the present code:
> 
> wr->pa = 0x23_000
> 
> According to StageOA():
> 
> tsize  = 12
> csize  = 4
> oa     = 0x23_000 & GENMASK_ULL(47, 16) | 0xf0_000 & GENMASK_ULL(15, 0) = 0x20_000
> wr->pa = oa & ~GENMASK_ULL(11, 0) = 0x20_000
> 
> If the stage 1 is correctly programmed and the OA aligned to the required
> alignment, the two outputs match

Huh, that's another nice catch. I had the (dumb) idea that if we
didn't use the Contiguous bit as a TLB hint, we didn't need to do
anything special when it came to the alignment of the OA.

But clearly, the spec says that this alignment must be honoured, and
there is no way out of it. Which means that the S2 walker also has a
bit of a problem on that front.

> On a different topic, but still regarding wr->pa. I guess the code aligns wr->pa
> to 4K because that's how the OA in PAR_EL1 is reported.
> 
> Would it make more sense to have wr->pa represent the real output address, i.e,
> also contain the 12 least significant bits of the input address?  It wouldn't
> change how PAR_EL1 is computed (bits 11:0 are already masked out), but it would
> make wr->pa consistent with what the output address of a given input address
> should be (according to StageOA()).

Yup. That'd be consistent with the way wr->pa is reported when S1 is disabled.

I ended-up with this (untested):

@@ -354,12 +374,24 @@ static int walk_s1(struct kvm_vcpu *vcpu, struct s1_walk_info *wi,
 	if (check_output_size(desc & GENMASK(47, va_bottom), wi))
 		goto addrsz;
 
+	if (desc & PTE_CONT) {
+		switch (BIT(wi->pgshift)) {
+		case SZ_4K:
+			va_bottom += 4;
+			break;
+		case SZ_16K:
+			va_bottom += level == 2 ? 5 : 7;
+			break;
+		case SZ_64K:
+			va_bottom += 5;
+		}
+	}
+
 	wr->failed = false;
 	wr->level = level;
 	wr->desc = desc;
 	wr->pa = desc & GENMASK(47, va_bottom);
-	if (va_bottom > 12)
-		wr->pa |= va & GENMASK_ULL(va_bottom - 1, 12);
+	wr->pa |= va & GENMASK_ULL(va_bottom - 1, 0);
 
 	return 0;
 

Note that I'm still checking for the address size before the
contiguous bit alignment, as per R_JHQPP.

[...]

> > +		if (!(__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_DC)) {
> > +			par |= FIELD_PREP(SYS_PAR_EL1_ATTR, 0); /* nGnRnE */
> > +			par |= FIELD_PREP(SYS_PAR_EL1_SH, 0b10); /* OS */
> 
> s/0b10/ATTR_OSH ?
> 
> > +		} else {
> > +			par |= FIELD_PREP(SYS_PAR_EL1_ATTR,
> > +					  MEMATTR(WbRaWa, WbRaWa));
> > +			par |= FIELD_PREP(SYS_PAR_EL1_SH, 0b00); /* NS */
> 
> s/0b00/ATTR_NSH ?
> 
> > +		}
> 
> HCR_EL2.DC applies only to Regime_EL10, I think the bit should be ignored for
> the EL2 and EL20 regimes.

Yup, now fixed.

> 
> > +	} else {
> > +		u64 mair, sctlr;
> > +		u8 sh;
> > +
> > +		mair = (regime == TR_EL10 ?
> > +			vcpu_read_sys_reg(vcpu, MAIR_EL1) :
> > +			vcpu_read_sys_reg(vcpu, MAIR_EL2));
> > +
> > +		mair >>= FIELD_GET(PTE_ATTRINDX_MASK, wr->desc) * 8;
> > +		mair &= 0xff;
> > +
> > +		sctlr = (regime == TR_EL10 ?
> > +			 vcpu_read_sys_reg(vcpu, SCTLR_EL1) :
> > +			 vcpu_read_sys_reg(vcpu, SCTLR_EL2));
> > +
> > +		/* Force NC for memory if SCTLR_ELx.C is clear */
> > +		if (!(sctlr & SCTLR_EL1_C) && !MEMATTR_IS_DEVICE(mair))
> > +			mair = MEMATTR(NC, NC);
> > +
> > +		par  = FIELD_PREP(SYS_PAR_EL1_ATTR, mair);
> > +		par |= wr->pa & GENMASK_ULL(47, 12);
> > +
> > +		sh = compute_sh(mair, wr->desc);
> > +		par |= FIELD_PREP(SYS_PAR_EL1_SH, sh);
> > +	}
> 
> When PAR_EL1.F = 0 and !FEAT_RME, bit 11 (NSE) is RES1, according to the
> description of the register and EncodePAR().

Fixed.

[...]

> >  void __kvm_at_s1e01(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
> >  {
> >  	u64 par = __kvm_at_s1e01_fast(vcpu, op, vaddr);
> >  
> > +	/*
> > +	 * If we see a permission fault at S1, then the fast path
> > +	 * succeedded. By failing. Otherwise, take a walk on the wild
> > +	 * side.
> 
> This is rather ambiguous. Maybe something along the lines would be more helpful:
> 
> 'If PAR_EL1 encodes a permission fault, we know for sure that the hardware
> translation table walker was able to walk the stage 1 tables and there's nothing
> else that KVM needs to do. On the other hand, if the AT instruction failed for
> any other reason, then KVM must walk the guest stage 1 tables (and possibly the
> virtual stage 2 tables) to emulate the instruction.'

Sure. I've adopted a slightly less verbose version of this:

@@ -930,9 +966,12 @@ void __kvm_at_s1e01(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
 	u64 par = __kvm_at_s1e01_fast(vcpu, op, vaddr);
 
 	/*
-	 * If we see a permission fault at S1, then the fast path
-	 * succeedded. By failing. Otherwise, take a walk on the wild
-	 * side.
+	 * If PAR_EL1 reports that AT failed on a S1 permission fault, we
+	 * know for sure that the PTW was able to walk the S1 tables and
+	 * there's nothing else to do.
+	 *
+	 * If AT failed for any other reason, then we must walk the guest S1
+	 * to emulate the instruction.
 	 */
 	if ((par & SYS_PAR_EL1_F) && !par_check_s1_perm_fault(par))
 		par = handle_at_slow(vcpu, op, vaddr);


I'll retest things over the weekend and post a new version early next
week.

Thanks again for your review, this is awesome work!

	M.

-- 
Without deviation from the norm, progress is not possible.




[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux