Re: [PATCH v2 13/17] KVM: arm64: nv: Add SW walker for AT S1 emulation

Hi Marc,

On Sat, Aug 10, 2024 at 11:16:15AM +0100, Marc Zyngier wrote:
> Hi Alex,
> 
> On Fri, 09 Aug 2024 13:43:33 +0100,
> Alexandru Elisei <alexandru.elisei@xxxxxxx> wrote:
> > 
> > Hi Marc,
> > 
> > Finally managed to go through the patch. Bunch of nitpicks from me (can be
> > safely ignored), and some corner cases where KVM deviates from the spec.
> 
> Thanks for taking the time to go through this mess.
> 
> [...]
> 
> > > +		break;
> > > +	default:
> > > +		return (vcpu_el2_e2h_is_set(vcpu) &&
> > > +			vcpu_el2_tge_is_set(vcpu)) ? TR_EL20 : TR_EL10;
> > 
> > This also looks correct to me. Following the pseudocode was not trivial, so I'm
> > leaving this here in case I made a mistake somewhere.
> > 
> > For the S1E0* variants: AArch64.AT(el_in=EL0) => TranslationRegime(el=EL0) =>
> > ELIsInHost(el=EL0); ELIsInHost(el=EL0) is true if {E2H,TGE} = {1,1}, and in this
> > case TranslationRegime(el=EL0) returns Regime_EL20, otherwise Regime_EL10.
> > 
> > For the S1E1* variants: AArch64.AT(el_in=EL1), where:
> > 
> > - if ELIsInHost(el=EL0) is true, then 'el' is changed to EL2, where
> >   ELIsInHost(el=EL0) is true if {E2H,TGE} = {1,1}. In this case,
> >   TranslationRegime(el=EL2) will return Regime_EL20.
> > 
> > - if ELIsInHost(el=EL0) is false, then 'el' remains EL1, and
> >   TranslationRegime(el=EL1) returns Regime_EL10.
> 
> Yup. Somehow, the C version is far easier to understand!
> 
> > 
> > > +	}
> > > +}
> > > +
> > > +static int setup_s1_walk(struct kvm_vcpu *vcpu, u32 op, struct s1_walk_info *wi,
> > > +			 struct s1_walk_result *wr, u64 va)
> > > +{
> > > +	u64 sctlr, tcr, tg, ps, ia_bits, ttbr;
> > > +	unsigned int stride, x;
> > > +	bool va55, tbi, lva, as_el0;
> > > +
> > > +	wi->regime = compute_translation_regime(vcpu, op);
> > > +	as_el0 = (op == OP_AT_S1E0R || op == OP_AT_S1E0W);
> > > +
> > > +	va55 = va & BIT(55);
> > > +
> > > +	if (wi->regime == TR_EL2 && va55)
> > > +		goto addrsz;
> > > +
> > > +	wi->s2 = wi->regime == TR_EL10 && (__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_VM);
> > 
> > According to AArch64.NSS2TTWParams(), stage 2 is enabled if HCR_EL2.VM or
> > HCR_EL2.DC.
> > 
> > From the description of HCR_EL2.DC, when the bit is set:
> > 
> > 'The PE behaves as if the value of the HCR_EL2.VM field is 1 for all purposes
> > other than returning the value of a direct read of HCR_EL2.'
> 
> Yup, good point. And as you noticed further down, the HCR_EL2.DC
> handling is currently a mess.
> 
> [...]
> 
> > > +	/* Let's put the MMU disabled case aside immediately */
> > > +	if (!(sctlr & SCTLR_ELx_M) ||
> > > +	    (__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_DC)) {
> > 
> > I think the condition doesn't match AArch64.S1Enabled():
> > 
> > - if regime is Regime_EL20 or Regime_EL2, then S1 is disabled if and only if
> >   SCTLR_EL2.M is not set; it doesn't matter if HCR_EL2.DC is set, because "[..]
> >   this field has no effect on the EL2, EL2&0, and EL3 translation regimes."
> >   (HCR_EL2.DC bit field description).
> > 
> > - if regime is Regime_EL10, then S1 is disabled if SCTLR_EL1.M == 0,
> >   HCR_EL2.TGE == 1, or the effective value of HCR_EL2.DC* is 1.
> > 
> > From the description of HCR_EL2.TGE, when the bit is set:
> > 
> > 'If EL1 is using AArch64, the SCTLR_EL1.M field is treated as being 0 for all
> > purposes other than returning the result of a direct read of SCTLR_EL1.'
> > 
> > From the description of HCR_EL2.DC, when the bit is set:
> > 
> > 'When EL1 is using AArch64, the PE behaves as if the value of the SCTLR_EL1.M
> > field is 0 for all purposes other than returning the value of a direct read of
> > SCTLR_EL1.'
> > 
> > *The description of HCR_EL2.DC states:
> > 
> > 'When the Effective value of HCR_EL2.{E2H,TGE} is {1,1}, this field behaves as
> > 0 for all purposes other than a direct read of the value of this field.'
> > 
> > But if {E2H,TGE} = {1,1} then the regime is Regime_EL20, which ignores the DC
> > bit.  If we're looking at the DC bit, then that means that the regime is EL10,
> > ({E2H,TGE} != {1,1} in compute_translation_regime()) and the effective value of
> > HCR_EL2.DC is the same as the DC bit.
> 
> Yup. That's what I've stashed on top:
> 
> @@ -136,12 +137,22 @@ static int setup_s1_walk(struct kvm_vcpu *vcpu, u32 op, struct s1_walk_info *wi,
>  	va = (u64)sign_extend64(va, 55);
>  
>  	/* Let's put the MMU disabled case aside immediately */
> -	if (!(sctlr & SCTLR_ELx_M) ||
> -	    (__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_DC)) {
> +	switch (wi->regime) {
> +	case TR_EL10:
> +		if (__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_DC)
> +			wr->level = S1_MMU_DISABLED;

In compute_translation_regime(), for AT instructions other than AT S1E2*, when
{E2H,TGE} = {0,1}, the regime is Regime_EL10. As far as I can tell, when the
regime is Regime_EL10 and TGE is set, stage 1 is disabled, according to
AArch64.S1Enabled() and the description of the TGE bit, so TGE needs the same
treatment as DC here (sketch below).
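
Maybe something like this on top of your diff (untested, and assuming the
HCR_TGE mask from kvm_arm.h is the right thing to check here):

	case TR_EL10:
		/*
		 * HCR_EL2.TGE, like HCR_EL2.DC, makes SCTLR_EL1.M behave as
		 * 0 for EL1&0, so either bit turns the S1 MMU off.
		 */
		if (__vcpu_sys_reg(vcpu, HCR_EL2) & (HCR_DC | HCR_TGE))
			wr->level = S1_MMU_DISABLED;
		fallthrough;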

> +		fallthrough;
> +	case TR_EL2:
> +	case TR_EL20:
> +		if (!(sctlr & SCTLR_ELx_M))
> +			wr->level = S1_MMU_DISABLED;
> +		break;
> +	}
> +
> +	if (wr->level == S1_MMU_DISABLED) {
>  		if (va >= BIT(kvm_get_pa_bits(vcpu->kvm)))
>  			goto addrsz;
>  
> -		wr->level = S1_MMU_DISABLED;
>  		wr->pa = va;
>  		return 0;
>  	}
> 
> [...]
> 
> > > +	/* R_PLCGL, R_YXNYW */
> > > +	if ((!kvm_has_feat_enum(vcpu->kvm, ID_AA64MMFR2_EL1, ST, 48_47) &&
> > > +	     wi->txsz > 39) ||
> > > +	    (wi->pgshift == 16 && wi->txsz > 47) || wi->txsz > 48)
> > > +		goto transfault_l0;
> > 
> > It's probably just me, but I find this hard to parse. The condition
> > (!FEAT_TTST && TxSZ > 39) evaluates to false in three distinct cases, but
> > the other two conditions only matter when FEAT_TTST is implemented and
> > wi->txsz > 39.
> > 
> > I find this easier to read:
> > 
> > 	if (!kvm_has_feat_enum(vcpu->kvm, ID_AA64MMFR2_EL1, ST, 48_47)) {
> > 		if (wi->txsz > 39)
> > 			goto transfault_l0;
> > 	} else {
> > 		if (wi->txsz > 48 || (wi->pgshift == 16 && wi->txsz > 47))
> > 			goto transfault_l0;
> > 	}
> > 
> > What do you think?
> 
> Sure, happy to rewrite it like this if it helps.

Cool, thanks.

> 
> [...]
> 
> > > +	/* R_YYVYV, I_THCZK */
> > > +	if ((!va55 && va > GENMASK(ia_bits - 1, 0)) ||
> > > +	    (va55 && va < GENMASK(63, ia_bits)))
> > > +		goto transfault_l0;
> > > +
> > > +	/* R_ZFSYQ */
> > 
> > This is rather pedantic, but I think that should be I_ZFSYQ.
> 
> Well, these references are supposed to help. If they are incorrect,
> they serve no purpose. Thanks for spotting this.
> 
> [...]
> 
> > > +		if (!(desc & 1) || ((desc & 3) == 1 && level == 3))
> > > +			goto transfault;
> > 
> > The check for block mapping at level 3 is replicated below, when the level of
> > the block is checked for correctness.
> > 
> > Also, would you consider rewriting the valid descriptor check to
> > (desc & BIT(0)), to match the block level check?
> > 
> > > +
> > > +		/* We found a leaf, handle that */
> > > +		if ((desc & 3) == 1 || level == 3)
> > > +			break;
> > 
> > Here we know that (desc & 1), do you think rewriting the check to !(desc &
> > BIT(1)), again matching the block level check, would be a good idea?
> 
> Other than the BIT() stuff, I'm not completely clear what you are
> asking for. Are you objecting to the fact that the code is slightly
> redundant? If so, I may be able to clean things up for you:

Yes, I was referring to the fact that the code is slightly redundant.

> 
> @@ -309,13 +323,19 @@ static int walk_s1(struct kvm_vcpu *vcpu, struct s1_walk_info *wi,
>  		else
>  			desc = le64_to_cpu((__force __le64)desc);
>  
> -		if (!(desc & 1) || ((desc & 3) == 1 && level == 3))
> +		/* Invalid descriptor */
> +		if (!(desc & BIT(0)))
>  			goto transfault;
>  
> -		/* We found a leaf, handle that */
> -		if ((desc & 3) == 1 || level == 3)
> +		/* Block mapping, check validity down the line */
> +		if (!(desc & BIT(1)))
> +			break;
> +
> +		/* Page mapping */
> +		if (level == 3)
>  			break;
>  
> +		/* Table handling */
>  		if (!wi->hpd) {
>  			wr->APTable  |= FIELD_GET(S1_TABLE_AP, desc);
>  			wr->UXNTable |= FIELD_GET(PMD_TABLE_UXN, desc);
> 
> Do you like this one better? Each bit only gets tested once.

	switch (desc & GENMASK_ULL(1, 0)) {
	case 0b00:
	case 0b10:
		/* Invalid descriptor */
		goto transfault;
	case 0b01:
		/* Block mapping */
		goto leaf;
	default:
		/* Page mapping at level 3, table descriptor otherwise */
		if (level == 3)
			goto leaf;
	}

Is this better? Perhaps slightly easier to match against the descriptor
layouts, although it needs a 'leaf' label after the walk loop, since a bare
break would only exit the switch and fall into the table handling. I'm not
sure it's an improvement over your suggestion. Up to you, no point in
bikeshedding over it.

> 
> [...]
> 
> > > +
> > > +	if (check_output_size(desc & GENMASK(47, va_bottom), wi))
> > > +		goto addrsz;
> > > +
> > > +	wr->failed = false;
> > > +	wr->level = level;
> > > +	wr->desc = desc;
> > > +	wr->pa = desc & GENMASK(47, va_bottom);
> > > +	if (va_bottom > 12)
> > > +		wr->pa |= va & GENMASK_ULL(va_bottom - 1, 12);
> > 
> > I'm looking at StageOA(), and I don't think this matches what happens when the
> > contiguous bit is set and the contiguous OA isn't aligned as per Table D8-98.
> > Yes, I know, that's something super niche and unlikely to happen in practice.
> > 
> > Let's assume 4K pages, level = 3 (so va_bottom = 12), the first page in the
> > contiguous OA range is 0x23_000 (so not aligned to 64K), and the input address
> > that maps to the first page from the contiguous OA range is 0xf0_000 (aligned to
> > 64K).
> > 
> > According to the present code:
> > 
> > wr->pa = 0x23_000
> > 
> > According to StageOA():
> > 
> > tsize  = 12
> > csize  = 4
> > oa     = 0x23_000 & GENMASK_ULL(47, 16) | 0xf0_000 & GENMASK_ULL(15, 0) = 0x20_000
> > wr->pa = oa & ~GENMASK_ULL(11, 0) = 0x20_000
> > 
> > If stage 1 is correctly programmed and the OA is aligned to the required
> > alignment, the two outputs match.
> 
> Huh, that's another nice catch. I had the (dumb) idea that if we
> didn't use the Contiguous bit as a TLB hint, we didn't need to do
> anything special when it came to the alignment of the OA.
> 
> But clearly, the spec says that this alignment must be honoured, and
> there is no way out of it. Which means that the S2 walker also has a
> bit of a problem on that front.
> 
> > On a different topic, but still regarding wr->pa. I guess the code aligns wr->pa
> > to 4K because that's how the OA in PAR_EL1 is reported.
> > 
> > Would it make more sense to have wr->pa represent the real output address, i.e.,
> > also contain the 12 least significant bits of the input address?  It wouldn't
> > change how PAR_EL1 is computed (bits 11:0 are already masked out), but it would
> > make wr->pa consistent with what the output address of a given input address
> > should be (according to StageOA()).
> 
> Yup. That'd be consistent with the way wr->pa is reported when S1 is disabled.
> 
> I ended up with this (untested):
> 
> @@ -354,12 +374,24 @@ static int walk_s1(struct kvm_vcpu *vcpu, struct s1_walk_info *wi,
>  	if (check_output_size(desc & GENMASK(47, va_bottom), wi))
>  		goto addrsz;
>  
> +	if (desc & PTE_CONT) {
> +		switch (BIT(wi->pgshift)) {
> +		case SZ_4K:
> +			va_bottom += 4;
> +			break;
> +		case SZ_16K:
> +			va_bottom += level == 2 ? 5 : 7;
> +			break;
> +		case SZ_64K:
> +			va_bottom += 5;
> +		}
> +	}
> +

Matches ContiguousSize().
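For the record, the contiguous spans from Table D8-98, if I'm reading it
right:

	4K granule:  16 entries at any level -> va_bottom += 4
	16K granule: 32 entries at level 2   -> va_bottom += 5
	             128 entries at level 3  -> va_bottom += 7
	64K granule: 32 entries at any level -> va_bottom += 5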

>  	wr->failed = false;
>  	wr->level = level;
>  	wr->desc = desc;
>  	wr->pa = desc & GENMASK(47, va_bottom);
> -	if (va_bottom > 12)
> -		wr->pa |= va & GENMASK_ULL(va_bottom - 1, 12);
> +	wr->pa |= va & GENMASK_ULL(va_bottom - 1, 0);

Matches StageOA().
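And running your earlier example through it: with PTE_CONT set, va_bottom
becomes 16, so wr->pa = (0x23_000 & GENMASK(47, 16)) | (0xf0_000 &
GENMASK_ULL(15, 0)) = 0x20_000, which is exactly what StageOA() computes.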

>  
>  	return 0;
>  
> 
> Note that I'm still checking for the address size before the
> contiguous bit alignment, as per R_JHQPP.

Nicely spotted.

> 
> [...]
> 
> > > +		if (!(__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_DC)) {
> > > +			par |= FIELD_PREP(SYS_PAR_EL1_ATTR, 0); /* nGnRnE */
> > > +			par |= FIELD_PREP(SYS_PAR_EL1_SH, 0b10); /* OS */
> > 
> > s/0b10/ATTR_OSH ?
> > 
> > > +		} else {
> > > +			par |= FIELD_PREP(SYS_PAR_EL1_ATTR,
> > > +					  MEMATTR(WbRaWa, WbRaWa));
> > > +			par |= FIELD_PREP(SYS_PAR_EL1_SH, 0b00); /* NS */
> > 
> > s/0b00/ATTR_NSH ?
> > 
> > > +		}
> > 
> > HCR_EL2.DC applies only to Regime_EL10, I think the bit should be ignored for
> > the EL2 and EL20 regimes.
> 
> Yup, now fixed.
> 
> > 
> > > +	} else {
> > > +		u64 mair, sctlr;
> > > +		u8 sh;
> > > +
> > > +		mair = (regime == TR_EL10 ?
> > > +			vcpu_read_sys_reg(vcpu, MAIR_EL1) :
> > > +			vcpu_read_sys_reg(vcpu, MAIR_EL2));
> > > +
> > > +		mair >>= FIELD_GET(PTE_ATTRINDX_MASK, wr->desc) * 8;
> > > +		mair &= 0xff;
> > > +
> > > +		sctlr = (regime == TR_EL10 ?
> > > +			 vcpu_read_sys_reg(vcpu, SCTLR_EL1) :
> > > +			 vcpu_read_sys_reg(vcpu, SCTLR_EL2));
> > > +
> > > +		/* Force NC for memory if SCTLR_ELx.C is clear */
> > > +		if (!(sctlr & SCTLR_EL1_C) && !MEMATTR_IS_DEVICE(mair))
> > > +			mair = MEMATTR(NC, NC);
> > > +
> > > +		par  = FIELD_PREP(SYS_PAR_EL1_ATTR, mair);
> > > +		par |= wr->pa & GENMASK_ULL(47, 12);
> > > +
> > > +		sh = compute_sh(mair, wr->desc);
> > > +		par |= FIELD_PREP(SYS_PAR_EL1_SH, sh);
> > > +	}
> > 
> > When PAR_EL1.F = 0 and !FEAT_RME, bit 11 (NSE) is RES1, according to the
> > description of the register and EncodePAR().
> 
> Fixed.
> 
> [...]
> 
> > >  void __kvm_at_s1e01(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
> > >  {
> > >  	u64 par = __kvm_at_s1e01_fast(vcpu, op, vaddr);
> > >  
> > > +	/*
> > > +	 * If we see a permission fault at S1, then the fast path
> > > +	 * succeedded. By failing. Otherwise, take a walk on the wild
> > > +	 * side.
> > 
> > This is rather ambiguous. Maybe something along the lines would be more helpful:
> > 
> > 'If PAR_EL1 encodes a permission fault, we know for sure that the hardware
> > translation table walker was able to walk the stage 1 tables and there's nothing
> > else that KVM needs to do. On the other hand, if the AT instruction failed for
> > any other reason, then KVM must walk the guest stage 1 tables (and possibly the
> > virtual stage 2 tables) to emulate the instruction.'
> 
> Sure. I've adopted a slightly less verbose version of this:
> 
> @@ -930,9 +966,12 @@ void __kvm_at_s1e01(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
>  	u64 par = __kvm_at_s1e01_fast(vcpu, op, vaddr);
>  
>  	/*
> -	 * If we see a permission fault at S1, then the fast path
> -	 * succeedded. By failing. Otherwise, take a walk on the wild
> -	 * side.
> +	 * If PAR_EL1 reports that AT failed on a S1 permission fault, we
> +	 * know for sure that the PTW was able to walk the S1 tables and
> +	 * there's nothing else to do.
> +	 *
> +	 * If AT failed for any other reason, then we must walk the guest S1
> +	 * to emulate the instruction.

Looks good.

Thanks,
Alex

>  	 */
>  	if ((par & SYS_PAR_EL1_F) && !par_check_s1_perm_fault(par))
>  		par = handle_at_slow(vcpu, op, vaddr);
> 
> 
> I'll retest things over the weekend and post a new version early next
> week.
> 
> Thanks again for your review, this is awesome work!
> 
> 	M.
> 
> -- 
> Without deviation from the norm, progress is not possible.



