Re: [PATCH 13/59] KVM: arm64: nv: Handle virtual EL2 registers in vcpu_read/write_sys_reg()

Alexandru Elisei <alexandru.elisei@xxxxxxx> · Wed, 3 Jul 2019 17:32:48 +0100

On 7/3/19 4:59 PM, Marc Zyngier wrote:
> On 25/06/2019 16:18, Alexandru Elisei wrote:
>> Hi Marc,
>>
>> A question regarding this patch. This patch modifies vcpu_{read,write}_sys_reg
>> to handle virtual EL2 registers. However, the file kvm/emulate-nested.c added by
>> patch 10/59 "KVM: arm64: nv: Support virtual EL2 exceptions" already uses
>> vcpu_{read,write}_sys_reg to access EL2 registers. In my opinion, it doesn't
>> really matter which comes first because nested support is only enabled in the
>> last patch of the series, but I thought I should bring this up in case it is not
>> what you intended.
> It doesn't really matter at that stage. The only thing I'm trying to
> achieve in the middle of the series is not to break the build, and not
> to cause non-NV to fail.
>
>> On 6/21/19 10:37 AM, Marc Zyngier wrote:
>>> From: Andre Przywara <andre.przywara@xxxxxxx>
>>>
>>> KVM internally uses accessor functions when reading or writing the
>>> guest's system registers. This takes care of accessing either the stored
>>> copy or using the "live" EL1 system registers when the host uses VHE.
>>>
>>> With the introduction of virtual EL2 we add a bunch of EL2 system
>>> registers, which now must also be taken care of:
>>> - If the guest is running in vEL2, and we access an EL1 sysreg, we must
>>>   revert to the stored version of that, and not use the CPU's copy.
>>> - If the guest is running in vEL1, and we access an EL2 sysreg, we must
>>>   also use the stored version, since the CPU carries the EL1 copy.
>>> - Some EL2 system registers are supposed to affect the current execution
>>>   of the system, so we need to put them into their respective EL1
>>>   counterparts. For this we need to define a mapping between the two.
>>>   This is done using the newly introduced struct el2_sysreg_map.
>>> - Some EL2 system registers have a different format than their EL1
>>>   counterpart, so we need to translate them before writing them to the
>>>   CPU. This is done using an (optional) translate function in the map.
>>> - There are the three special registers SP_EL2, SPSR_EL2 and ELR_EL2,
>>>   which need some separate handling.
>> I see no change in this patch related to SPSR_EL2. Special handling of SPSR_EL2
>> is added in the next patch, patch 14/59 "KVM: arm64: nv: Handle SPSR_EL2 specially".
> Indeed, this needs rewriting (we ended-up splitting the SPSR stuff out
> as it was messy and not completely correct). I may take the rest of the
> special stuff out as well.
>
>>> All of these cases are now wrapped into the existing accessor functions,
>>> so KVM users wouldn't need to care whether they access EL2 or EL1
>>> registers and also which state the guest is in.
>>>
>>> This handles what was formerly known as the "shadow state" dynamically,
>>> without requiring a separate copy for each vCPU EL.
>>>
>>> Signed-off-by: Andre Przywara <andre.przywara@xxxxxxx>
>>> Signed-off-by: Marc Zyngier <marc.zyngier@xxxxxxx>
>>> ---
>>>  arch/arm64/include/asm/kvm_emulate.h |   6 +
>>>  arch/arm64/include/asm/kvm_host.h    |   5 +
>>>  arch/arm64/kvm/sys_regs.c            | 163 +++++++++++++++++++++++++++
>>>  3 files changed, 174 insertions(+)
>>>
>>> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
>>> index c43aac5fed69..f37006b6eec4 100644
>>> --- a/arch/arm64/include/asm/kvm_emulate.h
>>> +++ b/arch/arm64/include/asm/kvm_emulate.h
>>> @@ -70,6 +70,12 @@ void kvm_emulate_nested_eret(struct kvm_vcpu *vcpu);
>>>  int kvm_inject_nested_sync(struct kvm_vcpu *vcpu, u64 esr_el2);
>>>  int kvm_inject_nested_irq(struct kvm_vcpu *vcpu);
>>>  
>>> +u64 translate_tcr(u64 tcr);
>>> +u64 translate_cptr(u64 tcr);
>>> +u64 translate_sctlr(u64 tcr);
>>> +u64 translate_ttbr0(u64 tcr);
>>> +u64 translate_cnthctl(u64 tcr);
>>> +
>>>  static inline bool vcpu_el1_is_32bit(struct kvm_vcpu *vcpu)
>>>  {
>>>  	return !(vcpu->arch.hcr_el2 & HCR_RW);
>>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>>> index 2d4290d2513a..dae9c42a7219 100644
>>> --- a/arch/arm64/include/asm/kvm_host.h
>>> +++ b/arch/arm64/include/asm/kvm_host.h
>>> @@ -217,6 +217,11 @@ enum vcpu_sysreg {
>>>  	NR_SYS_REGS	/* Nothing after this line! */
>>>  };
>>>  
>>> +static inline bool sysreg_is_el2(int reg)
>>> +{
>>> +	return reg >= FIRST_EL2_SYSREG && reg < NR_SYS_REGS;
>>> +}
>>> +
>>>  /* 32bit mapping */
>>>  #define c0_MPIDR	(MPIDR_EL1 * 2)	/* MultiProcessor ID Register */
>>>  #define c0_CSSELR	(CSSELR_EL1 * 2)/* Cache Size Selection Register */
>>> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
>>> index 693dd063c9c2..d024114da162 100644
>>> --- a/arch/arm64/kvm/sys_regs.c
>>> +++ b/arch/arm64/kvm/sys_regs.c
>>> @@ -76,11 +76,142 @@ static bool write_to_read_only(struct kvm_vcpu *vcpu,
>>>  	return false;
>>>  }
>>>  
>>> +static u64 tcr_el2_ips_to_tcr_el1_ps(u64 tcr_el2)
>> The code seems to suggest that you are translating TCR_EL2.PS to TCR_EL1.IPS.
>> Perhaps the function should be named tcr_el2_ps_to_tcr_el1_ips?
> yup.
>
>>> +{
>>> +	return ((tcr_el2 & TCR_EL2_PS_MASK) >> TCR_EL2_PS_SHIFT)
>>> +		<< TCR_IPS_SHIFT;
>>> +}
>>> +
>>> +u64 translate_tcr(u64 tcr)
>>> +{
>>> +	return TCR_EPD1_MASK |				/* disable TTBR1_EL1 */
>>> +	       ((tcr & TCR_EL2_TBI) ? TCR_TBI0 : 0) |
>>> +	       tcr_el2_ips_to_tcr_el1_ps(tcr) |
>>> +	       (tcr & TCR_EL2_TG0_MASK) |
>>> +	       (tcr & TCR_EL2_ORGN0_MASK) |
>>> +	       (tcr & TCR_EL2_IRGN0_MASK) |
>>> +	       (tcr & TCR_EL2_T0SZ_MASK);
>>> +}
>>> +
>>> +u64 translate_cptr(u64 cptr_el2)
>> The argument name is not consistent with the other translate_* functions. I
>> think it is reasonably obvious that you are translating an EL2 register.
> That's pretty much immaterial, and the variable could be called zorglub.
> Consistency is good, but I don't think we need to worry about that level
> of detail.

Sure.

>
>>> +{
>>> +	u64 cpacr_el1 = 0;
>>> +
>>> +	if (!(cptr_el2 & CPTR_EL2_TFP))
>>> +		cpacr_el1 |= CPACR_EL1_FPEN;
>>> +	if (cptr_el2 & CPTR_EL2_TTA)
>>> +		cpacr_el1 |= CPACR_EL1_TTA;
>>> +	if (!(cptr_el2 & CPTR_EL2_TZ))
>>> +		cpacr_el1 |= CPACR_EL1_ZEN;
>>> +
>>> +	return cpacr_el1;
>>> +}
>>> +
>>> +u64 translate_sctlr(u64 sctlr)
>>> +{
>>> +	/* Bit 20 is RES1 in SCTLR_EL1, but RES0 in SCTLR_EL2 */
>>> +	return sctlr | BIT(20);
>>> +}
>>> +
>>> +u64 translate_ttbr0(u64 ttbr0)
>>> +{
>>> +	/* Force ASID to 0 (ASID 0 or RES0) */
>> Are you forcing ASID to 0 because you are only expecting a non-vhe guest
>> hypervisor to access ttbr0_el2, in which case the architecture says that the
>> ASID field is RES0? Is it so unlikely that a vhe guest hypervisor will access
>> ttbr0_el2 directly that it's not worth adding a check for that?
> Like all the translate_* function, this is only called when running a
> non-VHE guest so that the EL2 register is translated to the EL1 format.
> A VHE guest usually has its sysregs in the EL1 format, and certainly
> does for TTBR0_EL2.

Yeah, figured that out after I sent this patch.

>
>>> +	return ttbr0 & ~GENMASK_ULL(63, 48);
>>> +}
>>> +
>>> +u64 translate_cnthctl(u64 cnthctl)
>>> +{
>>> +	return ((cnthctl & 0x3) << 10) | (cnthctl & 0xfc);
>> Patch 16/59 "KVM: arm64: nv: Save/Restore vEL2 sysregs" suggests that you are
>> translating CNTHCTL to write it to CNTKCTL_EL1. Looking at ARM DDI 0487D.b,
>> CNTKCTL_EL1 has bits 63:10 RES0. I think the correct value should be ((cnthctl &
>> 0x3) << 8) | (cnthctl & 0xfc).
> Rookie mistake! When HCR_EL2.E2h==1 (which is always the case for NV),
> CNTKCTL_EL1 accesses CNTHCTL_EL2. What you have here is the translation
> of non-VHE CNTHCTL_EL2 to its VHE equivalent.

Indeed! Thank you for pointing it out.

>
>>> +}
>>> +
>>> +#define EL2_SYSREG(el2, el1, translate)	\
>>> +	[el2 - FIRST_EL2_SYSREG] = { el2, el1, translate }
>>> +#define PURE_EL2_SYSREG(el2) \
>>> +	[el2 - FIRST_EL2_SYSREG] = { el2,__INVALID_SYSREG__, NULL }
>>> +/*
>>> + * Associate vEL2 registers to their EL1 counterparts on the CPU.
>>> + * The translate function can be NULL, when the register layout is identical.
>>> + */
>>> +struct el2_sysreg_map {
>>> +	int sysreg;	/* EL2 register index into the array above */
>>> +	int mapping;	/* associated EL1 register */
>>> +	u64 (*translate)(u64 value);
>>> +} nested_sysreg_map[NR_SYS_REGS - FIRST_EL2_SYSREG] = {
>>> +	PURE_EL2_SYSREG( VPIDR_EL2 ),
>>> +	PURE_EL2_SYSREG( VMPIDR_EL2 ),
>>> +	PURE_EL2_SYSREG( ACTLR_EL2 ),
>>> +	PURE_EL2_SYSREG( HCR_EL2 ),
>>> +	PURE_EL2_SYSREG( MDCR_EL2 ),
>>> +	PURE_EL2_SYSREG( HSTR_EL2 ),
>>> +	PURE_EL2_SYSREG( HACR_EL2 ),
>>> +	PURE_EL2_SYSREG( VTTBR_EL2 ),
>>> +	PURE_EL2_SYSREG( VTCR_EL2 ),
>>> +	PURE_EL2_SYSREG( RVBAR_EL2 ),
>>> +	PURE_EL2_SYSREG( RMR_EL2 ),
>>> +	PURE_EL2_SYSREG( TPIDR_EL2 ),
>>> +	PURE_EL2_SYSREG( CNTVOFF_EL2 ),
>>> +	PURE_EL2_SYSREG( CNTHCTL_EL2 ),
>>> +	PURE_EL2_SYSREG( HPFAR_EL2 ),
>>> +	EL2_SYSREG(      SCTLR_EL2,  SCTLR_EL1,      translate_sctlr ),
>>> +	EL2_SYSREG(      CPTR_EL2,   CPACR_EL1,      translate_cptr  ),
>>> +	EL2_SYSREG(      TTBR0_EL2,  TTBR0_EL1,      translate_ttbr0 ),
>>> +	EL2_SYSREG(      TTBR1_EL2,  TTBR1_EL1,      NULL            ),
>>> +	EL2_SYSREG(      TCR_EL2,    TCR_EL1,        translate_tcr   ),
>>> +	EL2_SYSREG(      VBAR_EL2,   VBAR_EL1,       NULL            ),
>>> +	EL2_SYSREG(      AFSR0_EL2,  AFSR0_EL1,      NULL            ),
>>> +	EL2_SYSREG(      AFSR1_EL2,  AFSR1_EL1,      NULL            ),
>>> +	EL2_SYSREG(      ESR_EL2,    ESR_EL1,        NULL            ),
>>> +	EL2_SYSREG(      FAR_EL2,    FAR_EL1,        NULL            ),
>>> +	EL2_SYSREG(      MAIR_EL2,   MAIR_EL1,       NULL            ),
>>> +	EL2_SYSREG(      AMAIR_EL2,  AMAIR_EL1,      NULL            ),
>>> +};
>> Figuring out which registers are in this map and which aren't and are supposed
>> to be treated differently is really cumbersome because they are split into two
>> types of el2 registers and their order is different from the order in enum
>> vcpu_sysreg (in kvm_host.h). Perhaps adding a comment about what registers will
>> be treated differently would make the code a bit easier to follow?
> I'm not sure what this buys us. We have 3 categories of EL2 sysregs:
> - Purely emulated
> - Directly mapped onto an EL1 sysreg
> - Translated from EL2 to EL1
>
> I think the wrappers represent that pretty well, although we could split
> EL2_SYSREG into DIRECT_EL2_SYSREG and TRANSLATE_EL2_SYSREG. As for the
> order, does it really matter? We also have the trap table order, which
> is also different from the enum. Do you propose we reorder everything?

The wrappers and the naming are fine.

I was trying to figure out which EL2 registers are in the nested_sysreg_map and
which aren't (that's what I meant by "two types of registers") by looking at the
vcpu_sysreg enum. Because the order in the map is different than the order in
the enum, I was having a difficult time figuring out which registers are not in
the nested_sysreg_map to make sure we haven't somehow forgot to emulate a register.

So no, I wasn't asking to reorder everything. I was asking if it would be
appropriate to write a comment stating the intention to treat registers X, Y and
Z separately from the registers in nested_sysreg_map.

>
>>> +
>>> +static
>>> +const struct el2_sysreg_map *find_el2_sysreg(const struct el2_sysreg_map *map,
>>> +					     int reg)
>>> +{
>>> +	const struct el2_sysreg_map *entry;
>>> +
>>> +	if (!sysreg_is_el2(reg))
>>> +		return NULL;
>>> +
>>> +	entry = &nested_sysreg_map[reg - FIRST_EL2_SYSREG];
>>> +	if (entry->sysreg == __INVALID_SYSREG__)
>>> +		return NULL;
>>> +
>>> +	return entry;
>>> +}
>>> +
>>>  u64 vcpu_read_sys_reg(const struct kvm_vcpu *vcpu, int reg)
>>>  {
>>> +
>>>  	if (!vcpu->arch.sysregs_loaded_on_cpu)
>>>  		goto immediate_read;
>>>  
>>> +	if (unlikely(sysreg_is_el2(reg))) {
>>> +		const struct el2_sysreg_map *el2_reg;
>>> +
>>> +		if (!is_hyp_ctxt(vcpu))
>>> +			goto immediate_read;
>> I'm confused by this. is_hyp_ctxt returns false when the guest is not in vEL2
>> AND HCR_EL.E2H or HCR_EL2.TGE are not set. In this case, the NV bit will not be
>> set and the hardware will raise an undefined instruction exception when
>> accessing an EL2 register from EL1. What am I missing?
> You don't necessarily access an EL2 register just because you run at
> EL2. You also access it because you emulate an EL1 instruction whose
> behaviour is conditioned by an EL2 register.

Got it, now it makes a lot more sense.

>
> Thanks,
>
> 	M.