Hey Akihiko, Thanks for having had a look at this. A bunch of comments below. On Fri, 02 Dec 2022 09:18:56 +0000, Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> wrote: > > M2 MacBook Air has mismatched CCSIDR associativity bits among physical > CPUs, which makes the bits a KVM vCPU sees inconsistent when migrating > among them. Your machine *does not* have any mismatched CCSIDR. By definition, any CPU can have any cache hierarchy, and there is no architectural requirement that they are all the same. I'd rather you describe this in architectural terms, and simply point out that KVM exposes the physical topology of the CPU the vcpu runs on (including across migration, which is a problem), and that userspace sees some arbitrary topology that has been sampled at boot time. And both behaviours are a bit wrong in an asymmetric system. This also break live migration for something that should never be a concern of non-secure SW. > > While it is possible to detect CCSIDR associativity bit mismatches and > mask them with that condition, it requires mismatch detection and > increases complexity. Instead, always mask the CCSIDR associativity bits > to keep the code simple. Given the above, this paragraph doesn't make much sense. > > Also, allow the userspace to overwrite the bits with arbitrary values so > that it can restore a vCPU state saved with an older kernel. > > Signed-off-by: Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> > Suggested-by: Marc Zyngier <maz@xxxxxxxxxx> > --- > arch/arm64/include/asm/kvm_arm.h | 3 +- > arch/arm64/include/asm/kvm_emulate.h | 4 - > arch/arm64/include/asm/kvm_host.h | 4 + > arch/arm64/include/asm/sysreg.h | 3 + > arch/arm64/kvm/sys_regs.c | 146 ++++++++++++++------------- > 5 files changed, 87 insertions(+), 73 deletions(-) > > diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h > index 8aa8492dafc0..f69cd96a65ab 100644 > --- a/arch/arm64/include/asm/kvm_arm.h > +++ b/arch/arm64/include/asm/kvm_arm.h > @@ -81,11 +81,12 @@ > * SWIO: Turn set/way invalidates into set/way clean+invalidate > * PTW: Take a stage2 fault if a stage1 walk steps in device memory > * TID3: Trap EL1 reads of group 3 ID registers > + * TID2: Trap CCSIDR_EL1 Not only that, but also CTR_EL0, CCSIDR2_EL1, CLIDR_EL1, and CSSELR_EL1 if the guest is using AArch64, and CCSELR if the guest is using AArch32. > */ > #define HCR_GUEST_FLAGS (HCR_TSC | HCR_TSW | HCR_TWE | HCR_TWI | HCR_VM | \ > HCR_BSU_IS | HCR_FB | HCR_TACR | \ > HCR_AMO | HCR_SWIO | HCR_TIDCP | HCR_RW | HCR_TLOR | \ > - HCR_FMO | HCR_IMO | HCR_PTW | HCR_TID3 ) > + HCR_FMO | HCR_IMO | HCR_PTW | HCR_TID3 | HCR_TID2) > #define HCR_VIRT_EXCP_MASK (HCR_VSE | HCR_VI | HCR_VF) > #define HCR_HOST_NVHE_FLAGS (HCR_RW | HCR_API | HCR_APK | HCR_ATA) > #define HCR_HOST_NVHE_PROTECTED_FLAGS (HCR_HOST_NVHE_FLAGS | HCR_TSC) > diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h > index 9bdba47f7e14..30c4598d643b 100644 > --- a/arch/arm64/include/asm/kvm_emulate.h > +++ b/arch/arm64/include/asm/kvm_emulate.h > @@ -88,10 +88,6 @@ static inline void vcpu_reset_hcr(struct kvm_vcpu *vcpu) > if (vcpu_el1_is_32bit(vcpu)) > vcpu->arch.hcr_el2 &= ~HCR_RW; > > - if (cpus_have_const_cap(ARM64_MISMATCHED_CACHE_TYPE) || > - vcpu_el1_is_32bit(vcpu)) > - vcpu->arch.hcr_el2 |= HCR_TID2; > - > if (kvm_has_mte(vcpu->kvm)) > vcpu->arch.hcr_el2 |= HCR_ATA; > } > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h > index 45e2136322ba..cc051cd56179 100644 > --- a/arch/arm64/include/asm/kvm_host.h > +++ b/arch/arm64/include/asm/kvm_host.h > @@ -53,6 +53,9 @@ > > #define KVM_HAVE_MMU_RWLOCK > > +/* CSSELR values; used to index KVM_REG_ARM_DEMUX_ID_CCSIDR */ > +#define CSSELR_MAX 14 > + > /* > * Mode of operation configurable with kvm-arm.mode early param. > * See Documentation/admin-guide/kernel-parameters.txt for more information. > @@ -266,6 +269,7 @@ struct kvm_cpu_context { > struct user_fpsimd_state fp_regs; > > u64 sys_regs[NR_SYS_REGS]; > + u32 ccsidr[CSSELR_MAX + 1]; kvm_cpu_context is the wrong location for this stuff. We use it for things that get actively context-switched. No such thing here, as this is RO data as far as the guest is concerned. Also, it would probably make some sense to only allocate this memory if the vcpu is not using the default synthesised topology, but something that userspace has restored. > > struct kvm_vcpu *__hyp_running_vcpu; > }; > diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h > index 7d301700d1a9..0c5f3675b4c2 100644 > --- a/arch/arm64/include/asm/sysreg.h > +++ b/arch/arm64/include/asm/sysreg.h > @@ -941,6 +941,9 @@ > #define HFGxTR_EL2_nSMPRI_EL1_SHIFT 54 > #define HFGxTR_EL2_nSMPRI_EL1_MASK BIT_MASK(HFGxTR_EL2_nSMPRI_EL1_SHIFT) > > +/* CCSIDR_EL1 bit definitions */ > +#define CCSIDR_EL1_ASSOCIATIVITY_BITS GENMASK(27, 3) > + > #define ARM64_FEATURE_FIELD_BITS 4 > > /* Create a mask for the feature bits of the specified feature. */ > diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c > index f4a7c5abcbca..3518d021d3a0 100644 > --- a/arch/arm64/kvm/sys_regs.c > +++ b/arch/arm64/kvm/sys_regs.c > @@ -84,24 +84,6 @@ void vcpu_write_sys_reg(struct kvm_vcpu *vcpu, u64 val, int reg) > /* 3 bits per cache level, as per CLIDR, but non-existent caches always 0 */ > static u32 cache_levels; > > -/* CSSELR values; used to index KVM_REG_ARM_DEMUX_ID_CCSIDR */ > -#define CSSELR_MAX 14 > - > -/* Which cache CCSIDR represents depends on CSSELR value. */ > -static u32 get_ccsidr(u32 csselr) > -{ > - u32 ccsidr; > - > - /* Make sure noone else changes CSSELR during this! */ > - local_irq_disable(); > - write_sysreg(csselr, csselr_el1); > - isb(); > - ccsidr = read_sysreg(ccsidr_el1); > - local_irq_enable(); > - > - return ccsidr; > -} > - > /* > * See note at ARMv7 ARM B1.14.4 (TL;DR: S/W ops are not easily virtualized). > */ > @@ -1300,25 +1282,76 @@ static bool access_ccsidr(struct kvm_vcpu *vcpu, struct sys_reg_params *p, > return write_to_read_only(vcpu, p, r); > > csselr = vcpu_read_sys_reg(vcpu, CSSELR_EL1); > - p->regval = get_ccsidr(csselr); > + p->regval = vcpu->arch.ctxt.ccsidr[csselr]; > > - /* > - * Guests should not be doing cache operations by set/way at all, and > - * for this reason, we trap them and attempt to infer the intent, so > - * that we can flush the entire guest's address space at the appropriate > - * time. > - * To prevent this trapping from causing performance problems, let's > - * expose the geometry of all data and unified caches (which are > - * guaranteed to be PIPT and thus non-aliasing) as 1 set and 1 way. > - * [If guests should attempt to infer aliasing properties from the > - * geometry (which is not permitted by the architecture), they would > - * only do so for virtually indexed caches.] > - */ > - if (!(csselr & 1)) // data or unified cache > - p->regval &= ~GENMASK(27, 3); See my early comment about not allocating that memory if we get a chance to always return the default view. > return true; > } > > +static bool is_valid_cache(u32 val) > +{ > + u32 level, ctype; > + > + if (val >= CSSELR_MAX) > + return false; > + > + /* Bottom bit is Instruction or Data bit. Next 3 bits are level. */ > + level = (val >> 1); > + ctype = (cache_levels >> (level * 3)) & 7; Err, I didn't expect this, see below. > + > + switch (ctype) { > + case 0: /* No cache */ > + return false; > + case 1: /* Instruction cache only */ > + return (val & 1); > + case 2: /* Data cache only */ > + case 4: /* Unified cache */ > + return !(val & 1); > + case 3: /* Separate instruction and data caches */ > + return true; > + default: /* Reserved: we can't know instruction or data. */ > + return false; > + } > +} > + > +static void reset_ccsidr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r) > +{ > + u32 ccsidr; > + int i; > + > + /* Make sure noone else changes CSSELR during this! */ > + local_irq_disable(); > + > + for (i = 0; i < CSSELR_MAX; i++) { > + if (!is_valid_cache(i)) > + continue; So you still base the default guest topology on the HW one. This wasn't what I had in mind. I really want KVM to expose something that is totally decoupled from the HW topology. I was thinking of a very basic two level hierarchy: - L0 instruction - L0 data - L1 unified with everything being one set, one way. And that's it. The advantage of such a hierarchy is that it works with both situations where CLIDR_EL1.LoC==0 or CLIDR_EL1.{LoUU,LoUIS}={0,0} and those that are less able, as long as CLIDR_EL1 and CTR_EL0 don't over-promise. Of course, it means that you need to make CLIDR_EL1 consistent with this hierarchy while respecting what the HW offers. > + > + /* Which cache CCSIDR represents depends on CSSELR value. */ > + write_sysreg(i, csselr_el1); > + isb(); > + ccsidr = read_sysreg(ccsidr_el1); > + > + /* > + * Guests should not be doing cache operations by set/way at > + * all, and for this reason, we trap them and attempt to infer > + * the intent, so that we can flush the entire guest's address > + * space at the appropriate time. > + * To prevent this trapping from causing performance problems, > + * let's expose the geometry of all data and unified caches > + * (which are guaranteed to be PIPT and thus non-aliasing) as > + * 1 set and 1 way. > + * [If guests should attempt to infer aliasing properties from > + * the geometry (which is not permitted by the architecture), > + * they would only do so for virtually indexed caches.] > + * > + * This also make sure vCPU see the consistent geometry even if > + * it migrates among phyiscal CPUs with different geometries. > + */ > + vcpu->arch.ctxt.ccsidr[i] = ccsidr & ~CCSIDR_EL1_ASSOCIATIVITY_BITS; > + } > + > + local_irq_enable(); > +} > + > static unsigned int mte_visibility(const struct kvm_vcpu *vcpu, > const struct sys_reg_desc *rd) > { > @@ -1603,7 +1636,7 @@ static const struct sys_reg_desc sys_reg_descs[] = { > > { SYS_DESC(SYS_CNTKCTL_EL1), NULL, reset_val, CNTKCTL_EL1, 0}, > > - { SYS_DESC(SYS_CCSIDR_EL1), access_ccsidr }, > + { SYS_DESC(SYS_CCSIDR_EL1), access_ccsidr, reset_ccsidr }, You also need to handle CCSIDR2_EL1/CCSIDR2 and make them UNDEF. This probably also mean you need to mask out FEAT_CCIDX from the relevant ID registers. > { SYS_DESC(SYS_CLIDR_EL1), access_clidr }, CLIDR_EL1 needs some extra love to return something that is consistent with whatever has been restored. Which mean it must now be handled as something userspace can write to. There is also some subtleties around the handling of MTE caches (I think we can make that "Unified Allocation Tag and Data cache, Allocation Tags and Data in separate lines."). I appreciate this is probably a bit (a lot?) more than what you had in mind, but please hang in there. We'll help you along the way. And if anything seem odd, please shout. I'll do my best to make it clearer. Thanks, M. -- Without deviation from the norm, progress is not possible. _______________________________________________ kvmarm mailing list kvmarm@xxxxxxxxxxxxxxxxxxxxx https://lists.cs.columbia.edu/mailman/listinfo/kvmarm