On 04/16/2018 06:09 AM, David Gibson wrote:
> On Thu, Apr 12, 2018 at 05:02:06PM +1000, Sam Bobroff wrote:
>> It is not currently possible to create the full number of possible
>> VCPUs (KVM_MAX_VCPUS) on Power9 with KVM-HV when the guest uses fewer
>> threads per core than its core stride (or "VSMT mode"). This is
>> because the VCORE ID and XIVE offsets grow beyond KVM_MAX_VCPUS
>> even though the VCPU ID is less than KVM_MAX_VCPU_ID.
>>
>> To address this, "pack" the VCORE ID and XIVE offsets by using
>> knowledge of the way the VCPU IDs will be used when there are fewer
>> guest threads per core than the core stride. The primary thread of
>> each core will always be used first. Then, if the guest uses more
>> than one thread per core, these secondary threads will sequentially
>> follow the primary in each core.
>>
>> So, the only way an ID above KVM_MAX_VCPUS can be seen is if the
>> VCPUs are being spaced apart, so at least half of each core is empty
>> and IDs between KVM_MAX_VCPUS and (KVM_MAX_VCPUS * 2) can be mapped
>> into the second half of each core (4..7, in an 8-thread core).
>>
>> Similarly, if IDs above KVM_MAX_VCPUS * 2 are seen, at least 3/4 of
>> each core is being left empty, and we can map down into the second
>> and third quarters of each core (2, 3 and 5, 6 in an 8-thread core).
>>
>> Lastly, if IDs above KVM_MAX_VCPUS * 4 are seen, only the primary
>> threads are being used and 7/8 of the core is empty, allowing use of
>> the 1, 3, 5 and 7 thread slots.
>>
>> (Strides less than 8 are handled similarly.)
>>
>> This allows the VCORE ID or offset to be calculated quickly from the
>> VCPU ID or XIVE server numbers, without access to the VCPU structure.
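To check my reading of the packing rule above, here is a small
user-space sketch of it (my own reconstruction with illustrative
constants, not the kernel code itself; pack_vcpu_id() stands in for the
patch's kvmppc_pack_vcpu_id()):

#include <assert.h>
#include <stdio.h>

#define MAX_SMT_THREADS 8
#define KVM_MAX_VCPUS   2048	/* illustrative only, not the kernel's value */

/* Slot used inside a core for each successive block of KVM_MAX_VCPUS
 * IDs: primaries first, then slot 4, then 2 and 6, then the odd slots. */
static const int block_offsets[MAX_SMT_THREADS] = {0, 4, 2, 6, 1, 5, 3, 7};

static unsigned int pack_vcpu_id(unsigned int id, int stride)
{
	int block = (id / KVM_MAX_VCPUS) * (MAX_SMT_THREADS / stride);

	assert(block < MAX_SMT_THREADS);
	return id % KVM_MAX_VCPUS + block_offsets[block];
}

int main(void)
{
	/* Stride 8, one guest thread per core: the first block of IDs
	 * (0, 8, 16, ...) is identity-mapped; later blocks fold into
	 * the unused thread slots of each core. */
	printf("%u\n", pack_vcpu_id(KVM_MAX_VCPUS, 8));		/* 4 */
	printf("%u\n", pack_vcpu_id(2 * KVM_MAX_VCPUS, 8));	/* 2 */
	printf("%u\n", pack_vcpu_id(4 * KVM_MAX_VCPUS, 8));	/* 1 */
	return 0;
}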
>>
>> Signed-off-by: Sam Bobroff <sam.bobroff@xxxxxxxxxxx>
>> ---
>> Hello everyone,
>>
>> I've tested this on P8 and P9, in lots of combinations of host and
>> guest threading modes and it has been fine, but it does feel like a
>> "tricky" approach, so I still feel somewhat wary about it.

Have you done any migration?

>> I've posted it as an RFC because I have not tested it with guest
>> native-XIVE, and I suspect that it will take some work to support it.

The KVM XIVE device will be different for XIVE exploitation mode, same
structures though. I will send a patchset shortly.

>>  arch/powerpc/include/asm/kvm_book3s.h | 19 +++++++++++++++++++
>>  arch/powerpc/kvm/book3s_hv.c          | 14 ++++++++++----
>>  arch/powerpc/kvm/book3s_xive.c        |  9 +++++++--
>>  3 files changed, 36 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
>> index 376ae803b69c..1295056d564a 100644
>> --- a/arch/powerpc/include/asm/kvm_book3s.h
>> +++ b/arch/powerpc/include/asm/kvm_book3s.h
>> @@ -368,4 +368,23 @@ extern int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu);
>>  #define SPLIT_HACK_MASK 0xff000000
>>  #define SPLIT_HACK_OFFS 0xfb000000
>>
>> +/* Pack a VCPU ID from the [0..KVM_MAX_VCPU_ID) space down to the
>> + * [0..KVM_MAX_VCPUS) space, while using knowledge of the guest's core
>> + * stride (but not its actual threading mode, which is not available)
>> + * to avoid collisions.
>> + */
>> +static inline u32 kvmppc_pack_vcpu_id(struct kvm *kvm, u32 id)
>> +{
>> +	const int block_offsets[MAX_SMT_THREADS] = {0, 4, 2, 6, 1, 5, 3, 7};
>
> I'd suggest 1,3,5,7 at the end rather than 1,5,3,7 - accomplishes
> roughly the same thing, but I think makes the pattern more obvious.
>
>> +	int stride = kvm->arch.emul_smt_mode > 1 ?
>> +		kvm->arch.emul_smt_mode : kvm->arch.smt_mode;
>
> AFAICT from BUG_ON()s etc. at the callsites, kvm->arch.smt_mode must
> always be 1 when this is called, so the conditional here doesn't seem
> useful.
>
>> +	int block = (id / KVM_MAX_VCPUS) * (MAX_SMT_THREADS / stride);
>> +	u32 packed_id;
>> +
>> +	BUG_ON(block >= MAX_SMT_THREADS);
>> +	packed_id = (id % KVM_MAX_VCPUS) + block_offsets[block];
>> +	BUG_ON(packed_id >= KVM_MAX_VCPUS);
>> +	return packed_id;
>> +}
>
> It took me a while to wrap my head around the packing function, but I
> think I got there in the end. It's pretty clever.
>
> One thing bothers me, though. This certainly packs things under
> KVM_MAX_VCPUS, but not necessarily under the actual number of vcpus.
> e.g. KVM_MAX_VCPUS==16, 8 vcpus total, stride 8, 2 vthreads/vcore (as
> qemu sees it), gives both unpacked IDs (0, 1, 8, 9, 16, 17, 24, 25)
> and packed ids of (0, 1, 8, 9, 4, 5, 12, 13) - leaving 2, 3, 6, 7
> etc. unused.
>
> So again, the question is what exactly are these remapped IDs useful
> for. If we're indexing into a bare array of structures of size
> KVM_MAX_VCPUS then we're *already* wasting a bunch of space by having
> more entries than vcpus. If we're indexing into something sparser,
> then why is the remapping worthwhile?
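Your numbers check out; a standalone loop over the same rule reproduces
them (user-space sketch only, with the constants forced to your
scenario):

#include <stdio.h>

int main(void)
{
	const int block_offsets[8] = {0, 4, 2, 6, 1, 5, 3, 7};
	const int max_vcpus = 16, stride = 8;
	/* KVM_MAX_VCPUS == 16, 8 vcpus, stride 8, 2 vthreads/vcore */
	const int ids[] = {0, 1, 8, 9, 16, 17, 24, 25};
	int i;

	for (i = 0; i < 8; i++) {
		int block = (ids[i] / max_vcpus) * (8 / stride);

		printf("%d -> %d\n", ids[i],
		       ids[i] % max_vcpus + block_offsets[block]);
	}
	return 0;	/* prints 0,1,8,9 then 4,5,12,13; 2,3,6,7 stay unused */
}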
>> +
>>  #endif /* __ASM_KVM_BOOK3S_H__ */
>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
>> index 9cb9448163c4..49165cc90051 100644
>> --- a/arch/powerpc/kvm/book3s_hv.c
>> +++ b/arch/powerpc/kvm/book3s_hv.c
>> @@ -1762,7 +1762,7 @@ static int threads_per_vcore(struct kvm *kvm)
>>  	return threads_per_subcore;
>>  }
>>
>> -static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
>> +static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int id)
>>  {
>>  	struct kvmppc_vcore *vcore;
>>
>> @@ -1776,7 +1776,7 @@ static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
>>  	init_swait_queue_head(&vcore->wq);
>>  	vcore->preempt_tb = TB_NIL;
>>  	vcore->lpcr = kvm->arch.lpcr;
>> -	vcore->first_vcpuid = core * kvm->arch.smt_mode;
>> +	vcore->first_vcpuid = id;
>>  	vcore->kvm = kvm;
>>  	INIT_LIST_HEAD(&vcore->preempt_list);
>>
>> @@ -1992,12 +1992,18 @@ static struct kvm_vcpu *kvmppc_core_vcpu_create_hv(struct kvm *kvm,
>>  	mutex_lock(&kvm->lock);
>>  	vcore = NULL;
>>  	err = -EINVAL;
>> -	core = id / kvm->arch.smt_mode;
>> +	if (cpu_has_feature(CPU_FTR_ARCH_300)) {
>> +		BUG_ON(kvm->arch.smt_mode != 1);
>> +		core = kvmppc_pack_vcpu_id(kvm, id);
>> +	} else {
>> +		core = id / kvm->arch.smt_mode;
>> +	}
>>  	if (core < KVM_MAX_VCORES) {
>>  		vcore = kvm->arch.vcores[core];
>> +		BUG_ON(cpu_has_feature(CPU_FTR_ARCH_300) && vcore);
>>  		if (!vcore) {
>>  			err = -ENOMEM;
>> -			vcore = kvmppc_vcore_create(kvm, core);
>> +			vcore = kvmppc_vcore_create(kvm, id & ~(kvm->arch.smt_mode - 1));
>>  			kvm->arch.vcores[core] = vcore;
>>  			kvm->arch.online_vcores++;
>>  		}
>> diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
>> index f9818d7d3381..681dfe12a5f3 100644
>> --- a/arch/powerpc/kvm/book3s_xive.c
>> +++ b/arch/powerpc/kvm/book3s_xive.c
>> @@ -317,6 +317,11 @@ static int xive_select_target(struct kvm *kvm, u32 *server, u8 prio)
>>  	return -EBUSY;
>>  }
>>
>> +static u32 xive_vp(struct kvmppc_xive *xive, u32 server)
>> +{
>> +	return xive->vp_base + kvmppc_pack_vcpu_id(xive->kvm, server);
>> +}
>> +
>
> I'm finding the XIVE indexing really baffling. There are a bunch of
> other places where the code uses (xive->vp_base + NUMBER) directly.

This links the QEMU vCPU server NUMBER to a XIVE virtual processor
number in OPAL. So we need to check that all used NUMBERs are, first,
consistent and then, in the correct range.

> If those are host side references, I guess they don't need updates
> for this.
> But if that's the case, then how does indexing into the same array
> with both host and guest server numbers make sense?

Yes. VPs are allocated with KVM_MAX_VCPUS:

	xive->vp_base = xive_native_alloc_vp_block(KVM_MAX_VCPUS);

but

	#define KVM_MAX_VCPU_ID  (threads_per_subcore * KVM_MAX_VCORES)

We would need to change the allocation of the VPs I guess.

>>  static u8 xive_lock_and_mask(struct kvmppc_xive *xive,
>>  			     struct kvmppc_xive_src_block *sb,
>>  			     struct kvmppc_xive_irq_state *state)
>> @@ -1084,7 +1089,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
>>  		pr_devel("Duplicate !\n");
>>  		return -EEXIST;
>>  	}
>> -	if (cpu >= KVM_MAX_VCPUS) {
>> +	if (cpu >= KVM_MAX_VCPU_ID) {
>>  		pr_devel("Out of bounds !\n");
>>  		return -EINVAL;
>>  	}
>> @@ -1098,7 +1103,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
>>  	xc->xive = xive;
>>  	xc->vcpu = vcpu;
>>  	xc->server_num = cpu;
>> -	xc->vp_id = xive->vp_base + cpu;
>> +	xc->vp_id = xive_vp(xive, cpu);
>>  	xc->mfrr = 0xff;
>>  	xc->valid = true;
>>
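One more thought on the allocation: since kvmppc_pack_vcpu_id() already
BUG()s out if the packed ID reaches KVM_MAX_VCPUS, the current vp_base
block of KVM_MAX_VCPUS entries might be enough after all. A sketch
(untested, and the WARN_ON is my addition) making that assumption
explicit:

static u32 xive_vp(struct kvmppc_xive *xive, u32 server)
{
	u32 vp_id = kvmppc_pack_vcpu_id(xive->kvm, server);

	/* Packed IDs stay below KVM_MAX_VCPUS, the size of the VP
	 * block we allocated, so vp_base is never overrun here. */
	WARN_ON(vp_id >= KVM_MAX_VCPUS);
	return xive->vp_base + vp_id;
}

Growing the block to KVM_MAX_VCPU_ID entries instead would waste exactly
the VPs the packing is meant to save.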