Hi Marc, 在 2020/10/22 20:03, Marc Zyngier 写道: > On 2020-10-22 02:57, Shaokun Zhang wrote: >> From: Nianyao Tang <tangnianyao@xxxxxxxxxx> >> >> Now HCR_EL2.FB is set to force broadcast tlbi to all online pcpus, >> even those vcpu do not resident on. It would get worse as system >> gets larger and more pcpus get online. >> Let's disable force-broadcast. We flush tlbi when move vcpu to a >> new pcpu, in case old page mapping still exists on new pcpu. >> >> Cc: Marc Zyngier <maz@xxxxxxxxxx> >> Signed-off-by: Nianyao Tang <tangnianyao@xxxxxxxxxx> >> Signed-off-by: Shaokun Zhang <zhangshaokun@xxxxxxxxxxxxx> >> --- >> arch/arm64/include/asm/kvm_arm.h | 2 +- >> arch/arm64/kvm/arm.c | 4 +++- >> 2 files changed, 4 insertions(+), 2 deletions(-) >> >> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h >> index 64ce29378467..f85ea9c649cb 100644 >> --- a/arch/arm64/include/asm/kvm_arm.h >> +++ b/arch/arm64/include/asm/kvm_arm.h >> @@ -75,7 +75,7 @@ >> * PTW: Take a stage2 fault if a stage1 walk steps in device memory >> */ >> #define HCR_GUEST_FLAGS (HCR_TSC | HCR_TSW | HCR_TWE | HCR_TWI | HCR_VM | \ >> - HCR_BSU_IS | HCR_FB | HCR_TAC | \ >> + HCR_BSU_IS | HCR_TAC | \ >> HCR_AMO | HCR_SWIO | HCR_TIDCP | HCR_RW | HCR_TLOR | \ >> HCR_FMO | HCR_IMO | HCR_PTW ) >> #define HCR_VIRT_EXCP_MASK (HCR_VSE | HCR_VI | HCR_VF) >> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c >> index acf9a993dfb6..845be911f885 100644 >> --- a/arch/arm64/kvm/arm.c >> +++ b/arch/arm64/kvm/arm.c >> @@ -334,8 +334,10 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) >> /* >> * We might get preempted before the vCPU actually runs, but >> * over-invalidation doesn't affect correctness. >> + * Dirty tlb might still exist when vcpu ran on other pcpu >> + * and modified page mapping. >> */ >> - if (*last_ran != vcpu->vcpu_id) { >> + if (*last_ran != vcpu->vcpu_id || vcpu->cpu != cpu) { >> kvm_call_hyp(__kvm_tlb_flush_local_vmid, mmu); >> *last_ran = vcpu->vcpu_id; >> } > > This breaks uniprocessor semantics for I-cache invalidation. What could Oops, It shall consider this. > possibly go wrong? You also fail to provide any data that would back up > your claim, as usual. Test Unixbench spawn in each 4U8G vm on Kunpeng 920: (1) 24 4U8G VMs, each vcpu is taskset to one pcpu to make sure each vm seperated. (2) 1 4U8G VM Result: (1) 800 * 24 (2) 3200 By restricting DVM broadcasting across NUMA, result (1) can be improved to 2300 * 24. In spawn testcase, tlbiis used in creating process. Further, we consider restricting tlbi broadcast range and make tlbi more accurate. One scheme is replacing tlbiis with ipi + tlbi + HCR_EL2.FB=0. Consider I-cache invalidation, KVM also needs to invalid icache when moving vcpu to a new pcpu. Do we miss any cases that need HCR_EL2.FB == 1? Eventually we expect HCR_EL2.FB == 0 if it is possible. Thanks, Shaokun > > M. _______________________________________________ kvmarm mailing list kvmarm@xxxxxxxxxxxxxxxxxxxxx https://lists.cs.columbia.edu/mailman/listinfo/kvmarm