Hi Marc,

Apologies for the late reply; I was out of the office for a few days.

On 2020/10/23 17:17, Marc Zyngier wrote:
> On 2020-10-23 02:26, Shaokun Zhang wrote:
>> Hi Marc,
>>
>> On 2020/10/22 20:03, Marc Zyngier wrote:
>>> On 2020-10-22 02:57, Shaokun Zhang wrote:
>>>> From: Nianyao Tang <tangnianyao@xxxxxxxxxx>
>>>>
>>>> HCR_EL2.FB is currently set to force-broadcast TLBIs to all online
>>>> pCPUs, even those that the vCPUs do not reside on. This gets worse
>>>> as the system grows larger and more pCPUs come online.
>>>> Let's disable force-broadcast. We flush the TLB when a vCPU moves to
>>>> a new pCPU, in case stale page mappings still exist on the new pCPU.
>>>>
>>>> Cc: Marc Zyngier <maz@xxxxxxxxxx>
>>>> Signed-off-by: Nianyao Tang <tangnianyao@xxxxxxxxxx>
>>>> Signed-off-by: Shaokun Zhang <zhangshaokun@xxxxxxxxxxxxx>
>>>> ---
>>>>  arch/arm64/include/asm/kvm_arm.h | 2 +-
>>>>  arch/arm64/kvm/arm.c             | 4 +++-
>>>>  2 files changed, 4 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
>>>> index 64ce29378467..f85ea9c649cb 100644
>>>> --- a/arch/arm64/include/asm/kvm_arm.h
>>>> +++ b/arch/arm64/include/asm/kvm_arm.h
>>>> @@ -75,7 +75,7 @@
>>>>   * PTW:  Take a stage2 fault if a stage1 walk steps in device memory
>>>>   */
>>>>  #define HCR_GUEST_FLAGS (HCR_TSC | HCR_TSW | HCR_TWE | HCR_TWI | HCR_VM | \
>>>> -                         HCR_BSU_IS | HCR_FB | HCR_TAC | \
>>>> +                         HCR_BSU_IS | HCR_TAC | \
>>>>                           HCR_AMO | HCR_SWIO | HCR_TIDCP | HCR_RW | HCR_TLOR | \
>>>>                           HCR_FMO | HCR_IMO | HCR_PTW )
>>>>  #define HCR_VIRT_EXCP_MASK (HCR_VSE | HCR_VI | HCR_VF)
>>>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>>>> index acf9a993dfb6..845be911f885 100644
>>>> --- a/arch/arm64/kvm/arm.c
>>>> +++ b/arch/arm64/kvm/arm.c
>>>> @@ -334,8 +334,10 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>>>      /*
>>>>       * We might get preempted before the vCPU actually runs, but
>>>>       * over-invalidation doesn't affect correctness.
>>>> +     * Dirty TLB entries might still exist when the vCPU ran on
>>>> +     * another pCPU and modified the page mapping.
>>>>       */
>>>> -    if (*last_ran != vcpu->vcpu_id) {
>>>> +    if (*last_ran != vcpu->vcpu_id || vcpu->cpu != cpu) {
>>>>          kvm_call_hyp(__kvm_tlb_flush_local_vmid, mmu);
>>>>          *last_ran = vcpu->vcpu_id;
>>>>      }
>>>
>>> This breaks uniprocessor semantics for I-cache invalidation. What could
>>
>> Oops, we should have considered this.
>>
>>> possibly go wrong? You also fail to provide any data that would back up
>>> your claim, as usual.
>>
>> We ran the Unixbench spawn test in 4U8G VMs on a Kunpeng 920:
>> (1) 24 4U8G VMs, with each vCPU pinned to one pCPU via taskset so
>>     that the VMs are kept separate.
>
> That's a very narrow use case.
>
>> (2) 1 4U8G VM
>> Result:
>> (1) 800 * 24
>> (2) 3200
>
> I don't know what these numbers are.

These are Unixbench scores; the higher, the better.

>
>> By restricting DVM broadcasting across NUMA nodes, result (1) can
>> be improved to 2300 * 24. In the spawn testcase, TLBI IS is used
>> when creating processes.
>
> Linux always uses TLBI IS, except in some very rare cases.
> If your HW broadcasts invalidations across the whole system
> without any filtering, it is bound to be bad.
>
>> Further, we are considering restricting the TLBI broadcast range to
>> make TLBI more accurate.
>> One scheme is replacing TLBI IS with IPI + local TLBI + HCR_EL2.FB=0.
>
> Ah. That old thing again. No, thank you. The right thing to do
> is to build CPUs and interconnects that properly deal with
> IS invalidations by not sending DVM messages to places where
> things don't run (snoop filters?).
> I also doubt the arm64
> maintainers would be sympathetic to the idea anyway.

We also ran the same test on an Intel 6248 and got a better result: the
multi-VM case shows much less performance drop compared to the single-VM
case. Intel uses the IPI + local TLB flush scheme there; do you think that
scheme is meaningful on Arm architecture hardware?

>
>> Considering I-cache invalidation, KVM would also need to invalidate
>> the I-cache when moving a vCPU to a new pCPU.
>> Do we miss any cases that need HCR_EL2.FB == 1?
>> Eventually we expect HCR_EL2.FB == 0 if it is possible.
>
> I have no intention to ever set FB to 0, as this would result
> in over-invalidation in the general case, and worse performance
> on well designed systems.
>
> M.

The reason we want to disable FB is that it would be needed by the IPI
solution, if that approach is accepted. A rough sketch of what we have in
mind follows below.

Thanks,
Shaokun
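P.S. To make the "IPI + local TLBI" idea above more concrete, here is a
rough, untested sketch. The helpers kvm_flush_vmid_by_ipi() and
kvm_flush_local_vmid(), and the way the pCPU mask is collected, are only
illustrative assumptions and are not part of the posted patch; only the
__kvm_tlb_flush_local_vmid hyp call comes from the existing code.

/*
 * Illustrative sketch only (not from the posted patch): instead of
 * relying on broadcast TLBI / HCR_EL2.FB, IPI the pCPUs that currently
 * run this VM's vCPUs and let each of them perform a local stage-2
 * invalidation via the existing __kvm_tlb_flush_local_vmid helper.
 */
#include <linux/cpumask.h>
#include <linux/kvm_host.h>
#include <linux/smp.h>

/* IPI callback: local (non-broadcast) stage-2 TLB invalidation. */
static void kvm_flush_local_vmid(void *info)
{
	struct kvm_s2_mmu *mmu = info;

	kvm_call_hyp(__kvm_tlb_flush_local_vmid, mmu);
}

static void kvm_flush_vmid_by_ipi(struct kvm *kvm, struct kvm_s2_mmu *mmu)
{
	struct kvm_vcpu *vcpu;
	cpumask_var_t cpus;
	int i;

	if (!zalloc_cpumask_var(&cpus, GFP_KERNEL)) {
		/* A real implementation would fall back to broadcast TLBI. */
		return;
	}

	/* Collect the pCPUs where this VM's vCPUs were last loaded. */
	kvm_for_each_vcpu(i, vcpu, kvm) {
		if (vcpu->cpu >= 0)
			cpumask_set_cpu(vcpu->cpu, cpus);
	}

	/* IPI those pCPUs and run the local invalidation there. */
	on_each_cpu_mask(cpus, kvm_flush_local_vmid, mmu, 1);

	free_cpumask_var(cpus);
}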