> On Jul 3, 2019, at 10:43 AM, Andrew Cooper <andrew.cooper3@xxxxxxxxxx> wrote:
>
> On 03/07/2019 18:02, Nadav Amit wrote:
>>> On Jul 3, 2019, at 7:04 AM, Juergen Gross <jgross@xxxxxxxx> wrote:
>>>
>>> On 03.07.19 01:51, Nadav Amit wrote:
>>>> To improve TLB shootdown performance, flush the remote and local TLBs
>>>> concurrently. Introduce flush_tlb_multi() that does so. Introduce
>>>> paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
>>>> and hyper-v are only compile-tested).
>>>>
>>>> While the updated smp infrastructure is capable of running a function on
>>>> a single local core, it is not optimized for this case. The multiple
>>>> function calls and the indirect branch introduce some overhead, and
>>>> might make local TLB flushes slower than they were before the recent
>>>> changes.
>>>>
>>>> Before calling the SMP infrastructure, check if only a local TLB flush
>>>> is needed to restore the lost performance in this common case. This
>>>> requires checking mm_cpumask() one more time, but unless this mask is
>>>> updated very frequently, this should not impact performance negatively.
>>>>
>>>> Cc: "K. Y. Srinivasan" <kys@xxxxxxxxxxxxx>
>>>> Cc: Haiyang Zhang <haiyangz@xxxxxxxxxxxxx>
>>>> Cc: Stephen Hemminger <sthemmin@xxxxxxxxxxxxx>
>>>> Cc: Sasha Levin <sashal@xxxxxxxxxx>
>>>> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
>>>> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
>>>> Cc: Borislav Petkov <bp@xxxxxxxxx>
>>>> Cc: x86@xxxxxxxxxx
>>>> Cc: Juergen Gross <jgross@xxxxxxxx>
>>>> Cc: Paolo Bonzini <pbonzini@xxxxxxxxxx>
>>>> Cc: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
>>>> Cc: Andy Lutomirski <luto@xxxxxxxxxx>
>>>> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
>>>> Cc: Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>
>>>> Cc: linux-hyperv@xxxxxxxxxxxxxxx
>>>> Cc: linux-kernel@xxxxxxxxxxxxxxx
>>>> Cc: virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx
>>>> Cc: kvm@xxxxxxxxxxxxxxx
>>>> Cc: xen-devel@xxxxxxxxxxxxxxxxxxxx
>>>> Signed-off-by: Nadav Amit <namit@xxxxxxxxxx>
>>>> ---
>>>>  arch/x86/hyperv/mmu.c                 | 13 +++---
>>>>  arch/x86/include/asm/paravirt.h       |  6 +--
>>>>  arch/x86/include/asm/paravirt_types.h |  4 +-
>>>>  arch/x86/include/asm/tlbflush.h       |  9 ++--
>>>>  arch/x86/include/asm/trace/hyperv.h   |  2 +-
>>>>  arch/x86/kernel/kvm.c                 | 11 +++--
>>>>  arch/x86/kernel/paravirt.c            |  2 +-
>>>>  arch/x86/mm/tlb.c                     | 65 ++++++++++++++++++++-------
>>>>  arch/x86/xen/mmu_pv.c                 | 20 ++++++---
>>>>  include/trace/events/xen.h            |  2 +-
>>>>  10 files changed, 91 insertions(+), 43 deletions(-)
>>> ...
>>>
>>>> diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
>>>> index beb44e22afdf..19e481e6e904 100644
>>>> --- a/arch/x86/xen/mmu_pv.c
>>>> +++ b/arch/x86/xen/mmu_pv.c
>>>> @@ -1355,8 +1355,8 @@ static void xen_flush_tlb_one_user(unsigned long addr)
>>>>      preempt_enable();
>>>>  }
>>>> -static void xen_flush_tlb_others(const struct cpumask *cpus,
>>>> -                                 const struct flush_tlb_info *info)
>>>> +static void xen_flush_tlb_multi(const struct cpumask *cpus,
>>>> +                                const struct flush_tlb_info *info)
>>>>  {
>>>>      struct {
>>>>          struct mmuext_op op;
>>>> @@ -1366,7 +1366,7 @@ static void xen_flush_tlb_others(const struct cpumask *cpus,
>>>>      const size_t mc_entry_size = sizeof(args->op) +
>>>>          sizeof(args->mask[0]) * BITS_TO_LONGS(num_possible_cpus());
>>>> -    trace_xen_mmu_flush_tlb_others(cpus, info->mm, info->start, info->end);
>>>> +    trace_xen_mmu_flush_tlb_multi(cpus, info->mm, info->start, info->end);
>>>>      if (cpumask_empty(cpus))
>>>>          return; /* nothing to do */
>>>> @@ -1375,9 +1375,17 @@ static void xen_flush_tlb_others(const struct cpumask *cpus,
>>>>      args = mcs.args;
>>>>      args->op.arg2.vcpumask = to_cpumask(args->mask);
>>>> -    /* Remove us, and any offline CPUS. */
>>>> +    /* Flush locally if needed and remove us */
>>>> +    if (cpumask_test_cpu(smp_processor_id(), to_cpumask(args->mask))) {
>>>> +        local_irq_disable();
>>>> +        flush_tlb_func_local(info);
>>> I think this isn't the correct function for PV guests.
>>>
>>> In fact it should be much easier: just don't clear the local cpu from the
>>> mask, that's all that's needed. The hypervisor is just fine having the
>>> current cpu in the mask and it will do the right thing.
>> Thanks. I will do so in v3. I don’t think Hyper-V people would want to do
>> the same, unfortunately, since it would induce VM-exit on TLB flushes.
>
> Why do you believe the vmexit matters? You're taking one anyway for
> the IPI.
>
> Intel only have virtualised self-IPI, and while AMD do have working
> non-self IPIs, you still take a vmexit anyway if any destination vcpu
> isn't currently running in non-root mode (IIRC).
>
> At that point, you might as well have the hypervisor do all the hard
> work via a multi-cpu shootdown/flush hypercall, rather than trying to
> arrange it locally.

I forgot that xen_flush_tlb_multi() should actually only be called when
there are some remote CPUs (as I optimized the case in which there is only a
single local CPU that needs to be flushed), so you are right.
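
For illustration only, here is a rough user-space model of the dispatch
discussed above. It is not the kernel code: cpumask_t is a plain 64-bit word
here, and choose_flush_path() and hypercall_mask() are hypothetical names
made up for this sketch. It assumes the caller takes the local-only fast
path before ever reaching the multi-CPU (paravirtual) path, and that the
mask handed to the hypervisor only drops offline CPUs while keeping the
current CPU in it, as suggested in the review.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model (assumption): bit n of the word stands for CPU n. */
typedef uint64_t cpumask_t;

enum flush_path { FLUSH_NONE, FLUSH_LOCAL_ONLY, FLUSH_MULTI };

/*
 * Decide which path a flush request would take.
 * The multi path is only chosen when at least one remote CPU is in the
 * mask, so a multi-CPU callback never sees a mask holding only this CPU.
 */
enum flush_path choose_flush_path(cpumask_t mask, unsigned int this_cpu)
{
    assert(this_cpu < 64);
    cpumask_t self = (cpumask_t)1 << this_cpu;

    if (mask == 0)
        return FLUSH_NONE;          /* nothing to do */
    if ((mask & ~self) == 0)
        return FLUSH_LOCAL_ONLY;    /* only the current CPU needs a flush */
    return FLUSH_MULTI;             /* some remote CPU is involved */
}

/*
 * Mask that would be passed to the hypervisor on the multi path:
 * offline CPUs are removed, the current CPU is deliberately left in.
 */
cpumask_t hypercall_mask(cpumask_t mask, cpumask_t online)
{
    return mask & online;
}
```

With this split, the multi path always has remote work to do, so the
hypercall is needed regardless and letting the hypervisor flush the local
CPU as well adds no extra exit in this model.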