The remote TLB flush APIs do a busy wait, which is fine in the bare-metal scenario. Within a guest, however, the vcpus might have been preempted or blocked, and in that case the initiating vcpu ends up busy-waiting for a long time. This was discovered in our gang scheduling test, and an alternative way to solve it is to para-virtualize flush_tlb_others_ipi (which now shows up as smp_call_function_many after Alex Shi's TLB optimization).

This patch set implements paravirt TLB flush, making sure that the initiator does not wait for vcpus that are sleeping; instead, the sleeping vcpus flush the TLB on guest enter. The idea was discussed here: https://lkml.org/lkml/2012/2/20/157 (a minimal sketch of the protocol is appended below, after the results).

This also brings in one more dependency for the lockless page walk performed by get_user_pages_fast (gup_fast). gup_fast disables interrupts and assumes that the page-table pages will not be freed during that period, which was fine as long as flush_tlb_others_ipi waited for all the IPIs to be processed before returning. With the new approach of not waiting for sleeping vcpus, this assumption no longer holds, so HAVE_RCU_TABLE_FREE is now used to free the pages. This makes sure that all the cpus have at least processed the smp callback before the pages are freed (a second sketch after the results illustrates this).

Changelog from v2:
• Rebase to the 3.5-based Linus kernel (commit f7da9cd).
• Port PV-Flush to the new TLB optimization code by Alex Shi.
• Use pinned pages to avoid overhead during guest enter/exit (Marcelo).
• Remove the kick, as it was not improving things much.
• Use bit fields in the state flag (flush_on_enter and vcpu_running) to avoid smp barriers (Marcelo).
• Add documentation for paravirt TLB flush (Marcelo).

Changelog from v1:
• Race fixes reported by Vatsa.
• Address the gup_fast dependency using PeterZ's RCU table free patch.
• Fix rcu_table_free for hardware page-table walkers.

Here are the results from PLE hardware. Setup details:
• 32 CPUs (HT disabled)
• 64-bit VM
• 32 vcpus
• 8GB RAM

base:    f7da9cd (based on the 3.5 kernel; includes Rik's and Alex Shi's changes)
pleopt:  Raghu's PLE improvements [1] (in kvm:auto-next now)
pv3flsh: pleopt + paravirt flush v3

Lower is better.

kbench - 1VM
============
             Avg         Stddev
base         16.714089   1.2471967
pleopt       12.527411   0.15261886
pv3flsh      12.955556   0.5041832

kbench - 2VM
============
             Avg         Stddev
base         28.565933   3.0167804
pleopt       22.7613     1.9046476
pv3flsh      23.034083   2.2192968

Higher is better.

ebizzy - 1VM
============
             Avg         Stddev
base         1091        21.674358
pleopt       2239        45.188494
pv3flsh      2170.7      44.592102

ebizzy - 2VM
============
             Avg         Stddev
base         1824.7      63.708299
pleopt       2383.2      107.46779
pv3flsh      2328.2      69.359172

Observations:
-------------
Looking at the results above, the pleopt [1] patches have already addressed the remote-TLB-flush issue that we were trying to address using the paravirt-tlb-flush approach.

[1] http://article.gmane.org/gmane.linux.kernel/1329752
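To make the protocol concrete, here is a minimal, self-contained userspace C model of the idea. It is only an illustration: the names (vcpu_state, VCPU_RUNNING, FLUSH_ON_ENTER, the helpers) are invented for the sketch and are not the identifiers used in the patches, and the races fixed during the v1/v2 review are only noted in comments, not handled.

/*
 * Minimal userspace model of the paravirt flush protocol described above.
 * Each "vcpu" is modelled by a shared state word with a running bit and a
 * flush-on-enter bit, mirroring the vcpu_running/flush_on_enter fields
 * mentioned in the v2 changelog.
 *
 * Build: cc -std=c11 -o pvflush-model pvflush-model.c
 */
#include <stdatomic.h>
#include <stdio.h>

#define VCPU_RUNNING	(1u << 0)
#define FLUSH_ON_ENTER	(1u << 1)

#define NR_VCPUS 4
static atomic_uint vcpu_state[NR_VCPUS];

static void local_flush_tlb(int cpu)
{
	printf("vcpu%d: flushing TLB\n", cpu);
}

/* Guest side: request remote flushes without waiting on sleeping vcpus. */
static void pv_flush_tlb_others(int me)
{
	for (int cpu = 0; cpu < NR_VCPUS; cpu++) {
		if (cpu == me)
			continue;
		if (atomic_load(&vcpu_state[cpu]) & VCPU_RUNNING) {
			/* Running vcpu: send the flush IPI as before. */
			printf("vcpu%d: flush IPI to vcpu%d\n", me, cpu);
		} else {
			/*
			 * Sleeping/preempted vcpu: do not busy-wait for it;
			 * mark it so the hypervisor flushes its TLB on the
			 * next guest enter instead.  (The check-then-mark
			 * race here is one of the things the real series
			 * has to deal with.)
			 */
			atomic_fetch_or(&vcpu_state[cpu], FLUSH_ON_ENTER);
		}
	}
}

/* Host side: on guest enter, mark running and honour a deferred flush. */
static void guest_enter(int cpu)
{
	unsigned int old = atomic_exchange(&vcpu_state[cpu], VCPU_RUNNING);

	if (old & FLUSH_ON_ENTER)
		local_flush_tlb(cpu);
}

/* Host side: on guest exit/preemption, clear the running bit. */
static void guest_exit(int cpu)
{
	atomic_fetch_and(&vcpu_state[cpu], ~VCPU_RUNNING);
}

int main(void)
{
	guest_enter(0);			/* vcpu0 runs, vcpu1-3 are sleeping */
	pv_flush_tlb_others(0);		/* marks vcpu1-3, no busy-waiting   */
	guest_enter(1);			/* vcpu1 wakes up and flushes first */
	guest_exit(1);
	guest_exit(0);
	return 0;
}

In the actual series the per-vcpu state lives in pinned guest memory (per the v2 changelog) rather than in a plain array, but the enter/exit handshake it models is the one described above.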
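Similarly, here is a rough userspace analogy of the gup_fast dependency, written against liburcu. The library choice and all names are assumptions of this illustration, not the kernel implementation: rcu_read_lock() stands in for the interrupts-off window of gup_fast(), and synchronize_rcu() stands in for making sure every cpu has processed the smp callback before a page-table page is actually freed.

/*
 * Userspace analogy of the HAVE_RCU_TABLE_FREE requirement.
 * Build: cc -std=c11 rcu-table-free-model.c -lurcu
 * Illustration only: the kernel uses its own deferred-free machinery,
 * not liburcu, and the struct/function names below are invented.
 */
#include <stdio.h>
#include <stdlib.h>
#include <urcu.h>		/* rcu_read_lock(), synchronize_rcu(), ... */

struct pagetable {
	unsigned long entries[512];
};

static struct pagetable *pgtable;

/* "gup_fast": lockless walk; the read section models interrupts being off. */
static unsigned long lockless_walk(unsigned long idx)
{
	unsigned long val;

	rcu_read_lock();
	val = rcu_dereference(pgtable)->entries[idx];
	rcu_read_unlock();
	return val;
}

/*
 * Retiring a page-table page: unhook it, then free it only after a grace
 * period, i.e. only after every concurrent walker has left its read-side
 * section.  The old scheme instead relied on the flush IPI having been
 * processed by all cpus before the remote flush returned.
 */
static void retire_pagetable(void)
{
	struct pagetable *old = pgtable;

	rcu_assign_pointer(pgtable, NULL);
	synchronize_rcu();
	free(old);
}

int main(void)
{
	rcu_register_thread();

	pgtable = calloc(1, sizeof(*pgtable));
	pgtable->entries[0] = 42;
	printf("entry 0 = %lu\n", lockless_walk(0));

	retire_pagetable();

	rcu_unregister_thread();
	return 0;
}

PeterZ's two mm patches in this series provide the kernel-side version of this deferral.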
---

Nikunj A. Dadhania (6):
      KVM Guest: Add VCPU running/pre-empted state for guest
      KVM-HV: Add VCPU running/pre-empted state for guest
      KVM Guest: Add paravirt kvm_flush_tlb_others
      KVM-HV: Add flush_on_enter before guest enter
      Enable HAVE_RCU_TABLE_FREE for kvm when PARAVIRT_TLB_FLUSH is enabled
      KVM-doc: Add paravirt tlb flush document

Peter Zijlstra (2):
      mm, x86: Add HAVE_RCU_TABLE_FREE support
      mm: Add missing TLB invalidate to RCU page-table freeing

 Documentation/virtual/kvm/msr.txt                |    4 +
 Documentation/virtual/kvm/paravirt-tlb-flush.txt |   53 +++++++++++++++++++
 arch/Kconfig                                     |    3 +
 arch/powerpc/Kconfig                             |    1 
 arch/sparc/Kconfig                               |    1 
 arch/x86/Kconfig                                 |   11 ++++
 arch/x86/include/asm/kvm_host.h                  |    7 ++
 arch/x86/include/asm/kvm_para.h                  |   13 +++++
 arch/x86/include/asm/tlb.h                       |    1 
 arch/x86/include/asm/tlbflush.h                  |   11 ++++
 arch/x86/kernel/kvm.c                            |   38 +++++++++++++
 arch/x86/kvm/cpuid.c                             |    1 
 arch/x86/kvm/x86.c                               |   62 +++++++++++++++++++++-
 arch/x86/mm/pgtable.c                            |    6 +-
 arch/x86/mm/tlb.c                                |   37 +++++++++++++
 include/asm-generic/tlb.h                        |    9 +++
 mm/memory.c                                      |   43 +++++++++++++--
 17 files changed, 290 insertions(+), 11 deletions(-)
 create mode 100644 Documentation/virtual/kvm/paravirt-tlb-flush.txt