On 13/08/21 22:34, David Matlack wrote:
This series avoids kvm_vcpu_gfn_to_memslot() calls during page fault
handling by passing around the memslot in struct kvm_page_fault. This
idea came from Ben Gardon, who authored a similar series in Google's
kernel.

This series is an RFC because kvm_vcpu_gfn_to_memslot() calls are
actually quite cheap after commit fe22ed827c5b ("KVM: Cache the last
used slot index per vCPU"), since we always hit the cache. However,
profiling shows there is still some time (1-2%) spent in
kvm_vcpu_gfn_to_memslot(), and that the hot instructions are the memory
loads for kvm->memslots[as_id] and slots->used_slots (see the first
sketch after the diffstat below). This series eliminates that remaining
overhead, but at the cost of a bit of code churn.

Design
------

We can avoid the cost of kvm_vcpu_gfn_to_memslot() by looking up the
slot once and passing it around. In fact this is quite easy to do now
that KVM passes struct kvm_page_fault to most of the page fault
handling code. We can store the slot there without changing most of the
call sites (see the second sketch after the diffstat).

The one exception to this is mmu_set_spte, which does not take a
kvm_page_fault since it is also used during spte prefetching. There are
three memslot lookups under mmu_set_spte:

mmu_set_spte
  rmap_add
    kvm_vcpu_gfn_to_memslot
  rmap_recycle
    kvm_vcpu_gfn_to_memslot
  set_spte
    make_spte
      mmu_try_to_unsync_pages
        kvm_page_track_is_active
          kvm_vcpu_gfn_to_memslot

Avoiding these lookups requires plumbing the slot through all of the
above functions (see the third sketch after the diffstat). I explored
creating a synthetic kvm_page_fault for prefetching so that
kvm_page_fault could be passed to all of these functions instead, but
that resulted in even more code churn.

Patches
-------

Patches 1-2 are small cleanups related to the series.

Patches 3-4 pass the memslot through kvm_page_fault and use it where
kvm_page_fault is already accessible.

Patches 5-6 plumb the memslot down into the guts of mmu_set_spte to
avoid the remaining memslot lookups.

Performance
-----------

I measured the performance using dirty_log_perf_test, taking the
average "Populate memory time" over 10 runs. To help inform whether the
different parts of this series are worth the code churn, I measured the
performance of patches 1-4 and of patches 1-6 separately.

Test                             | tdp_mmu | kvm/queue | Patches 1-4 | Patches 1-6
-------------------------------- | ------- | --------- | ----------- | -----------
./dirty_log_perf_test -v64       | Y       | 5.22s     | 5.20s       | 5.20s
./dirty_log_perf_test -v64 -x64  | Y       | 5.23s     | 5.14s       | 5.14s
./dirty_log_perf_test -v64       | N       | 17.14s    | 16.39s      | 15.36s
./dirty_log_perf_test -v64 -x64  | N       | 17.17s    | 16.60s      | 15.31s

This series provides no performance improvement to the tdp_mmu, but
improves legacy MMU page fault handling by about 10%.

David Matlack (6):
  KVM: x86/mmu: Rename try_async_pf to kvm_faultin_pfn in comment
  KVM: x86/mmu: Fold rmap_recycle into rmap_add
  KVM: x86/mmu: Pass around the memslot in kvm_page_fault
  KVM: x86/mmu: Avoid memslot lookup in page_fault_handle_page_track
  KVM: x86/mmu: Avoid memslot lookup in rmap_add
  KVM: x86/mmu: Avoid memslot lookup in mmu_try_to_unsync_pages

 arch/x86/include/asm/kvm_page_track.h |   4 +-
 arch/x86/kvm/mmu.h                    |   5 +-
 arch/x86/kvm/mmu/mmu.c                | 110 +++++++++----------
 arch/x86/kvm/mmu/mmu_internal.h       |   3 +-
 arch/x86/kvm/mmu/page_track.c         |   6 +-
 arch/x86/kvm/mmu/paging_tmpl.h        |  18 ++++-
 arch/x86/kvm/mmu/spte.c               |  11 +--
 arch/x86/kvm/mmu/spte.h               |   9 ++-
 arch/x86/kvm/mmu/tdp_mmu.c            |  12 +--
 9 files changed, 80 insertions(+), 98 deletions(-)
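First sketch: why the cached lookup still shows up in profiles. This is
a simplified rendering of kvm_vcpu_gfn_to_memslot() as of commit
fe22ed827c5b, not the verbatim code (see virt/kvm/kvm_main.c for the
real thing). Even on a cache hit, the function must load
kvm->memslots[as_id] and consult slots->used_slots:

struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu,
						gfn_t gfn)
{
	/* Loads kvm->memslots[as_id]: one of the hot instructions. */
	struct kvm_memslots *slots = kvm_vcpu_memslots(vcpu);
	int slot_index = READ_ONCE(vcpu->last_used_slot);
	struct kvm_memory_slot *slot;

	/* Validates the index against slots->used_slots: the other one. */
	slot = try_get_memslot(slots, slot_index, gfn);
	if (slot)
		return slot;

	/* Cache miss: fall back to search, then remember the index. */
	slot = search_memslots(slots, gfn, &slot_index);
	if (slot) {
		vcpu->last_used_slot = slot_index;
		return slot;
	}

	return NULL;
}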
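Second sketch: the design, with the caveat that the helper name
kvm_faultin_resolve_slot and the exact signatures are illustrative
rather than the literal patch contents. The slot is resolved once when
the fault is set up, and consumers such as
page_fault_handle_page_track read fault->slot instead of redoing the
lookup:

struct kvm_page_fault {
	gfn_t gfn;
	/* ... existing fields elided ... */
	struct kvm_memory_slot *slot;	/* resolved once per fault */
};

/* Hypothetical setup helper: one memslot lookup per fault. */
static void kvm_faultin_resolve_slot(struct kvm_vcpu *vcpu,
				     struct kvm_page_fault *fault)
{
	fault->slot = kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn);
}

/* A consumer: write tracking checks the cached slot directly. */
static bool page_fault_handle_page_track(struct kvm_vcpu *vcpu,
					 struct kvm_page_fault *fault)
{
	if (!fault->slot)
		return false;

	/* Illustrative slot-based variant of kvm_page_track_is_active(). */
	return kvm_slot_page_track_is_active(fault->slot, fault->gfn,
					     KVM_PAGE_TRACK_WRITE);
}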
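Third sketch: the plumbing done by patches 5-6, again approximate. The
functions under mmu_set_spte grow a slot parameter so that callers
which have already resolved the slot can hand it down, shown here as a
before/after of rmap_add:

/* Before: rmap_add does its own lookup on every call. */
static void rmap_add(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
{
	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);

	/* ... add spte to the slot's rmap ... */
}

/* After: the caller passes the slot it already holds. */
static void rmap_add(struct kvm_vcpu *vcpu,
		     const struct kvm_memory_slot *slot,
		     u64 *spte, gfn_t gfn)
{
	/* ... add spte to the slot's rmap, no lookup needed ... */
}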
Queued patches 1-3, thanks. For the others, see the reply to patch 6.

Paolo