This series adds an option to cause stage-2 fault handlers to exit with
KVM_EXIT_MEMORY_FAULT when they would otherwise be required to fault in the
userspace mappings. Doing so allows userspace to receive stage-2 faults
directly from KVM_RUN instead of through userfaultfd, which suffers from
serious contention issues as the number of vCPUs scales.

Support for the new option (KVM_CAP_EXIT_ON_MISSING) is added to the
demand_paging_test, which demonstrates the scalability improvements: the
following data was collected using [2] on an x86 machine with 256 cores.

vCPUs   Average Paging Rate   Average Paging Rate
        (w/o new caps)        (w/ new caps)
1       150                   340
2       191                   477
4       210                   809
8       155                   1239
16      130                   1595
32      108                   2299
64      86                    3482
128     62                    4134
256     36                    4012

TODO
~~~~
No known issues/things to resolve. However, documentation/commit logs merit a
close look given how much feedback I've received on those :/

Base Commit
~~~~~~~~~~~
This series is based off of kvm/next (45b890f7689e) with v14 of the
guest_memfd series applied, with some fixes on top [3].

Links
~~~~~
[1] Original RFC from James Houghton:
    https://lore.kernel.org/linux-mm/CADrL8HVDB3u2EOhXHCrAgJNLwHkj2Lka1B_kkNb0dNwiWiAN_Q@xxxxxxxxxxxxxx/

[2] ./demand_paging_test -b 64M -u MINOR -s shmem -a -v <n> -r <n> [-w]
    A quick rundown of the new flags (also detailed in later commits):
        -a registers all of guest memory to a single uffd.
        -r specifies the number of reader threads for polling the uffd.
        -w is what actually enables the new capabilities.
    All data was collected after applying the entire series.

[3] https://lore.kernel.org/kvm/20231105163040.14904-1-pbonzini@xxxxxxxxxx/T/#m56361120ee1dd5265a5710e6a814906cda8e1020
    The following fixes are required to get the KVM selftests to compile on
    arm64:
    - https://lore.kernel.org/kvm/20231108233723.3380042-1-amoorthy@xxxxxxxxxx/
    - https://lore.kernel.org/kvm/affca7a8-116e-4b0f-9edf-6cdc05ba65ca@xxxxxxxxxx/
    - Unguarding the definitions of MEM_REGION_GPA/SLOT in
      set_memory_region_test (not sure if this is the "right" fix for that
      test, but it compiles)

---

v6
  - Rebase onto guest_memfd series [Anish/Sean]
  - Set write fault flag properly in user_mem_abort() [Oliver]
  - Reformat unnecessarily multi-line comments [Sean]
  - Drop the kvm_vcpu_read|write_guest_page() annotations [Sean]
  - Rename *USERFAULT_ON_MISSING to *EXIT_ON_MISSING [David]
  - Remove unnecessary rounding in user_mem_abort() annotation [David]
  - Rewrite logs for KVM_MEM_EXIT_ON_MISSING patches and squash them with the
    stage-2 fault annotation patches [Sean]
  - Undo the enum parameter addition to __gfn_to_pfn_memslot(), and just add
    another boolean parameter instead [Sean]
  - Better shortlog for the hva_to_pfn_fast() change [Anish]

v5: https://lore.kernel.org/kvm/20230908222905.1321305-1-amoorthy@xxxxxxxxxx/
  - Rename APIs (again) [Sean]
  - Initialize hardware_exit_reason along w/ exit_reason on x86 [Isaku]
  - Reword hva_to_pfn_fast() change commit message [Sean]
  - Correct style on terminal if statements [Sean]
  - Switch to kconfig to signal KVM_CAP_USERFAULT_ON_MISSING [Sean]
  - Add read fault flag for annotated faults [Sean]
  - read/write_guest_page() changes
      - Move the annotations into vcpu wrapper fns [Sean]
      - Reorder parameters [Robert]
  - Rename kvm_populate_efault_info() to kvm_handle_guest_uaccess_fault()
    [Sean]
  - Remove unnecessary EINVAL on trying to enable memory fault info cap [Sean]
  - Correct description of the faults which hva_to_pfn_fast() can now resolve
    [Sean]
  - Eliminate unnecessary parameter added to
    __kvm_faultin_pfn() [Sean]
  - Magnanimously accept Sean's rewrite of the handle_error_pfn() annotation
    [Anish]
  - Remove vcpu null check from kvm_handle_guest_uaccess_fault [Sean]

v4: https://lore.kernel.org/kvm/20230602161921.208564-1-amoorthy@xxxxxxxxxx/T/#t
  - Fix excessive indentation [Robert, Oliver]
  - Calculate final stats when uffd handler fn returns an error [Robert]
  - Remove redundant info from uffd_desc [Robert]
  - Fix various commit message typos [Robert]
  - Add comment about suppressed EEXISTs in selftest [Robert]
  - Add exit_reasons_known definition for KVM_EXIT_MEMORY_FAULT [Robert]
  - Fix some include/logic issues in self test [Robert]
  - Rename no-slow-gup cap to KVM_CAP_NOWAIT_ON_FAULT [Oliver, Sean]
  - Make KVM_CAP_MEMORY_FAULT_INFO informational-only [Oliver, Sean]
  - Drop most of the annotations from v3: see
    https://lore.kernel.org/kvm/20230412213510.1220557-1-amoorthy@xxxxxxxxxx/T/#mfe28e6a5015b7cd8c5ea1c351b0ca194aeb33daf
  - Remove WARN on bare efaults [Sean, Oliver]
  - Eliminate unnecessary UFFDIO_WAKE call from self test [James]

v3: https://lore.kernel.org/kvm/ZEBXi5tZZNxA+jRs@x1n/T/#t
  - Rework the implementation to be based on two orthogonal capabilities
    (KVM_CAP_MEMORY_FAULT_INFO and KVM_CAP_NOWAIT_ON_FAULT) [Sean, Oliver]
  - Change return code of kvm_populate_efault_info [Isaku]
  - Use kvm_populate_efault_info from arm code [Oliver]

v2: https://lore.kernel.org/kvm/20230315021738.1151386-1-amoorthy@xxxxxxxxxx/

    This was a bit of a misfire, as I sent my WIP series on the mailing list
    but was just targeting Sean for some feedback. Oliver Upton and Isaku
    Yamahata ended up discovering the series and giving me some feedback
    anyways, so thanks to them :) In the end, there was enough discussion to
    justify retroactively labeling it as v2, even with the limited cc list.

  - Introduce KVM_CAP_X86_MEMORY_FAULT_EXIT.
  - API changes:
        - Gate KVM_CAP_MEMORY_FAULT_NOWAIT behind
          KVM_CAP_X86_MEMORY_FAULT_EXIT (on x86 only: arm has no such
          requirement).
        - Switched to memslot flag
  - Take Oliver's simplification to the "allow fast gup for readable faults"
    logic.
  - Slightly redefine the return code of user_mem_abort.
  - Fix documentation errors brought up by Marc
  - Reword commit messages in imperative mood

v1: https://lore.kernel.org/kvm/20230215011614.725983-1-amoorthy@xxxxxxxxxx/

Anish Moorthy (14):
  KVM: Documentation: Clarify meaning of hva_to_pfn()'s 'atomic' parameter
  KVM: Documentation: Add docstrings for __kvm_read/write_guest_page()
  KVM: Simplify error handling in __gfn_to_pfn_memslot()
  KVM: Define and communicate KVM_EXIT_MEMORY_FAULT RWX flags to userspace
  KVM: Try using fast GUP to resolve read faults
  KVM: Add memslot flag to let userspace force an exit on missing hva
    mappings
  KVM: x86: Enable KVM_CAP_EXIT_ON_MISSING and annotate EFAULTs from
    stage-2 fault handler
  KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO
  KVM: arm64: Enable KVM_CAP_EXIT_ON_MISSING and annotate an EFAULT from
    stage-2 fault-handler
  KVM: selftests: Report per-vcpu demand paging rate from demand paging
    test
  KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand
    paging test
  KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal
    errors via TEST_ASSERT
  KVM: selftests: Add memslot_flags parameter to memstress_create_vm()
  KVM: selftests: Handle memory fault exits in demand_paging_test

 Documentation/virt/kvm/api.rst                |  33 +-
 arch/arm64/kvm/Kconfig                        |   1 +
 arch/arm64/kvm/arm.c                          |   1 +
 arch/arm64/kvm/mmu.c                          |   7 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c           |   2 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c        |   2 +-
 arch/x86/kvm/Kconfig                          |   1 +
 arch/x86/kvm/mmu/mmu.c                        |   8 +-
 include/linux/kvm_host.h                      |  21 +-
 include/uapi/linux/kvm.h                      |   5 +
 .../selftests/kvm/aarch64/page_fault_test.c   |   4 +-
 .../selftests/kvm/access_tracking_perf_test.c |   2 +-
 .../selftests/kvm/demand_paging_test.c        | 295 ++++++++++++++----
 .../selftests/kvm/dirty_log_perf_test.c       |   2 +-
 .../testing/selftests/kvm/include/memstress.h |   2 +-
 .../selftests/kvm/include/userfaultfd_util.h  |  17 +-
 tools/testing/selftests/kvm/lib/memstress.c   |   4 +-
 .../selftests/kvm/lib/userfaultfd_util.c      | 159 ++++++----
 .../kvm/memslot_modification_stress_test.c    |   2 +-
 .../x86_64/dirty_log_page_splitting_test.c    |   2 +-
 virt/kvm/Kconfig                              |   3 +
 virt/kvm/kvm_main.c                           |  46 ++-
 22 files changed, 444 insertions(+), 175 deletions(-)

-- 
2.42.0.869.gea05f2083d-goog
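
For readers skimming the series, the behavior described at the top of this
letter can be summarized with a toy model. This is only a sketch under stated
assumptions and is not code from any patch: the names toy_memslot,
toy_handle_fault and the fault_outcome values are hypothetical, and the single
boolean stands in for the KVM_MEM_EXIT_ON_MISSING memslot flag. It shows the
one decision that changes: when the userspace mapping is missing and the flag
is set, the vCPU exits to userspace instead of KVM faulting the mapping in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcomes of a stage-2 fault in this toy model. */
enum fault_outcome {
	FAULT_RESOLVED_FAST, /* userspace mapping present, no fault-in needed */
	FAULT_RESOLVED_SLOW, /* KVM faults in the userspace mapping itself */
	FAULT_EXIT_TO_USER,  /* KVM_RUN returns with a memory fault exit */
};

/* Hypothetical stand-in for a memslot; only the flag matters here. */
struct toy_memslot {
	bool exit_on_missing; /* models the KVM_MEM_EXIT_ON_MISSING flag */
};

/*
 * Toy decision logic, assuming the semantics described in the cover letter:
 * exit only when KVM would otherwise have to fault in the userspace mapping.
 */
enum fault_outcome toy_handle_fault(const struct toy_memslot *slot,
				    bool hva_mapped)
{
	assert(slot != NULL);

	if (hva_mapped)
		return FAULT_RESOLVED_FAST;
	if (slot->exit_on_missing)
		return FAULT_EXIT_TO_USER;
	return FAULT_RESOLVED_SLOW;
}
```

In this model a slot without the flag behaves as before, so only memslots that
opt in ever see the new exit.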