There are two stages of page fault. The guest kernel is responsible for
handling stage one page faults, while the host kernel takes care of
stage two page faults. When a stage two page fault is triggered, the
guest is suspended until the requested memory (page) is populated.
Populating the requested page can be expensive and can take hundreds of
milliseconds in extreme cases, which hurts the guest's overall
performance.

This series introduces asynchronous page fault support to resolve the
issue and improve the guest's performance. It depends on the series to
support SDEI virtualization and to refactor the SDEI client driver. It
also depends on QEMU changes to export the SDEI/APFT tables. All the
code, including this series, can be found on GitHub:

   https://github.com/gwshan/linux ("sdei_client")
   https://github.com/gwshan/linux ("sdei")
   https://github.com/gwshan/linux ("apf")
   https://github.com/gwshan/qemu  ("apf")

The functionality is driven by two notifications: page-not-present and
page-ready. They're delivered from the host to the guest via an SDEI
event and a PPI, respectively. Each notification is always associated
with a token, which identifies the notification and is passed through
memory shared between host and guest. In addition, the SMCCC interface
is used by the guest to configure, enable or disable the functionality.
This is the traditional control path.

When the guest traps to the host because of a stage two page fault and
the requested page can't be populated immediately, a page-not-present
notification is raised by the host and sent to the guest through a
(KVM private) SDEI event (0x40200001). Meanwhile, a (background) worker
is started to populate the requested page. On receiving the SDEI event,
the guest marks the currently running process with a special flag
(TIF_ASYNC_PF) and associates the process with a pre-defined waitqueue.
At the same time, a (reschedule) IPI is sent to the CPU where the
process was running. After the SDEI event is acknowledged by the guest,
the (reschedule) IPI is delivered and causes a context switch from
kernel to user space. During the context switch, the process with the
TIF_ASYNC_PF flag is suspended on the associated waitqueue. Later on, a
page-ready notification is sent to the guest after the requested page
has been populated by the (background) worker. On receiving the
interrupt, the guest uses the associated token to locate the process
that was previously suspended because of page-not-present, and wakes it
up.

The series is organized as below:

   PATCH[1-2]: support KVM hypervisor SMCCC services, which were
               developed by Will Deacon.
   PATCH[3]:   export kvm_handle_user_mem_abort() with the @prefault
               parameter supported, which is preparatory work to
               support the feature.
   PATCH[4]:   support asynchronous page fault on the host side.
   PATCH[5]:   expose the APFT (Asynchronous Page Fault Table) ACPI
               table, which will be used by the guest kernel to support
               the feature.
   PATCH[6]:   support asynchronous page fault on the guest side.

=======
Testing
=======

In test cases [1] and [2], "testsuite mem" is executed to allocate the
specified percentage of free memory (90%) and then release it. The test
is run both with and without a concurrent calculation thread. Without
the calculation thread, there is no obvious performance degradation.
With the calculation thread, performance improves by 27.7% and 28.6%
respectively, depending on whether THP is enabled on the host side.

In test cases [3] and [4], the kernel image is built and the elapsed
time is measured. Performance improves by 9.7% and 9.9% respectively,
depending on whether THP is enabled on the host side.

[1] Two threads to allocate/free memory and do calculation

    vCPU:                  1
    Memory:                8GB
    memory.limit_in_bytes: 2GB
    memory.swappiness:     100
    host:                  THP disabled
    command:               "testsuite mem 90 1 [thread]"

    "-": Disabled asynchronous page fault
    "+": Enabled asynchronous page fault
    "T": With the calculation thread

    Idx  -      +      Output  T-                   T+                   Output
    ===========================================================================
    1    93.1s  93.6s  -       223.8s  21117147961  391.9s  49845637101  -
    2    93.3s  94.2s  -       237.9s  23394567744  397.0s  50506074773  -
    3    93.5s  94.3s  -       244.2s  24305177553  405.8s  51853498870  -
    4    94.1s  95.0s  -       262.8s  27113310073  421.7s  54338181069  -
    5    94.3s  95.2s  -       272.7s  28565479414  434.3s  56171922019  -
    ===========================================================================
         93.6s  94.4s  -0.8%   248.2s  24899136549  410.1s  52543062766
                               100318841/s          128122562/s          +27.7%

[2] Two threads to allocate/free memory and do calculation

    vCPU:                  1
    Memory:                8GB
    memory.limit_in_bytes: 2GB
    memory.swappiness:     100
    host:                  THP enabled
    command:               "testsuite mem 90 1 [thread]"

    "-": Disabled asynchronous page fault
    "+": Enabled asynchronous page fault
    "T": With the calculation thread

    Idx  -      +      Output  T-                   T+                   Output
    ===========================================================================
    1    91.3s  91.2s  -       218.8s  20319612017  389.6s  49016175698  -
    2    91.7s  91.6s  -       233.9s  22619566161  402.0s  50901616319  -
    3    91.8s  91.9s  -       251.1s  25066180266  405.3s  51247353704  -
    4    92.7s  92.2s  -       251.1s  25262121229  406.9s  51692420054  -
    5    93.1s  92.2s  -       260.7s  26532616925  425.4s  54412348724  -
    ===========================================================================
         92.1s  91.8s  +3.0%   243.1s  23960019319  405.8s  51453982899
                               98560342/s           126796409/s          +28.6%

[3] Clear kernel image and rebuild it.

    vCPU:                  24
    Memory:                8GB
    memory.limit_in_bytes: 2GB
    memory.swappiness:     100
    Host:                  THP disabled
    command:               "make -j 24 clean > /dev/null 2>&1 && make -j 24 > /dev/null 2>&1"

    Idx  Disabled  Enabled  Output
    ==================================
    1    2211s     2000s    +9.5%
    2    2333s     2060s    +11.7%
    3    2568s     2192s    +14.6%
    4    2631s     2423s    +7.9%
    5    2756s     2605s    +5.4%
    ==================================
         2499s     2256s    +9.7%

[4] Clear kernel image and rebuild it.

    vCPU:                  24
    Memory:                8GB
    memory.limit_in_bytes: 2GB
    memory.swappiness:     100
    Host:                  THP enabled
    command:               "make -j 24 clean > /dev/null 2>&1 && make -j 24 > /dev/null 2>&1"

    Idx  Disabled  Enabled  Output
    ==================================
    1    2049s     1850s    +9.7%
    2    2144s     1947s    +9.1%
    3    2164s     1997s    +7.7%
    4    2192s     2031s    +7.3%
    5    2515s     2141s    +14.8%
    ==================================
         2214s     1993s    +9.9%

Gavin Shan (4):
  kvm/arm64: Export kvm_handle_user_mem_abort() with prefault mode
  arm64/kvm: Support async page fault
  drivers/acpi: Import ACPI APF table
  arm64/kernel: Support async page fault

Will Deacon (2):
  arm64: Probe for the presence of KVM hypervisor services during boot
  arm/arm64: KVM: Advertise KVM UID to guests via SMCCC

 arch/arm64/Kconfig                     |  11 +
 arch/arm64/include/asm/esr.h           |   5 +
 arch/arm64/include/asm/hypervisor.h    |  11 +
 arch/arm64/include/asm/kvm_emulate.h   |   8 +-
 arch/arm64/include/asm/kvm_host.h      |  54 +++
 arch/arm64/include/asm/kvm_para.h      |  41 +++
 arch/arm64/include/asm/processor.h     |   1 +
 arch/arm64/include/asm/thread_info.h   |   4 +-
 arch/arm64/include/uapi/asm/Kbuild     |   2 -
 arch/arm64/include/uapi/asm/kvm_para.h |  23 ++
 arch/arm64/kernel/Makefile             |   1 +
 arch/arm64/kernel/kvm.c                | 478 +++++++++++++++++++++++++
 arch/arm64/kernel/setup.c              |  32 ++
 arch/arm64/kernel/signal.c             |  17 +
 arch/arm64/kvm/Kconfig                 |   1 +
 arch/arm64/kvm/Makefile                |   1 +
 arch/arm64/kvm/arm.c                   |  45 ++-
 arch/arm64/kvm/async_pf.c              | 462 ++++++++++++++++++++++++
 arch/arm64/kvm/hypercalls.c            |  37 +-
 arch/arm64/kvm/mmu.c                   |  47 ++-
 arch/arm64/kvm/sdei.c                  |   8 +
 include/acpi/actbl2.h                  |  18 +
 include/linux/arm-smccc.h              |  41 +++
 23 files changed, 1321 insertions(+), 27 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_para.h
 create mode 100644 arch/arm64/include/uapi/asm/kvm_para.h
 create mode 100644 arch/arm64/kernel/kvm.c
 create mode 100644 arch/arm64/kvm/async_pf.c

--
2.23.0

_______________________________________________
kvmarm mailing list
kvmarm@xxxxxxxxxxxxxxxxxxxxx
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm