From: Lai Jiangshan <jiangshan.ljs@xxxxxxxxxxxx> This RFC series proposes a new virtualization framework built upon the KVM hypervisor that does not require hardware-assisted virtualization techniques. PVM (Pagetable-based virtual machine) is implemented as a new vendor for KVM x86, which is compatible with the KVM virtualization software stack, such as Kata Containers, a secure container technique in a cloud-native environment. The work also led to a paper being accepted at SOSP 2023 [sosp-2023-acm] [sosp-2023-pdf], and Lai delivered a presentation at the symposium in Germany in October 2023 [sosp-2023-slides]: PVM: Efficient Shadow Paging for Deploying Secure Containers in Cloud-native Environment PVM has been adopted by Alibaba Cloud and Ant Group in production to host tens of thousands of secure containers daily, and it has also been adopted by the Openanolis community. Motivation ========== A team in Ant Group, co-creator of Kata Containers along with Intel, deploy the VM-based containers in our public cloud VM to satisfy dynamic resource requests and various needs to isolate workloads. However, for safety, nested virtualization is disabled in the L0 hypervisor, so we cannot use KVM directly. Additionally, the current nested architecture involves complex and expensive transitions between the L0 hypervisor and L1 hypervisor. So the over-arching goals of PVM are to completely decouple secure container hosting from the host hypervisor and hardware virtualization support to: 1) enable nested virtualization within any IaaS clouds without affecting the security, flexibility, and complexity of the cloud platform; 2) avoid costly exits to the host hypervisor and devise efficient world switching mechanisms. Why PVM ======= The PVM hypervisor has the following features: - Compatible with KVM ecosystems. - No requiremment for hardware assistance. Many cloud provider doesn't enable nested virtualization. And it can also enable KVM in TDX/SEV guests. - Flexible. Businesses with secure containers can easily expand in the cloud when demand surges, instead of waiting to accquire bare metal. Cloud vendors often offer lower pricing for spot instances or preemptible VMs. - Help for kernel CI with fast [re-]booting PVM guest kernels nested in cheeper VMs. - Enable light-weight container kernels. Design ====== The design detail can be found in our paper posted in SOSP2023. The framework contains 3 main objects: "Switcher" - The code and data that handling the VM enter and VM exit. "PVM hypervisor" - A new vendor implementation for KVM x86, it uses existed software emulation in KVM for virtualization, e.g., shadow paging, APIC emulation, x86 instruction emulator. "PVM paravirtual guest" - A PIE linux kernel runs in hardware CPL3, and use existed PVOPS to implement optimization. shadowed-user-pagetable shadowed-kernel-pagetable +----------|-----------+ | user | kernel | h_ring3 | (umod) | (smod) | +---+------|--------+--+ syscall | ^ ^ | hypercall/ interrupt/exception | | | | interrupt/exception --------------------------------|----|------|---|------------------------------------ | |sysret| | h_ring0 v | /iret| v +------+------+----+ | switcher | +---------+--------+ vm entry ^ | vm exit (function call)| v (function return) +..............................+..........................................+ . . . +---------------+ +--------------+ . . | kvm.ko | | kvm-pvm.ko | . . +---------------+ +--------------+ . . Virtualization . . memory virtualization CPU Virtualization . +.........................................................................+ PVM hypervisor 1. Switcher: To simplify, we reuse host entries to handle VM enter and VM exit, A flag is introduced to mark that the guest world is switched or during the switch in the entries. Therefore, the guest almost looks like a normal userspace process in the host. 2. Host MMU: The switcher needs to be accessed by the guest, which is similar to the CPU entry area for userspace in KPTI. Therefore, for simplification, we reserved a range of PGDs for the guest, and the guest kernel can only be allowed to run in this range. During the root SP allocation, the host PGDs of the switcher will be cloned into the guest SPT. 3. Event delivery: A new event delivery is used instead of the IDT-based event delivery. The event delivery in PVM is similar to FRED. Design Decisions ================ In designing PVM, many decisions have been made and explained in the patches. "Integral entry", "Exclusive address space separation and PIE guest", and "Simple spec design" are among important decisions besides for "KVM ecosystems" and "Ring3+Pagetable for privilege seperation". Integral entry -------------- The PVM switcher is integrated into the host kernel's entry code, providing the following advantages: - Full control: In XENPV/Lguest, the host Linux (dom0) entry code is subordinate to the hypervisor/switcher, and the host Linux kernel loses control over the entry code. This can cause inconvenience if there is a need to update something when there is a bug in the switcher or hardware. Integral entry gives the control back to the host kernel. - Zero overhead incurred: The integrated entry code doesn't cause any overhead in host Linux entry path, thanks to the discreet design with PVM code in the switcher, where the PVM path is bypassed on host events. While in XENPV/Lguest, host events must be handled by the hypervisor/switcher before being processed. - Integral design allows all aspects of the entry and switcher to be considered together. This RFC patchset doesn't include the complete design for integral entry. It requires fixing the issue with IST [atomic-ist-entry]. And it would be better with the conversion of some ASM code to C code [asm-to-c] (The link provided is not the final version, and some partial patchset had sent separately later on). The new version of the patches for converting ASM code and fixing the IST problem will be updated and sent separately later. Without the complete integral entry code, this patchset still has unresolved issues related to IST, KPTI, and so on. Exclusive address space separation and PIE guest ------------------------------------------------ In the higher half of the address spaces (where the most significant bits in the addresses are 1s), the address ranges that a PVM guest is allowed are exclusive from the host kernel. - The exclusivity of the address makes it possible to design the integral entry because the switcher needs to be mapped for all guests. - The exclusivity of the address allows the host kernel to still utilize global pages and save TLB entries. (XENPV doesn't allow it) - With exclusivity, the existing shadow page table code can be reused with very few changes. The shadow page table contains both the guest portions and the host portions. - Exclusivity necessitates the use of a Position-Independent Executable (PIE) guest since the host kernel occupies the top 2GB of the address space. - With PIE kernel, the PVM guest kernel in hardware ring3 can be located in the lower half of the address spaces in the future when Linear Address Space Separation (LASS) is enabled. This RFC patchset doesn't contain PIE patches which are not specific to PVM and our effort to make linux kernel PIE continues. Simple spec design ------------------ Designing a new paravirtualized guest is not an ideal opportunity to redesign the specification. However, in order to avoid the known flaws of x86_64 and enable the paravirtualized ABI on hardware ring3, the x86 PVM specification has some moderate differences from the x86 specification. - Remove/Ignore most indirect tables and 32-bit supervisor mode. - Simplified event delivery and the removal of IST. - Add some software synthetic instructions. See more details in the patch1 which contains the whole x86 PVM specification. Status ====== Current some features are not supported or disabled in PVM. - SMAP/SMEP can't be enabled directly, however, we can use PKU to emulate SMAP and use NX to emulate SMEP. - 5-level paging is not fully implemented. - Speculative control for guest is disabled. - LDT is not supported. - PMU virtualization is not implemented. Actually, we have reused the current code in pmu_intel.c and pmu_amd.c to implement it. PVM has been adopted in Alibaba Cloud and Ant Group for hosting secure containers, providing a more performant and cost-effective option for cloud users. Performance drawback ==================== The most significant drawback of PVM is shadowpaging. Shadowpaging results in very bad performance when guest applications frequently modify pagetable, including excessive processes forking. However, many long-running cloud services, such as Java, modify pagetables less frequently and can perform very well with shadowpaging. In some cases, they can even outperform EPT since they can avoid EPT TLB entries. Furthermore, PVM can utilize host PCIDs for guest processes, providing a finer-grained approach compared to VPID/ASID. To mitigate the performance problem, we designed several optimizations for the shadow MMU (not included in the patchset) and also planning to build a shadow EPT in L0 for L2 PVM guests. See the paper for more optimizations and the performance details. Future plans ============ Some optimizations are not covered in this series now. - Parallel Page fault for SPT and Paravirtualized MMU Optimization. - Post interrupt emulation. - Relocate guest kernel into userspace address range. - More flexible container solutions based on it. Patches layout ============== [01-02]: PVM ABI documentation and header [03-04]: Switcher implementation [05-49]: PVM hypervisor implementation - 05-13: KVM module involved changes - 14-49: PVM module implementation patch 15: Add a vmalloc helper to reserve a kernel address range for guest. patch 19: Export 32-bit ignore syscall for PVM. [50-73]: PVM guest implementation - 50-52: Pack relocation information into vmlinux and allow it to do relocation. - 53: Introduce Kconfig and cpu features. - 54-59: Relocate guest kernel to the allowed range. - 60-65: Event handling and hypercall. - 66-69: PVOPS implementation. - 70-73: Disable some features and syscalls. Code base ========= The code base is at branch [linux-pie] which is commit ceb6a6f023fd ("Linu 6.7-rc6") + PIE series [pie-patchset]. Complete code can be found at [linux-pvm]. Testing ======= Testing with Kata Containers can be found at [pvm-get-started]. We also provide a VM image based on the `Official Ubuntu Cloud Image`, which has containerd, kata, pvm hypervisor/guest, and configurations prepared and you can use to test Kata Containers with PVM directly. [pvm-get-started-nested-in-vm] [sosp-2023-acm]: https://dl.acm.org/doi/10.1145/3600006.3613158 [sosp-2023-pdf]: https://github.com/virt-pvm/misc/blob/main/sosp2023-pvm-paper.pdf [sosp-2023-slides]: https://github.com/virt-pvm/misc/blob/main/sosp2023-pvm-slides.pptx [asm-to-c]: https://lore.kernel.org/lkml/20211126101209.8613-1-jiangshanlai@xxxxxxxxx/ [atomic-ist-entry]: https://lore.kernel.org/lkml/20230403140605.540512-1-jiangshanlai@xxxxxxxxx/ [pie-patchset]: https://lore.kernel.org/lkml/cover.1682673542.git.houwenlong.hwl@xxxxxxxxxxxx [linux-pie]: https://github.com/virt-pvm/linux/tree/pie [linux-pvm]: https://github.com/virt-pvm/linux/tree/pvm [pvm-get-started]: https://github.com/virt-pvm/misc/blob/main/pvm-get-started-with-kata.md [pvm-get-started-nested-in-vm]: https://github.com/virt-pvm/misc/blob/main/pvm-get-started-with-kata.md#verify-kata-containers-with-pvm-using-vm-image Hou Wenlong (22): KVM: x86: Allow hypercall handling to not skip the instruction KVM: x86: Implement gpc refresh for guest usage KVM: x86/emulator: Reinject #GP if instruction emulation failed for PVM mm/vmalloc: Add a helper to reserve a contiguous and aligned kernel virtual area x86/entry: Export 32-bit ignore syscall entry and __ia32_enabled variable KVM: x86/PVM: Support for kvm_exit() tracepoint KVM: x86/PVM: Support for CPUID faulting x86/tools/relocs: Cleanup cmdline options x86/tools/relocs: Append relocations into input file x86/boot: Allow to do relocation for uncompressed kernel x86/pvm: Relocate kernel image to specific virtual address range x86/pvm: Relocate kernel image early in PVH entry x86/pvm: Make cpu entry area and vmalloc area variable x86/pvm: Relocate kernel address space layout x86/pvm: Allow to install a system interrupt handler x86/pvm: Add early kernel event entry and dispatch code x86/pvm: Enable PVM event delivery x86/pvm: Use new cpu feature to describe XENPV and PVM x86/pvm: Don't use SWAPGS for gsbase read/write x86/pvm: Adapt pushf/popf in this_cpu_cmpxchg16b_emu() x86/pvm: Use RDTSCP as default in vdso_read_cpunode() x86/pvm: Disable some unsupported syscalls and features Lai Jiangshan (51): KVM: Documentation: Add the specification for PVM x86/ABI/PVM: Add PVM-specific ABI header file x86/entry: Implement switcher for PVM VM enter/exit x86/entry: Implement direct switching for the switcher KVM: x86: Set 'vcpu->arch.exception.injected' as true before vendor callback KVM: x86: Move VMX interrupt/nmi handling into kvm.ko KVM: x86/mmu: Adapt shadow MMU for PVM KVM: x86: Add PVM virtual MSRs into emulated_msrs_all[] KVM: x86: Introduce vendor feature to expose vendor-specific CPUID KVM: x86: Add NR_VCPU_SREG in SREG enum KVM: x86: Create stubs for PVM module as a new vendor KVM: x86/PVM: Implement host mmu initialization KVM: x86/PVM: Implement module initialization related callbacks KVM: x86/PVM: Implement VM/VCPU initialization related callbacks KVM: x86/PVM: Implement vcpu_load()/vcpu_put() related callbacks KVM: x86/PVM: Implement vcpu_run() callbacks KVM: x86/PVM: Handle some VM exits before enable interrupts KVM: x86/PVM: Handle event handling related MSR read/write operation KVM: x86/PVM: Introduce PVM mode switching KVM: x86/PVM: Implement APIC emulation related callbacks KVM: x86/PVM: Implement event delivery flags related callbacks KVM: x86/PVM: Implement event injection related callbacks KVM: x86/PVM: Handle syscall from user mode KVM: x86/PVM: Implement allowed range checking for #PF KVM: x86/PVM: Implement segment related callbacks KVM: x86/PVM: Implement instruction emulation for #UD and #GP KVM: x86/PVM: Enable guest debugging functions KVM: x86/PVM: Handle VM-exit due to hardware exceptions KVM: x86/PVM: Handle ERETU/ERETS synthetic instruction KVM: x86/PVM: Handle PVM_SYNTHETIC_CPUID synthetic instruction KVM: x86/PVM: Handle KVM hypercall KVM: x86/PVM: Use host PCID to reduce guest TLB flushing KVM: x86/PVM: Handle hypercalls for privilege instruction emulation KVM: x86/PVM: Handle hypercall for CR3 switching KVM: x86/PVM: Handle hypercall for loading GS selector KVM: x86/PVM: Allow to load guest TLS in host GDT KVM: x86/PVM: Enable direct switching KVM: x86/PVM: Implement TSC related callbacks KVM: x86/PVM: Add dummy PMU related callbacks KVM: x86/PVM: Handle the left supported MSRs in msrs_to_save_base[] KVM: x86/PVM: Implement system registers setting callbacks KVM: x86/PVM: Implement emulation for non-PVM mode x86/pvm: Add Kconfig option and the CPU feature bit for PVM guest x86/pvm: Detect PVM hypervisor support x86/pti: Force enabling KPTI for PVM guest x86/pvm: Add event entry/exit and dispatch code x86/pvm: Add hypercall support x86/kvm: Patch KVM hypercall as PVM hypercall x86/pvm: Implement cpu related PVOPS x86/pvm: Implement irq related PVOPS x86/pvm: Implement mmu related PVOPS Documentation/virt/kvm/x86/pvm-spec.rst | 989 +++++++ arch/x86/Kconfig | 32 + arch/x86/Makefile.postlink | 9 +- arch/x86/entry/Makefile | 4 + arch/x86/entry/calling.h | 47 +- arch/x86/entry/common.c | 1 + arch/x86/entry/entry_64.S | 75 +- arch/x86/entry/entry_64_pvm.S | 189 ++ arch/x86/entry/entry_64_switcher.S | 270 ++ arch/x86/entry/vsyscall/vsyscall_64.c | 4 + arch/x86/include/asm/alternative.h | 14 + arch/x86/include/asm/cpufeatures.h | 2 + arch/x86/include/asm/disabled-features.h | 8 +- arch/x86/include/asm/idtentry.h | 12 +- arch/x86/include/asm/init.h | 5 + arch/x86/include/asm/kvm-x86-ops.h | 2 + arch/x86/include/asm/kvm_host.h | 30 +- arch/x86/include/asm/kvm_para.h | 7 + arch/x86/include/asm/page_64.h | 3 + arch/x86/include/asm/paravirt.h | 14 +- arch/x86/include/asm/pgtable_64_types.h | 14 +- arch/x86/include/asm/processor.h | 5 + arch/x86/include/asm/ptrace.h | 5 + arch/x86/include/asm/pvm_para.h | 103 + arch/x86/include/asm/segment.h | 14 +- arch/x86/include/asm/switcher.h | 119 + arch/x86/include/uapi/asm/kvm_para.h | 8 +- arch/x86/include/uapi/asm/pvm_para.h | 131 + arch/x86/kernel/Makefile | 1 + arch/x86/kernel/asm-offsets_64.c | 31 + arch/x86/kernel/cpu/common.c | 11 + arch/x86/kernel/head64.c | 10 + arch/x86/kernel/head64_identity.c | 108 +- arch/x86/kernel/head_64.S | 34 + arch/x86/kernel/idt.c | 2 + arch/x86/kernel/kvm.c | 2 + arch/x86/kernel/ldt.c | 3 + arch/x86/kernel/nmi.c | 8 +- arch/x86/kernel/process_64.c | 10 +- arch/x86/kernel/pvm.c | 579 ++++ arch/x86/kernel/traps.c | 3 + arch/x86/kernel/vmlinux.lds.S | 18 + arch/x86/kvm/Kconfig | 9 + arch/x86/kvm/Makefile | 5 +- arch/x86/kvm/cpuid.c | 26 +- arch/x86/kvm/cpuid.h | 3 + arch/x86/kvm/host_entry.S | 50 + arch/x86/kvm/mmu/mmu.c | 36 +- arch/x86/kvm/mmu/paging_tmpl.h | 3 + arch/x86/kvm/mmu/spte.c | 4 + arch/x86/kvm/pvm/host_mmu.c | 119 + arch/x86/kvm/pvm/pvm.c | 3257 ++++++++++++++++++++++ arch/x86/kvm/pvm/pvm.h | 169 ++ arch/x86/kvm/svm/svm.c | 4 + arch/x86/kvm/trace.h | 7 +- arch/x86/kvm/vmx/vmenter.S | 43 - arch/x86/kvm/vmx/vmx.c | 18 +- arch/x86/kvm/x86.c | 33 +- arch/x86/kvm/x86.h | 18 + arch/x86/mm/dump_pagetables.c | 3 +- arch/x86/mm/kaslr.c | 8 +- arch/x86/mm/pti.c | 7 + arch/x86/platform/pvh/enlighten.c | 22 + arch/x86/platform/pvh/head.S | 4 + arch/x86/tools/relocs.c | 88 +- arch/x86/tools/relocs.h | 20 +- arch/x86/tools/relocs_common.c | 38 +- arch/x86/xen/enlighten_pv.c | 1 + include/linux/kvm_host.h | 10 + include/linux/vmalloc.h | 2 + include/uapi/Kbuild | 4 + mm/vmalloc.c | 10 + virt/kvm/pfncache.c | 2 +- 73 files changed, 6793 insertions(+), 166 deletions(-) create mode 100644 Documentation/virt/kvm/x86/pvm-spec.rst create mode 100644 arch/x86/entry/entry_64_pvm.S create mode 100644 arch/x86/entry/entry_64_switcher.S create mode 100644 arch/x86/include/asm/pvm_para.h create mode 100644 arch/x86/include/asm/switcher.h create mode 100644 arch/x86/include/uapi/asm/pvm_para.h create mode 100644 arch/x86/kernel/pvm.c create mode 100644 arch/x86/kvm/host_entry.S create mode 100644 arch/x86/kvm/pvm/host_mmu.c create mode 100644 arch/x86/kvm/pvm/pvm.c create mode 100644 arch/x86/kvm/pvm/pvm.h -- 2.19.1.6.gb485710b