This patchset enables KVM guests to create execute-only (XO) memory by
utilizing EPT-based XO permissions. XO memory is currently supported natively
on Intel hardware for CPUs with PKU, but this enables it on older platforms,
and can support XO for kernel memory as well.

In the guest, this patchset enables XO memory for userspace, using the existing
interface (mprotect PROT_EXEC && !PROT_READ) used for arm64 and x86 PKU HW. A
larger follow-on to this enables setting the kernel text as XO, but this is just
the KVM pieces and guest userspace. The as yet unposted QEMU patches to work
with these changes are here:
https://github.com/redgecombe/qemu/

Guest Interface
===============
The way XO is exposed to the guest is by creating a virtual XO permission bit in
the guest page tables.

There are normally four kinds of page table bits:
1. Bits ignored by the hardware
2. Bits that must be 0 or else the hardware throws a RSVD page fault
3. Bits used by the hardware for addresses
4. Bits used by the hardware for permissions and other features

We want to find a bit in the guest page tables to use to mean execute-only
memory, so that the guest can map the same physical memory with different
permissions simultaneously, as it can with other permission bits. We also want
the translations to be done by the hardware, which means we can't use ignored or
reserved bits. We also can't easily re-purpose a feature bit. This leaves
address bits. The idea here is that we take an address bit and re-purpose it as
a feature bit.

The first thing we have to do is tell the guest that it can't use the address
bit we are stealing. Luckily there is an existing CPUID leaf that conveys the
number of physical address bits, which is already intercepted by KVM, and so we
can reduce it as needed. This puts what was previously the top physical address
bit into what is defined as the "reserved area" of the PTE.

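As a rough illustration of the bit accounting described above, here is a
minimal sketch in C. It assumes the physical address width comes from
CPUID.80000008H:EAX[7:0]; the helper names are hypothetical and are not taken
from the patches.

```c
#include <assert.h>
#include <stdint.h>

/* CPUID.80000008H:EAX[7:0] holds the number of physical address bits. */
static inline unsigned int phys_bits_from_eax(uint32_t eax)
{
	return eax & 0xff;
}

/*
 * Hypothetical helper: the width a VMM would report to the guest after
 * taking the top physical address bit for XO.
 */
static inline unsigned int guest_phys_bits(unsigned int host_phys_bits)
{
	assert(host_phys_bits > 13 && host_phys_bits <= 52);
	return host_phys_bits - 1;
}

/*
 * Hypothetical helper: PTE mask of the virtual XO bit. With M guest-visible
 * physical bits, the stolen bit is bit M, the lowest "reserved" bit.
 */
static inline uint64_t xo_pte_mask(unsigned int guest_bits)
{
	assert(guest_bits >= 12 && guest_bits < 52);
	return 1ULL << guest_bits;
}

/* Hypothetical helper: the PFN field as the guest sees it, bits M-1 to 12. */
static inline uint64_t guest_pfn_mask(unsigned int guest_bits)
{
	assert(guest_bits >= 12 && guest_bits < 52);
	return ((1ULL << guest_bits) - 1) & ~0xfffULL;
}
```

For example, a host with 39 physical address bits would report 38 to the guest
under this scheme, making bit 38 the XO bit and leaving bits 37 to 12 as the
guest-visible PFN field.
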
Here is how the PTE would be transformed, where M is the number of physical bits
exposed by the CPUID leaf.

Normal:
|--------------------------------------------------------|
| .. |     RSVD (51 to M)     | PFN (M-1 to 12)     | .. |
|--------------------------------------------------------|

KVM XO (with M reduced by 1):
|--------------------------------------------------------|
| .. | RSVD (51 to M+1)  | XO | PFN (M-1 to 12)     | .. |
|--------------------------------------------------------|

So the way XO is exposed to the guest is by having the VMM provide two aliases
in the guest physical address space for the same memory. The first half has
normal EPT permissions, and the second half has XO permissions. This way the
high PFN bit in the guest page tables acts like an XO permission bit. The VMM
reports to the guest a number of physical address bits that excludes the XO bit,
so from the guest's perspective the XO bit is in the region that would be
"reserved", and from the CPU's perspective the bit is still a normal PFN bit.

Backwards Compatibility
-----------------------
Previously, software would have received a #PF with the RSVD error code set when
the HW encountered any set bits in the region 51 to M. There was therefore some
internal discussion on whether this should have a virtual MSR, so that the OS
turns the feature on only if it knows it isn't relying on this behavior for
bit M. The argument against needing an MSR is this blurb from the Intel SDM
about reserved bits:
"Bits reserved in the paging-structure entries are reserved for future
functionality. Software developers should be aware that such bits may be used in
the future and that a paging-structure entry that causes a page-fault exception
on one processor might not do so in the future."

So in the current patchset there is no MSR write required for the guest to turn
on this feature. It will have this behavior whenever qemu is run with
"-cpu +xo".

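To make the two-alias layout described above more concrete, here is a minimal
sketch in C of how a VMM might derive the XO alias of a normal region. The
struct, the helper names and the flag value are all hypothetical stand-ins for
illustration; in particular SKETCH_MEM_EXECONLY is a placeholder and not the
real flag value used by the patches.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder flag for this sketch only; NOT the real memslot flag value. */
#define SKETCH_MEM_EXECONLY (1u << 31)

/* Simplified, hypothetical stand-in for a memslot description. */
struct sketch_slot {
	uint64_t gpa;    /* guest physical base address */
	uint64_t size;   /* size in bytes */
	uint64_t hva;    /* backing userspace address */
	uint32_t flags;
};

/*
 * Build the XO alias of a normal slot: same backing memory and size, but
 * placed in the upper half of the guest physical space (bit M set).
 */
static inline struct sketch_slot make_xo_alias(struct sketch_slot normal,
					       unsigned int guest_bits)
{
	struct sketch_slot xo = normal;
	uint64_t xo_bit;

	assert(guest_bits >= 12 && guest_bits < 52);
	xo_bit = 1ULL << guest_bits;

	/* The normal slot must lie wholly in the lower half. */
	assert(normal.size <= xo_bit && normal.gpa <= xo_bit - normal.size);

	xo.gpa = normal.gpa | xo_bit;
	xo.flags = normal.flags | SKETCH_MEM_EXECONLY;
	return xo;
}

/* Strip the permission bit to recover the guest's view of an address. */
static inline uint64_t gpa_guest_view(uint64_t gpa, unsigned int guest_bits)
{
	assert(guest_bits >= 12 && guest_bits < 52);
	return gpa & ~(1ULL << guest_bits);
}
```

With 38 guest-visible physical bits, a normal region at 0x100000 would get an
XO alias at 0x4000100000 backed by the same userspace memory, and masking off
bit 38 maps either address back to the guest's view.
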
KVM XO CPUID Feature Bit
------------------------
Although this patchset targets KVM, the idea is that this interface might be
implemented by other hypervisors. Since it looks like a normal CPU feature, it
would be nice if there were a single CPUID bit to check across different
implementations, as there often is for real CPU features. In the past there was
a proposal for "generic leaves" [1], where regions are assigned for VMMs to
define, but where the behavior will not change across VMMs. This patchset
follows this proposal and defines a bit in a new leaf to expose the presence of
the behavior described above. I'm hoping to get some suggestions on the right
way to expose it through this RFC.

Injecting Page Faults
---------------------
When there is an attempt to read memory from an XO address range, a #PF is
injected into the guest with P=1, W/R=0, RSVD=0, I/D=0. When there is an attempt
to write, it is P=1, W/R=1, RSVD=0, I/D=0.

Implementation
==============
In KVM this patchset adds a new memslot type, KVM_MEM_EXECONLY, which maps
memory as execute-only via EPT permissions, and will inject a #PF to the guest
if there is a violation. The x86 emulator is also made aware of XO memory
permissions, and virtualized features that act on PFNs are made aware that VT's
view of the GFN includes the permission bit (and so it needs to be masked to get
the guest's view of the PFN).

QEMU manipulates the physical address bits exposed to the guest and adds an
extra KVM_MEM_EXECONLY memslot that points to the same userspace memory in the
XO range for every memslot added in the normal range.

The violating linear address is taken from the EPT feature that provides the
linear address of the violation, if available. If it is not available, KVM
emulates the violating instruction to determine which linear address to use in
the injected fault.

Performance
===========
The performance impact is not fully characterized yet.
In the larger patchset that sets kernel text to be XO, there wasn't any
measurable impact compiling the kernel. The hope is that there will not be a
large impact, but more testing is needed.

Status
======
Regression testing is still needed, including the nested virtualization case and
the impact of XO in the other memslot address spaces. This is based on 5.3.

[1] https://lwn.net/Articles/301888/

Rick Edgecombe (13):
  kvm: Enable MTRR to work with GFNs with perm bits
  kvm: Add support for X86_FEATURE_KVM_XO
  kvm: Add XO memslot type
  kvm, vmx: Add support for gva exit qualification
  kvm: Add #PF injection for KVM XO
  kvm: Add KVM_CAP_EXECONLY_MEM
  kvm: Add docs for KVM_CAP_EXECONLY_MEM
  x86/boot: Rename USE_EARLY_PGTABLE_L5
  x86/cpufeature: Add detection of KVM XO
  x86/mm: Add NR page bit for KVM XO
  x86, ptdump: Add NR bit to page table dump
  mmap: Add XO support for KVM XO
  x86/Kconfig: Add Kconfig for KVM based XO

 Documentation/virt/kvm/api.txt               | 16 ++--
 arch/x86/Kconfig                             | 13 +++
 arch/x86/boot/compressed/misc.h              |  2 +-
 arch/x86/include/asm/cpufeature.h            |  7 +-
 arch/x86/include/asm/cpufeatures.h           |  5 +-
 arch/x86/include/asm/disabled-features.h     |  3 +-
 arch/x86/include/asm/kvm_host.h              |  7 ++
 arch/x86/include/asm/pgtable_32_types.h      |  1 +
 arch/x86/include/asm/pgtable_64_types.h      | 30 ++++++-
 arch/x86/include/asm/pgtable_types.h         | 13 +++
 arch/x86/include/asm/required-features.h     |  3 +-
 arch/x86/include/asm/sparsemem.h             |  4 +-
 arch/x86/include/asm/vmx.h                   |  1 +
 arch/x86/include/uapi/asm/kvm_para.h         |  3 +
 arch/x86/kernel/cpu/common.c                 |  7 +-
 arch/x86/kernel/head64.c                     | 43 +++++++++-
 arch/x86/kvm/cpuid.c                         |  7 ++
 arch/x86/kvm/cpuid.h                         |  1 +
 arch/x86/kvm/mmu.c                           | 79 +++++++++++++++++--
 arch/x86/kvm/mtrr.c                          |  8 ++
 arch/x86/kvm/paging_tmpl.h                   | 29 +++++--
 arch/x86/kvm/svm.c                           |  6 ++
 arch/x86/kvm/vmx/vmx.c                       |  6 ++
 arch/x86/kvm/x86.c                           |  9 ++-
 arch/x86/mm/dump_pagetables.c                |  6 +-
 arch/x86/mm/init.c                           |  3 +
 arch/x86/mm/kasan_init_64.c                  |  2 +-
 include/uapi/linux/kvm.h                     |  2 +
 mm/mmap.c                                    | 30 +++++--
 .../arch/x86/include/asm/disabled-features.h |  3 +-
 tools/include/uapi/linux/kvm.h               |  1 +
 virt/kvm/kvm_main.c                          | 15 +++-
 32 files changed, 322 insertions(+), 43 deletions(-)

-- 
2.17.1