On Tue, May 21, 2019 at 09:24:34AM +0200, Martin Lucina wrote:
> Hi all,
>
> as part of an effort to enforce W^X for the KVM backend of Solo5 [1], I'm
> trying to understand how host-side mprotect() interacts with the KVM MMU.
>
> Take a KVM guest on x86_64, where the guest runs exclusively in long mode,
> in virtual ring 0, using 1:1 2MB pages in the guest, and all guest page
> tables are RWX, i.e. no memory protection is enforced inside the guest
> itself. EPT is enabled on the host.
>
> Instead, our ELF loader applies a host-side mprotect(PROT_...) based on the
> protection bits in the guest application (unikernel) ELF PHDRs.
>
> The observed behaviour I see, from tests run inside the guest:
>
> 1. Attempting to WRITE to .text which has had mprotect(PROT_READ |
> PROT_EXEC) applied on the host side results in a EFAULT from KVM_RUN in the
> userspace tender (our equivalent of a VMM).
>
> 2. Attempting to EXECUTE code in .data which has had mprotect(PROT_READ |
> PROT_WRITE) applied on the host side succeeds.
>
> Questions:
>
> a. Is this the intended behaviour, and can it be relied on? Note that
> KVM/aarch64 behaves the same for me.
>
> b. Why does case (1) fail but case (2) succeed? I spent a day reading
> through the KVM MMU code, but failed to understand how this is implemented.

Case (1) fails because KVM explicitly grabs WRITE permissions when
retrieving the HPA.  See __gfn_to_pfn_memslot() and hva_to_pfn().  Note,
KVM also allows userspace to set a guest memslot as RO independent of
mprotect().

Case (2) doesn't fault because KVM doesn't support execute protection,
i.e. all pages are executable in the guest (at least on x86).  My guess
is that execute protection isn't supported because there isn't a strong
use case for traditional virtualization and so no one has gone through
the effort to add NX support.  E.g. the vast majority of system memory
can be dynamically allocated (for userspace code), which practically
speaking leaves only the guest kernel's data sections, and marking those
NX requires at a minimum:

  - knowing exactly what kernel will be loaded
  - no ASLR in the physical domain
  - no transient execution, e.g. in vBIOS or trampoline code

> c. In order to enforce W^X both ways I'd like to have case (2) also fail
> with EFAULT, is this possible?

Not without modifying KVM and the kernel (if you want to do it through
mprotect()).

>
> Martin
>
> [1] https://github.com/Solo5/solo5
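
For anyone who wants to see the case (1) failure mode without a guest,
here is a rough userspace analogy.  It is not the KVM code path and does
not touch /dev/kvm; it only shows that when the host kernel is asked to
write into a mapping that mprotect() has made read-only, the request
fails with EFAULT.  The helper name kernel_write_errno() is made up for
this sketch, and the pipe is just a convenient way to make the kernel do
the write on our behalf.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (an analogy, not KVM code): map one page, apply
 * the given protection with mprotect(), then ask the kernel to write a
 * byte into it via read() on a pipe.
 *
 * Returns 0 if the kernel-side write succeeded, a positive errno value
 * if read() failed, or -1 if the setup itself failed.
 */
int kernel_write_errno(int prot)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    /* Start out writable so the page can be populated, as a loader would. */
    unsigned char *buf = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    memset(buf, 0x90, (size_t)page);

    int result = -1;
    int fds[2] = { -1, -1 };

    if (mprotect(buf, (size_t)page, prot) == 0 &&
        pipe(fds) == 0 &&
        write(fds[1], "x", 1) == 1) {
        /* read() makes the kernel copy one byte *into* buf. */
        ssize_t n = read(fds[0], buf, 1);
        result = (n == 1) ? 0 : errno;
    }

    if (fds[0] >= 0)
        close(fds[0]);
    if (fds[1] >= 0)
        close(fds[1]);
    munmap(buf, (size_t)page);
    return result;
}
```

On Linux I would expect kernel_write_errno(PROT_READ) to return EFAULT
and kernel_write_errno(PROT_READ | PROT_WRITE) to return 0.  Adding
PROT_EXEC to the read-only case, as the loader does for .text, should not
change the outcome of the write.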