On Fri, May 22, 2020 at 03:51:58PM +0300, Kirill A. Shutemov wrote:
> == Background / Problem ==
>
> There are a number of hardware features (MKTME, SEV) which protect guest
> memory from some unauthorized host access. The patchset proposes a purely
> software feature that mitigates some of the same host-side read-only
> attacks.

CC people who worked on the related patchsets.

> == What does this set mitigate? ==
>
>  - Host kernel "accidental" access to guest data (think speculation)
>
>  - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len))
>
>  - Host userspace access to guest data (compromised qemu)
>
> == What does this set NOT mitigate? ==
>
>  - Full host kernel compromise. Kernel will just map the pages again.
>
>  - Hardware attacks
>
>
> The patchset is RFC-quality: it works but has known issues that must be
> addressed before it can be considered for applying.
>
> We are looking for high-level feedback on the concept. Some open
> questions:
>
>  - This protects from some kernel and host userspace read-only attacks,
>    but does not place the host kernel outside the trust boundary. Is it
>    still valuable?
>
>  - Can this approach be used to avoid cache-coherency problems with
>    hardware encryption schemes that repurpose physical bits?
>
>  - The guest kernel must be modified for this to work. Is that a deal
>    breaker, especially for public clouds?
>
>  - Are the costs of removing pages from the direct map too high to be
>    feasible?
>
> == Series Overview ==
>
> The hardware features protect guest data by encrypting it and then
> ensuring that only the right guest can decrypt it. This has the
> side-effect of making the kernel direct map and userspace mapping
> (QEMU et al) useless. But, this teaches us something very useful:
> neither the kernel nor userspace mappings are really necessary for normal
> guest operations.
>
> Instead of using encryption, this series simply unmaps the memory.
> One advantage compared to allowing access to ciphertext is that it allows
> bad accesses to be caught instead of simply reading garbage.
>
> Protection from physical attacks needs to be provided by some other means.
> On Intel platforms, (single-key) Total Memory Encryption (TME) provides
> mitigation against physical attacks, such as DIMM interposers sniffing
> memory bus traffic.
>
> The patchset modifies both host and guest kernel. The guest OS must enable
> the feature via hypercall and mark any memory range that has to be shared
> with the host: DMA regions, bounce buffers, etc. SEV does this marking via a
> bit in the guest's page table while this approach uses a hypercall.
>
> To remove the userspace mapping, we use a trick similar to what NUMA
> balancing does: memory that belongs to KVM memory slots is converted to
> PROT_NONE. All existing entries are converted to PROT_NONE with mprotect(),
> and newly faulted-in pages get PROT_NONE from the updated vm_page_prot.
> The new VMA flag -- VM_KVM_PROTECTED -- indicates that the pages in the
> VMA must be treated in a special way in the GUP and fault paths. The flag
> allows GUP to return the page even though it is mapped with PROT_NONE, but
> only if the new GUP flag -- FOLL_KVM -- is specified. Any userspace access
> to the memory would result in SIGBUS. Any GUP access without FOLL_KVM
> would result in -EFAULT.
>
> Any anonymous page faulted into the VM_KVM_PROTECTED VMA gets removed from
> the direct mapping with kernel_map_pages(). Note that kernel_map_pages() only
> flushes the local TLB. I think it's a reasonable compromise between security
> and performance.
>
> Zapping the PTE would bring the page back to the direct mapping after
> clearing. At least for now, we don't remove file-backed pages from the
> direct mapping. File-backed pages could be accessed via read/write
> syscalls, which adds complexity.
>
> Occasionally, host kernel has to access guest memory that was not made
> shared by the guest.
> For instance, it happens for instruction emulation. Normally, it's done
> via copy_to/from_user() which would fail with -EFAULT now. We introduced
> a new pair of helpers: copy_to/from_guest(). The new helpers acquire the
> page via GUP, map it into kernel address space with a kmap_atomic()-style
> mechanism and only then copy the data.
>
> For some instruction emulation copying is not good enough: cmpxchg
> emulation has to have direct access to the guest memory. __kvm_map_gfn()
> is modified to accommodate the case.
>
> The patchset is on top of v5.7-rc6 plus this patch:
>
> https://lkml.kernel.org/r/20200402172507.2786-1-jimmyassarsson@xxxxxxxxx
>
> == Open Issues ==
>
> Unmapping the pages from the direct mapping brings a few issues that have
> not been rectified yet:
>
>  - Touching the direct mapping leads to fragmentation. We need to be able
>    to recover from it. I have a buggy patch that aims at recovering 2M/1G
>    pages. It has to be fixed and tested properly.
>
>  - Page migration and KSM are not supported yet.
>
>  - Live migration of a guest would require a new flow. Not sure yet what
>    it would look like.
>
>  - The feature interferes with NUMA balancing. Not sure yet if it's
>    possible to make them work together.
>
>  - Guests have no mechanism to ensure that even a well-behaving host has
>    unmapped its private data. With SEV, for instance, the guest only has
>    to trust the hardware to encrypt a page after the C bit is set in a
>    guest PTE. A mechanism for a guest to query the host mapping state, or
>    to constantly assert the intent for a page to be Private would be
>    valuable.

--
 Kirill A. Shutemov
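
For readers who want to see the userspace half of the idea in isolation,
here is a minimal userspace-only sketch of the PROT_NONE conversion
described in the series overview. It is an analogy, not code from the
patchset: the helper names probe_page() and signal_on_read() are made up
for illustration, nothing here touches the kernel direct map, and an
ordinary PROT_NONE mapping on Linux raises SIGSEGV, whereas the series
reports SIGBUS for VM_KVM_PROTECTED VMAs. The sketch only shows that once
mprotect() has dropped all permissions, a plain userspace load no longer
reaches the data.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child that reads one byte at addr.
 * Returns the signal that killed the child, 0 if the read succeeded,
 * or -1 if fork()/waitpid() failed.
 */
int signal_on_read(const volatile char *addr)
{
	pid_t pid;
	int status;

	assert(addr != NULL);

	pid = fork();
	if (pid < 0)
		return -1;

	if (pid == 0) {
		/* Child: the load below faults if the page is PROT_NONE. */
		char c = *addr;
		(void)c;
		_exit(0);
	}

	if (waitpid(pid, &status, 0) < 0)
		return -1;

	return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}

/*
 * Hypothetical helper: map one anonymous page, fill it with data and,
 * if protect is non-zero, convert it to PROT_NONE with mprotect().
 * Returns what signal_on_read() observed, or -1 on setup failure.
 */
int probe_page(int protect)
{
	long page = sysconf(_SC_PAGESIZE);
	char *mem;
	int sig;

	if (page <= 0)
		return -1;

	mem = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (mem == MAP_FAILED)
		return -1;

	/* Stand-in for guest data that userspace should not see. */
	memset(mem, 0x5a, (size_t)page);

	if (protect && mprotect(mem, (size_t)page, PROT_NONE) != 0) {
		munmap(mem, (size_t)page);
		return -1;
	}

	sig = signal_on_read(mem);
	munmap(mem, (size_t)page);
	return sig;
}
```

Calling probe_page(0) should return 0 because the child reads the page
normally, while probe_page(1) should return a non-zero signal number
because the child is killed on the first load. The fork is only there so
the faulting access does not take down the calling process.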