On Tue, 2025-02-25 at 16:54 +0000, David Hildenbrand wrote:
> On 21.02.25 17:07, Patrick Roy wrote:
>> Add a KVM_GMEM_NO_DIRECT_MAP flag for the KVM_CREATE_GUEST_MEMFD()
>> ioctl. When set, guest_memfd folios will be removed from the direct
>> map after preparation, with direct map entries only restored when the
>> folios are freed.
>>
>> To ensure these folios do not end up in places where the kernel cannot
>> deal with them, set AS_NO_DIRECT_MAP on the guest_memfd's struct
>> address_space if KVM_GMEM_NO_DIRECT_MAP is requested.
>>
>> Note that this flag causes removal of direct map entries for all
>> guest_memfd folios independent of whether they are "shared" or
>> "private" (although current guest_memfd only supports either all
>> folios in the "shared" state, or all folios in the "private" state if
>> !IS_ENABLED(CONFIG_KVM_GMEM_SHARED_MEM)). The use case for removing
>> direct map entries of the shared parts of guest_memfd as well is a
>> special type of non-CoCo VM where host userspace is trusted to have
>> access to all of guest memory, but where Spectre-style transient
>> execution attacks through the host kernel's direct map should still
>> be mitigated.
>>
>> Note that KVM retains access to guest memory via userspace mappings
>> of guest_memfd, which are reflected back into KVM's memslots via
>> userspace_addr. This is needed for things like MMIO emulation on
>> x86_64 to work. Previous iterations attempted to instead have KVM
>> temporarily restore direct map entries whenever such an access to
>> guest memory was needed, but this turned out to have a significant
>> performance impact, as well as additional complexity due to needing
>> to refcount direct map reinsertion operations and making them play
>> nicely with gmem truncations.
>>
>> This iteration also doesn't have KVM perform TLB flushes after direct
>> map manipulations. This is because TLB flushes resulted in an up to
>> 40x elongation of page faults in guest_memfd (scaling with the number
>> of CPU cores), or a 5x elongation of memory population. On the one
>> hand, TLB flushes are not needed for functional correctness (the
>> virt->phys mapping technically stays "correct", the kernel simply
>> should not use it for a while), so this is a valid optimization to
>> make. On the other hand, it means that the desired protection from
>> Spectre-style attacks is not perfect, as an attacker could try to
>> prevent a stale TLB entry from getting evicted, keeping it alive
>> until the page it refers to is used by the guest for some sensitive
>> data, and then targeting it using a Spectre gadget.
>>
>> Signed-off-by: Patrick Roy <roypat@xxxxxxxxxxxx>
>
> ...
>
>>
>> +static bool kvm_gmem_test_no_direct_map(struct inode *inode)
>> +{
>> +	return ((unsigned long) inode->i_private) & KVM_GMEM_NO_DIRECT_MAP;
>> +}
>> +
>> static inline void kvm_gmem_mark_prepared(struct folio *folio)
>> {
>> +	struct inode *inode = folio_inode(folio);
>> +
>> +	if (kvm_gmem_test_no_direct_map(inode)) {
>> +		int r = set_direct_map_valid_noflush(folio_page(folio, 0), folio_nr_pages(folio),
>> +						     false);
>
> Will this work if KVM is built as a module, or is this another good
> reason why we might want the guest_memfd core to be part of core-mm?

Mh, I'm admittedly not too familiar with the differences that come from
building KVM as a module vs. built-in. I do remember something about the
direct map accessors not being available to modules, so this would indeed
not work. Does that mean moving gmem into core-mm will be a prerequisite
for the direct map removal stuff?
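
For reference, roughly how a VMM would opt into this from userspace; a
minimal sketch only. KVM_GMEM_NO_DIRECT_MAP's value below is a placeholder
for whatever this series' uapi header ends up defining, everything else is
the existing KVM_CREATE_GUEST_MEMFD interface:

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    #ifndef KVM_GMEM_NO_DIRECT_MAP
    /* Placeholder value; take the real one from this series' uapi header. */
    #define KVM_GMEM_NO_DIRECT_MAP (1ULL << 0)
    #endif

    /* Create a guest_memfd whose folios get zapped from the direct map. */
    static int create_no_direct_map_gmem(int vm_fd, uint64_t size)
    {
        struct kvm_create_guest_memfd gmem = {
            .size  = size,                   /* must be page-aligned */
            .flags = KVM_GMEM_NO_DIRECT_MAP,
        };

        /* Returns a new fd on success, or -1 with errno set on failure. */
        return ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);
    }

The returned fd would then presumably be wired into a memslot via
KVM_SET_USER_MEMORY_REGION2's guest_memfd field, with userspace_addr
pointing at a userspace mapping of the same memory so that KVM can keep
accessing it as described in the commit message above.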
> --
> Cheers,
>
> David / dhildenb
>

Best,
Patrick