On Wed, 2025-02-26 at 09:08 +0000, David Hildenbrand wrote:
> On 26.02.25 09:48, Patrick Roy wrote:
>>
>>
>> On Tue, 2025-02-25 at 16:54 +0000, David Hildenbrand wrote:
>>> On 21.02.25 17:07, Patrick Roy wrote:
>>>> Add KVM_GMEM_NO_DIRECT_MAP flag for KVM_CREATE_GUEST_MEMFD() ioctl. When
>>>> set, guest_memfd folios will be removed from the direct map after
>>>> preparation, with direct map entries only restored when the folios are
>>>> freed.
>>>>
>>>> To ensure these folios do not end up in places where the kernel cannot
>>>> deal with them, set AS_NO_DIRECT_MAP on the guest_memfd's struct
>>>> address_space if KVM_GMEM_NO_DIRECT_MAP is requested.
>>>>
>>>> Note that this flag causes removal of direct map entries for all
>>>> guest_memfd folios independent of whether they are "shared" or "private"
>>>> (although current guest_memfd only supports either all folios in the
>>>> "shared" state, or all folios in the "private" state if
>>>> !IS_ENABLED(CONFIG_KVM_GMEM_SHARED_MEM)). The use case for also removing
>>>> direct map entries of the shared parts of guest_memfd is a special
>>>> type of non-CoCo VM where host userspace is trusted to have access to
>>>> all of guest memory, but where Spectre-style transient execution attacks
>>>> through the host kernel's direct map should still be mitigated.
>>>>
>>>> Note that KVM retains access to guest memory via userspace
>>>> mappings of guest_memfd, which are reflected back into KVM's memslots
>>>> via userspace_addr. This is needed for things like MMIO emulation on
>>>> x86_64 to work. Previous iterations attempted to instead have KVM
>>>> temporarily restore direct map entries whenever such an access to guest
>>>> memory was needed, but this turned out to have a significant performance
>>>> impact, as well as additional complexity due to needing to refcount
>>>> direct map reinsertion operations and making them play nicely with gmem
>>>> truncations.
>>>>
>>>> This iteration also doesn't have KVM perform TLB flushes after direct
>>>> map manipulations. This is because TLB flushes resulted in an up to 40x
>>>> elongation of page faults in guest_memfd (scaling with the number of CPU
>>>> cores), or a 5x elongation of memory population. On the one hand, TLB
>>>> flushes are not needed for functional correctness (the virt->phys
>>>> mapping technically stays "correct", the kernel should simply not use it
>>>> for a while), so this is a correct optimization to make. On the other
>>>> hand, it means that the desired protection from Spectre-style attacks is
>>>> not perfect, as an attacker could try to prevent a stale TLB entry from
>>>> getting evicted, keeping it alive until the page it refers to is used by
>>>> the guest for some sensitive data, and then targeting it using a
>>>> spectre-gadget.
>>>>
>>>> Signed-off-by: Patrick Roy <roypat@xxxxxxxxxxxx>
>>>
>>> ...
>>>
>>>>
>>>> +static bool kvm_gmem_test_no_direct_map(struct inode *inode)
>>>> +{
>>>> +       return ((unsigned long) inode->i_private) & KVM_GMEM_NO_DIRECT_MAP;
>>>> +}
>>>> +
>>>>  static inline void kvm_gmem_mark_prepared(struct folio *folio)
>>>>  {
>>>> +       struct inode *inode = folio_inode(folio);
>>>> +
>>>> +       if (kvm_gmem_test_no_direct_map(inode)) {
>>>> +               int r = set_direct_map_valid_noflush(folio_page(folio, 0), folio_nr_pages(folio),
>>>> +                                                    false);
>>>
>>> Will this work if KVM is built as a module, or is this another good
>>> reason why we might want guest_memfd core part of core-mm?
>>
>> mh, I'm admittedly not too familiar with the differences that would come
>> from building KVM as a module vs not. I do remember something about the
>> direct map accessors not being available for modules, so this would
>> indeed not work. Does that mean moving gmem into core-mm will be a
>> pre-requisite for the direct map removal stuff?
>
> Likely, we'd need some shim.
>
> Maybe for the time being it could be fenced using #if IS_BUILTIN() ...
> but that sure won't win in a beauty contest.

Is anyone working on such a shim at the moment? Otherwise, would it make
sense for me to look into it? (Although I'll probably need a pointer or
two on what is actually needed.) I saw your comment on Fuad's series [1]
indicating that he'll also need some shim, so it probably makes sense to
tackle it anyway instead of hacking around it with #if-ery.

[1]: https://lore.kernel.org/kvm/8ddab670-8416-47f2-a5a6-94fb3444f328@xxxxxxxxxx/

> --
> Cheers,
>
> David / dhildenb
>

Best,
Patrick