On Tue, 2025-02-25 at 16:54 +0000, David Hildenbrand wrote:
> On 21.02.25 17:07, Patrick Roy wrote:
>> Add a KVM_GMEM_NO_DIRECT_MAP flag for the KVM_CREATE_GUEST_MEMFD()
>> ioctl. When set, guest_memfd folios will be removed from the direct
>> map after preparation, with direct map entries only restored when the
>> folios are freed.
>>
>> To ensure these folios do not end up in places where the kernel cannot
>> deal with them, set AS_NO_DIRECT_MAP on the guest_memfd's struct
>> address_space if KVM_GMEM_NO_DIRECT_MAP is requested.
>>
>> Note that this flag causes removal of direct map entries for all
>> guest_memfd folios independent of whether they are "shared" or
>> "private" (although current guest_memfd only supports either all
>> folios in the "shared" state, or all folios in the "private" state if
>> !IS_ENABLED(CONFIG_KVM_GMEM_SHARED_MEM)). The use case for removing
>> direct map entries of the shared parts of guest_memfd as well is a
>> special type of non-CoCo VM where host userspace is trusted to have
>> access to all of guest memory, but where Spectre-style transient
>> execution attacks through the host kernel's direct map should still
>> be mitigated.
>>
>> Note that KVM retains access to guest memory via userspace mappings
>> of guest_memfd, which are reflected back into KVM's memslots via
>> userspace_addr. This is needed for things like MMIO emulation on
>> x86_64 to work. Previous iterations attempted to instead have KVM
>> temporarily restore direct map entries whenever such an access to
>> guest memory was needed, but this turned out to have a significant
>> performance impact, as well as additional complexity due to needing
>> to refcount direct map reinsertion operations and making them play
>> nicely with gmem truncations.
>>
>> This iteration also doesn't have KVM perform TLB flushes after direct
>> map manipulations. This is because TLB flushes resulted in an up to
>> 40x elongation of page faults in guest_memfd (scaling with the number
>> of CPU cores), or a 5x elongation of memory population. On the one
>> hand, TLB flushes are not needed for functional correctness (the
>> virt->phys mapping technically stays "correct", the kernel simply
>> should not use it for a while), so this is a valid optimization to
>> make. On the other hand, it means that the desired protection from
>> Spectre-style attacks is not perfect, as an attacker could try to
>> prevent a stale TLB entry from getting evicted, keeping it alive
>> until the page it refers to is used by the guest for some sensitive
>> data, and then targeting it using a Spectre gadget.
>>
>> Signed-off-by: Patrick Roy <roypat@xxxxxxxxxxxx>
>
> ...
>
>>
>> +static bool kvm_gmem_test_no_direct_map(struct inode *inode)
>> +{
>> +	return ((unsigned long) inode->i_private) & KVM_GMEM_NO_DIRECT_MAP;
>> +}
>> +
>> static inline void kvm_gmem_mark_prepared(struct folio *folio)
>> {
>> +	struct inode *inode = folio_inode(folio);
>> +
>> +	if (kvm_gmem_test_no_direct_map(inode)) {
>> +		int r = set_direct_map_valid_noflush(folio_page(folio, 0), folio_nr_pages(folio),
>> +						     false);
>
> Will this work if KVM is built as a module, or is this another good
> reason why we might want the guest_memfd core to be part of core-mm?

Mh, I'm admittedly not too familiar with the differences that come from
building KVM as a module vs. built-in. I do remember something about the
direct map accessors not being available to modules, so this would indeed
not work. Does that mean moving gmem into core-mm will be a prerequisite
for the direct map removal stuff?
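
For reference, roughly how a VMM would opt into this from userspace; a
minimal sketch only. KVM_GMEM_NO_DIRECT_MAP's value below is a placeholder
for whatever this series' uapi header ends up defining, everything else is
the existing KVM_CREATE_GUEST_MEMFD interface:

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    #ifndef KVM_GMEM_NO_DIRECT_MAP
    /* Placeholder value; take the real one from this series' uapi header. */
    #define KVM_GMEM_NO_DIRECT_MAP (1ULL << 0)
    #endif

    /* Create a guest_memfd whose folios get zapped from the direct map. */
    static int create_no_direct_map_gmem(int vm_fd, uint64_t size)
    {
        struct kvm_create_guest_memfd gmem = {
            .size  = size,                   /* must be page-aligned */
            .flags = KVM_GMEM_NO_DIRECT_MAP,
        };

        /* Returns a new fd on success, or -1 with errno set on failure. */
        return ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);
    }

The returned fd would then presumably be wired into a memslot via
KVM_SET_USER_MEMORY_REGION2's guest_memfd field, with userspace_addr
pointing at a userspace mapping of the same memory so that KVM can keep
accessing it as described in the commit message above.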
> --
> Cheers,
>
> David / dhildenb
>

Best,
Patrick