On Mon, Mar 4, 2024 at 12:17 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 04.03.24 20:04, Sean Christopherson wrote:
> > On Mon, Mar 04, 2024, Quentin Perret wrote:
> >>> As discussed in the sub-thread, that might still be required.
> >>>
> >>> One could think about completely forbidding GUP on these mmap'ed
> >>> guest-memfds. But likely, there might be use cases in the future where you
> >>> want to use GUP on shared memory inside a guest_memfd.
> >>>
> >>> (the iouring example I gave might currently not work because
> >>> FOLL_PIN|FOLL_LONGTERM|FOLL_WRITE only works on shmem+hugetlb, and
> >>> guest_memfd will likely not be detected as shmem; 8ac268436e6d contains some
> >>> details)
> >>
> >> Perhaps it would be wise to start with GUP being forbidden if the
> >> current users do not need it (not sure if that is the case in Android,
> >> I'll check) ? We can always relax this constraint later when/if the
> >> use-cases arise, which is obviously much harder to do the other way
> >> around.
> >
> > +1000. At least on the KVM side, I would like to be as conservative as possible
> > when it comes to letting anything other than the guest access guest_memfd.
>
> So we'll have to do it similar to any occurrences of "secretmem" in
> gup.c. We'll have to see how to marry KVM guest_memfd with core-mm code
> similar to e.g., folio_is_secretmem().
>
> IIRC, we might not be able to de-reference the actual mapping because it
> could get free concurrently ...
>
> That will then prohibit any kind of GUP access to these pages, including
> reading/writing for ptrace/debugging purposes, for core dumping purposes
> etc. But at least, you know that nobody was able to obtain page
> references using GUP that might be used for reading/writing later.
>

There has been little discussion about supporting 1G pages with
guest_memfd for TDX/SNP or pKVM. I would like to restart this
discussion [1].
1G pages should be a very important use case for guest memfd,
especially considering large VM sizes supporting confidential GPU/TPU
workloads.

Using separate backing stores for private and shared memory ranges is
not going to work effectively when using 1G pages. Consider the
following scenario of memory conversion when using 1G pages to back
private memory:

* Guest requests conversion of a 4KB range from private to shared; the
  host, in response, ideally does the following steps:
    a) Updates the guest memory attributes
    b) Unbacks the corresponding private memory
    c) Allocates the corresponding shared memory, or lets it be faulted
       in when the guest accesses it

Step b above can't be skipped here, otherwise we would have two
physical pages (one backing private memory, another backing the shared
memory) for the same GPA range, causing "double allocation".

With 1G pages, it would be difficult to punch KB- or even MB-sized
holes, since to support that the 1G page would need to be split (which
hugetlbfs doesn't support today, for good reasons), causing:
    - loss of the vmemmap optimization [3]
    - loss of the ability to reconstitute the huge page again,
      especially as private pages in CVMs are not relocatable today,
      increasing overall fragmentation over time
        - unless a smarter algorithm is devised for memory reclaim to
          reconstitute large pages for unmovable memory

With the above limitations in place, the best option could be to allow:
- a single backing store for both shared and private memory ranges
- host userspace to mmap the guest memfd (as this series is trying to do)
- userspace to fault in memfd file ranges that correspond to shared
  GPA ranges
    - pagetable mappings will need to be restricted to shared memory
      ranges, resulting in finer-granularity mappings than 1G (somewhat
      similar to what the HGM series from James [2] was trying to do)
- the IOMMU also to map those pages (pfns would be requested using
  get_user_pages* APIs) so that devices can access shared memory.
  IOMMU management code would have to be enlightened or somehow
  restricted to map only shared regions of guest memfd.
- Upon conversion from shared to private, the host will have to ensure
  that there are no mappings/references present for the memory ranges
  being converted to private.

If the above use case sounds reasonable, GUP access to guest memfd
pages should be allowed.

[1] https://lore.kernel.org/lkml/CAGtprH_H1afUJ2cUnznWqYLTZVuEcOogRwXF6uBAeHbLMQsrsQ@xxxxxxxxxxxxxx/
[2] https://lore.kernel.org/lkml/20230218002819.1486479-2-jthoughton@xxxxxxxxxx/
[3] https://docs.kernel.org/mm/vmemmap_dedup.html

> --
> Cheers,
>
> David / dhildenb
>
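
To make the proposed rules a bit more concrete, here is a rough
userspace toy model in C of a single 1G backing page that serves both
private and shared 4K ranges. It is only a sketch of the bookkeeping,
not kernel code and not what this series implements: every toy_* name
is hypothetical, "refs" stands in for whatever mappings or GUP
references the host may hold, and the error codes are my own choice.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT  12
#define TOY_PAGE_SIZE   ((size_t)1 << TOY_PAGE_SHIFT)
#define TOY_HUGE_SIZE   ((size_t)1 << 30)
#define TOY_NR_SUBPAGES (TOY_HUGE_SIZE / TOY_PAGE_SIZE)

/* Hypothetical model: one 1G page backing both private and shared ranges. */
struct toy_gmem {
	bool     shared[TOY_NR_SUBPAGES]; /* false = private (the default) */
	uint32_t refs[TOY_NR_SUBPAGES];   /* host references, e.g. taken via GUP */
};

/* A range is valid if it is non-empty, 4K aligned and inside the 1G page. */
bool toy_range_ok(size_t off, size_t len)
{
	return len != 0 && off % TOY_PAGE_SIZE == 0 &&
	       len % TOY_PAGE_SIZE == 0 &&
	       off <= TOY_HUGE_SIZE && len <= TOY_HUGE_SIZE - off;
}

/* private -> shared: only the attribute flips, the 1G page is never split. */
int toy_convert_to_shared(struct toy_gmem *g, size_t off, size_t len)
{
	if (!toy_range_ok(off, len))
		return -EINVAL;
	for (size_t i = off >> TOY_PAGE_SHIFT;
	     i < (off + len) >> TOY_PAGE_SHIFT; i++)
		g->shared[i] = true;
	return 0;
}

/* shared -> private: refuse while any host reference is still outstanding. */
int toy_convert_to_private(struct toy_gmem *g, size_t off, size_t len)
{
	size_t first, last;

	if (!toy_range_ok(off, len))
		return -EINVAL;
	first = off >> TOY_PAGE_SHIFT;
	last = (off + len) >> TOY_PAGE_SHIFT;
	for (size_t i = first; i < last; i++)
		if (g->refs[i])
			return -EBUSY;
	for (size_t i = first; i < last; i++)
		g->shared[i] = false;
	return 0;
}

/* Host fault or GUP-style pin of one 4K subpage: shared ranges only. */
int toy_get_page(struct toy_gmem *g, size_t off)
{
	if (off >= TOY_HUGE_SIZE)
		return -EINVAL;
	if (!g->shared[off >> TOY_PAGE_SHIFT])
		return -EFAULT;
	g->refs[off >> TOY_PAGE_SHIFT]++;
	return 0;
}

/* Drop a reference previously taken with toy_get_page(). */
void toy_put_page(struct toy_gmem *g, size_t off)
{
	assert(off < TOY_HUGE_SIZE && g->refs[off >> TOY_PAGE_SHIFT] > 0);
	g->refs[off >> TOY_PAGE_SHIFT]--;
}
```

In this model, converting a 4K range to shared needs no hole punch and
no second allocation, so the "double allocation" problem does not
arise. The cost moves to the shared-to-private direction, where the
conversion has to fail (or wait) until the host has dropped all
references to the range.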