Re: folio_mmapped

Vishal Annapurve <vannapurve@xxxxxxxxxx> · Fri, 29 Mar 2024 11:38:49 -0700

On Thu, Mar 28, 2024 at 4:41 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> ....
> >
> >> The whole reason I brought up the guest_memfd+memfd pair idea is that you
> >> would similarly be able to do the conversion in the kernel, BUT, you'd never
> >> be able to mmap+GUP encrypted pages.
> >>
> >> Essentially you're using guest_memfd for what it was designed for: private
> >> memory that is inaccessible.
> >
> > Ack, that sounds pretty reasonable to me. But I think we'd still want to
> > make sure the other users of guest_memfd have the _desire_ to support
> > huge pages, migration, swap (probably longer term), and related
> > features, otherwise I don't think a guest_memfd-based option will
> > really work for us :-)
>
> *Probably* some easy way to get hugetlb pages into a guest_memfd would
> be by allocating them for a memfd and then converting/moving them into
> the guest_memfd part of the "fd pair" on conversion to private :)
>
> (but the "partial shared, partial private" case is and remains the ugly
> thing that is hard and I still don't think it makes sense. Maybe it
> could be handled somehow in such a dual approach with some enlightenment
> in the fds ... hard to find solutions for things that don't make any
> sense :P )
>

I would again emphasize that this use case exists for Confidential VMs,
whether we like it or not.

1) TDX hardware allows usage of 1G pages to back guest memory.
2) Larger VM sizes benefit more from 1G page sizes, which would be the
norm with VMs exposing GPU/TPU devices.
3) Confidential VMs will need to share host resources with
non-confidential VMs using 1G pages.
4) When using normal shmem/hugetlbfs files to back guest memory, this
use case was achievable by just manipulating guest page tables
(although at the cost of host safety, which led to the invention of
guest_memfd). Something equivalent "might be possible" with guest_memfd.

Without handling "partial shared, partial private", it is impractical
to support 1G pages for Confidential VMs (discounting any long-term
efforts to tame the guest VMs to play nice).

Maybe to handle this use case, all host-side shared memory usage of
guest_memfd (userspace, IOMMU, etc.) should be associated with (or
tracked via) file ranges rather than offsets within huge pages (as is
done for faulting in private memory pages when populating guest
EPTs/NPTs). Given the current guest behavior, the host MMU and IOMMU may
have to be forced to always map shared memory regions via 4KB
mappings.
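
To make the file-range idea concrete, here is a rough user-space sketch,
not kernel code. Everything in it is hypothetical (struct shared_range,
range_has_shared(), private_map_size(), shared_map_size()); none of
these names exist in guest_memfd or KVM. It assumes shared regions are
tracked as 4KB-aligned byte ranges of the file, and it ignores the 2M
level for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_4K (1ULL << 12)
#define HUGE_SIZE_1G (1ULL << 30)

/* Hypothetical: one shared region, tracked as a byte range of the file. */
struct shared_range {
	uint64_t start;	/* file offset, 4KB-aligned */
	uint64_t len;	/* length in bytes, multiple of 4KB */
};

/* True if [off, off + len) overlaps any tracked shared range. */
static bool range_has_shared(const struct shared_range *r, size_t n,
			     uint64_t off, uint64_t len)
{
	for (size_t i = 0; i < n; i++) {
		if (r[i].start < off + len && off < r[i].start + r[i].len)
			return true;
	}
	return false;
}

/*
 * Size at which a private mapping covering file offset 'off' could be
 * installed: 1G only if the enclosing 1G-aligned file range contains no
 * shared range at all, otherwise fall back to 4KB. (Assumption: a real
 * implementation would also consider 2M.)
 */
static uint64_t private_map_size(const struct shared_range *r, size_t n,
				 uint64_t off)
{
	assert((off & (PAGE_SIZE_4K - 1)) == 0);

	uint64_t base = off & ~(HUGE_SIZE_1G - 1);

	return range_has_shared(r, n, base, HUGE_SIZE_1G) ?
		PAGE_SIZE_4K : HUGE_SIZE_1G;
}

/* Host MMU/IOMMU mappings of shared memory are always forced to 4KB. */
static uint64_t shared_map_size(void)
{
	return PAGE_SIZE_4K;
}
```

With a single 4KB shared range inside the second 1G of the file,
private_map_size() still yields 1G for offsets in the first 1G, but
drops to 4KB for any offset in the second 1G. That is the "partial
shared, partial private" case in miniature.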

> I also do strongly believe that we want to see some HW-assisted
> migration support for guest_memfd pages. Swap, as you say, maybe in the
> long-term. After all, we're not interested in having MM features for
> backing memory that you could similarly find under Windows 95. Wait,
> that one did support swapping! :P
>
> But unfortunately, that's what the shiny new CoCo world currently
> offers. Well, excluding s390x secure execution, as discussed.
>
> --
> Cheers,
>
> David / dhildenb
>