Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

Fuad Tabba <tabba@xxxxxxxxxx> · Fri, 30 Sep 2022 17:19:00 +0100

Hi,

On Tue, Sep 27, 2022 at 11:47 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Mon, Sep 26, 2022, Fuad Tabba wrote:
> > Hi,
> >
> > On Mon, Sep 26, 2022 at 3:28 PM Chao Peng <chao.p.peng@xxxxxxxxxxxxxxx> wrote:
> > >
> > > On Fri, Sep 23, 2022 at 04:19:46PM +0100, Fuad Tabba wrote:
> > > > > Then on the KVM side, its mmap_start() + mmap_end() sequence would:
> > > > >
> > > > >   1. Not be supported for TDX or SEV-SNP because they don't allow adding non-zero
> > > > >      memory into the guest (after pre-boot phase).
> > > > >
> > > > >   2. Be mutually exclusive with shared<=>private conversions, and is allowed if
> > > > >      and only if the entire gfn range of the associated memslot is shared.
> > > >
> > > > In general I think that this would work with pKVM. However, limiting
> > > > private<->shared conversions to the granularity of a whole memslot
> > > > might be difficult to handle in pKVM, since the guest doesn't have the
> > > > concept of memslots. For example, in pKVM right now, when a guest
> > > > shares back its restricted DMA pool with the host it does so at the
> > > > page-level.
>
> Y'all are killing me :-)

 :D

> Isn't the guest enlightened?  E.g. can't you tell the guest "thou shalt share at
> granularity X"?  With KVM's newfangled scalable memslots and per-vCPU MRU slot,
> X doesn't even have to be that high to get reasonable performance, e.g. assuming
> the DMA pool is at most 2GiB, that's "only" 1024 memslots, which is supposed to
> work just fine in KVM.

The guest is potentially enlightened, but the host doesn't necessarily
know which memslot the guest might want to share back, since it
doesn't know where the guest might want to place the DMA pool. If I
understand this correctly, for this to work, all memslots would need
to be the same size and sharing would always need to happen at that
granularity.
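
To put rough numbers on this, here is a back-of-the-envelope sketch in
plain userspace C. It is not kernel code: memslots_needed() is a
hypothetical helper, and the 2MiB granularity is my assumption,
inferred from 2GiB / 1024 in the figure above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not a KVM API: number of fixed-size memslots
 * needed to cover a pool when every conversion must be slot-sized.
 */
static uint64_t memslots_needed(uint64_t pool_bytes, uint64_t granule_bytes)
{
	assert(granule_bytes != 0);
	/* Round up so a partial trailing granule still gets its own slot. */
	return (pool_bytes + granule_bytes - 1) / granule_bytes;
}
```

With a 2GiB pool, memslots_needed(2ULL << 30, 2ULL << 20) gives the
1024 slots mentioned above, whereas sharing at 4KiB page granularity
would need 524288 slots for the same pool.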

Moreover, for something like a small DMA pool this might scale, but
I'm not sure about potential future workloads (e.g., multimedia
in-place sharing).

>
> > > > pKVM would also need a way to make an fd accessible again
> > > > when shared back, which I think isn't possible with this patch.
> > >
> > > But does pKVM really want to mmap/munmap a new region at the page-level?
> > > As I see it, that can cause VMA fragmentation if conversions are frequent.
> > > Even with a KVM ioctl for mapping as mentioned below, I think there will
> > > be the same issue.
> >
> > pKVM doesn't really need to unmap the memory. What is really important
> > is that the memory is not GUP'able.
>
> Well, not entirely unguppable, just unguppable without a magic FOLL_* flag,
> otherwise KVM wouldn't be able to get the PFN to map into guest memory.
>
> The problem is that gup() and "mapped" are tied together.  So yes, pKVM doesn't
> strictly need to unmap memory _in the untrusted host_, but since mapped==guppable,
> the end result is the same.
>
> Emphasis above because pKVM still needs to unmap the memory _somewhere_.  IIUC, the
> current approach is to do that only in the stage-2 page tables, i.e. only in the
> context of the hypervisor.  Which is also the source of the gup() problems; the
> untrusted kernel is blissfully unaware that the memory is inaccessible.
>
> Any approach that moves some of that information into the untrusted kernel so that
> the kernel can protect itself will incur fragmentation in the VMAs.  Well, unless
> all of guest memory becomes unguppable, but that's likely not a viable option.

Actually, for pKVM, there is no need for the guest memory to be
GUP'able at all if we use the new inaccessible_get_pfn(). This goes
back to what I mentioned in v7: representing the memslot memory as a
file descriptor should be orthogonal to whether the memory is shared
or private, rather than using a private_fd for private memory and the
userspace_addr for shared memory. The host can then map or unmap
shared/private memory through the fd, which gives it more freedom,
for example to unmap shared memory when it is not needed.
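
As a toy model of what I mean, consider the sketch below. It is
purely illustrative userspace C, not a proposed uAPI: struct
sketch_memslot and both helpers are hypothetical names. The backing
fd stays the same for the lifetime of the slot, and shared/private is
just per-page state on top of it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: one memslot backed by a single fd, shared or not. */
struct sketch_memslot {
	int fd;               /* backing fd, unchanged by conversions */
	uint64_t base_gfn;    /* first guest frame covered by the slot */
	uint64_t npages;      /* must be <= 64 in this toy model */
	uint64_t shared_mask; /* bit i set => page i is shared with the host */
};

/* Convert a single page; the slot itself is neither split nor replaced. */
static void sketch_set_shared(struct sketch_memslot *s, uint64_t gfn,
			      bool shared)
{
	assert(gfn >= s->base_gfn && gfn - s->base_gfn < s->npages);
	uint64_t bit = 1ULL << (gfn - s->base_gfn);

	if (shared)
		s->shared_mask |= bit;
	else
		s->shared_mask &= ~bit;
}

/* The host may map a page through the fd only while it is shared. */
static bool sketch_host_may_map(const struct sketch_memslot *s, uint64_t gfn)
{
	if (gfn < s->base_gfn || gfn - s->base_gfn >= s->npages)
		return false;
	return (s->shared_mask >> (gfn - s->base_gfn)) & 1;
}
```

In this model a page-level share from the guest only flips one bit,
so the conversion granularity is independent of the memslot size.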

Cheers,
/fuad