Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

Chao Peng <chao.p.peng@xxxxxxxxxxxxxxx> · Thu, 13 Oct 2022 21:34:57 +0800

On Fri, Sep 30, 2022 at 05:19:00PM +0100, Fuad Tabba wrote:
> Hi,
> 
> On Tue, Sep 27, 2022 at 11:47 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >
> > On Mon, Sep 26, 2022, Fuad Tabba wrote:
> > > Hi,
> > >
> > > On Mon, Sep 26, 2022 at 3:28 PM Chao Peng <chao.p.peng@xxxxxxxxxxxxxxx> wrote:
> > > >
> > > > On Fri, Sep 23, 2022 at 04:19:46PM +0100, Fuad Tabba wrote:
> > > > > > Then on the KVM side, its mmap_start() + mmap_end() sequence would:
> > > > > >
> > > > > >   1. Not be supported for TDX or SEV-SNP because they don't allow adding non-zero
> > > > > >      memory into the guest (after pre-boot phase).
> > > > > >
> > > > > >   2. Be mutually exclusive with shared<=>private conversions, and is allowed if
> > > > > >      and only if the entire gfn range of the associated memslot is shared.
> > > > >
> > > > > In general I think that this would work with pKVM. However, limiting
> > > > > private<->shared conversions to the granularity of a whole memslot
> > > > > might be difficult to handle in pKVM, since the guest doesn't have the
> > > > > concept of memslots. For example, in pKVM right now, when a guest
> > > > > shares back its restricted DMA pool with the host it does so at the
> > > > > page-level.
> >
> > Y'all are killing me :-)
> 
>  :D
> 
> > Isn't the guest enlightened?  E.g. can't you tell the guest "thou shalt share at
> > granularity X"?  With KVM's newfangled scalable memslots and per-vCPU MRU slot,
> > X doesn't even have to be that high to get reasonable performance, e.g. assuming
> > the DMA pool is at most 2GiB, that's "only" 1024 memslots, which is supposed to
> > work just fine in KVM.
> 
> The guest is potentially enlightened, but the host doesn't necessarily
> know which memslot the guest might want to share back, since it
> doesn't know where the guest might want to place the DMA pool. If I
> understand this correctly, for this to work, all memslots would need
> to be the same size and sharing would always need to happen at that
> granularity.
> 
> Moreover, for something like a small DMA pool this might scale, but
> I'm not sure about potential future workloads (e.g., multimedia
> in-place sharing).
> 
> >
> > > > > pKVM would also need a way to make an fd accessible again
> > > > > when shared back, which I think isn't possible with this patch.
> > > >
> > > > But does pKVM really want to mmap/munmap a new region at the page-level,
> > > > that can cause VMA fragmentation if the conversion is frequent as I see.
> > > > Even with a KVM ioctl for mapping as mentioned below, I think there will
> > > > be the same issue.
> > >
> > > pKVM doesn't really need to unmap the memory. What is really important
> > > is that the memory is not GUP'able.
> >
> > Well, not entirely unguppable, just unguppable without a magic FOLL_* flag,
> > otherwise KVM wouldn't be able to get the PFN to map into guest memory.
> >
> > The problem is that gup() and "mapped" are tied together.  So yes, pKVM doesn't
> > strictly need to unmap memory _in the untrusted host_, but since mapped==guppable,
> > the end result is the same.
> >
> > Emphasis above because pKVM still needs unmap the memory _somehwere_.  IIUC, the
> > current approach is to do that only in the stage-2 page tables, i.e. only in the
> > context of the hypervisor.  Which is also the source of the gup() problems; the
> > untrusted kernel is blissfully unaware that the memory is inaccessible.
> >
> > Any approach that moves some of that information into the untrusted kernel so that
> > the kernel can protect itself will incur fragmentation in the VMAs.  Well, unless
> > all of guest memory becomes unguppable, but that's likely not a viable option.
> 
> Actually, for pKVM, there is no need for the guest memory to be
> GUP'able at all if we use the new inaccessible_get_pfn().

If pKVM can use inaccessible_get_pfn() to get pfn and can avoid GUP (I
think that is the major concern?), do you see any other gap from
existing API? 

> This of
> course goes back to what I'd mentioned before in v7; it seems that
> representing the memslot memory as a file descriptor should be
> orthogonal to whether the memory is shared or private, rather than a
> private_fd for private memory and the userspace_addr for shared
> memory. The host can then map or unmap the shared/private memory using
> the fd, which allows it more freedom in even choosing to unmap shared
> memory when not needed, for example.

Using both private_fd and userspace_addr is only needed in TDX and other
confidential computing scenarios, pKVM may only use private_fd if the fd
can also be mmaped as a whole to userspace as Sean suggested.

Thanks,
Chao
> 
> Cheers,
> /fuad