Re: [RFC PATCH 12/21] KVM: IOMMUFD: MEMFD: Map private pages

Jason Gunthorpe <jgg@xxxxxxxxxx> · Mon, 23 Sep 2024 13:02:39 -0300

On Mon, Sep 23, 2024 at 08:24:40AM +0000, Tian, Kevin wrote:
> > From: Vishal Annapurve <vannapurve@xxxxxxxxxx>
> > Sent: Monday, September 23, 2024 2:34 PM
> > 
> > On Mon, Sep 23, 2024 at 7:36 AM Tian, Kevin <kevin.tian@xxxxxxxxx> wrote:
> > >
> > > > From: Vishal Annapurve <vannapurve@xxxxxxxxxx>
> > > > Sent: Saturday, September 21, 2024 5:11 AM
> > > >
> > > > On Sun, Sep 15, 2024 at 11:08 PM Jason Gunthorpe <jgg@xxxxxxxxxx>
> > wrote:
> > > > >
> > > > > On Fri, Aug 23, 2024 at 11:21:26PM +1000, Alexey Kardashevskiy wrote:
> > > > > > IOMMUFD calls get_user_pages() for every mapping which will
> > allocate
> > > > > > shared memory instead of using private memory managed by the
> > KVM
> > > > and
> > > > > > MEMFD.
> > > > >
> > > > > Please check this series, it is much more how I would expect this to
> > > > > work. Use the guest memfd directly and forget about kvm in the
> > iommufd
> > > > code:
> > > > >
> > > > > https://lore.kernel.org/r/1726319158-283074-1-git-send-email-
> > > > steven.sistare@xxxxxxxxxx
> > > > >
> > > > > I would imagine you'd detect the guest memfd when accepting the FD
> > and
> > > > > then having some different path in the pinning logic to pin and get
> > > > > the physical ranges out.
> > > >
> > > > According to the discussion at KVM microconference around hugepage
> > > > support for guest_memfd [1], it's imperative that guest private memory
> > > > is not long term pinned. Ideal way to implement this integration would
> > > > be to support a notifier that can be invoked by guest_memfd when
> > > > memory ranges get truncated so that IOMMU can unmap the
> > corresponding
> > > > ranges. Such a notifier should also get called during memory
> > > > conversion, it would be interesting to discuss how conversion flow
> > > > would work in this case.
> > > >
> > > > [1] https://lpc.events/event/18/contributions/1764/ (checkout the
> > > > slide 12 from attached presentation)
> > > >
> > >
> > > Most devices don't support I/O page fault hence can only DMA to long
> > > term pinned buffers. The notifier might be helpful for in-kernel conversion
> > > but as a basic requirement there needs a way for IOMMUFD to call into
> > > guest memfd to request long term pinning for a given range. That is
> > > how I interpreted "different path" in Jason's comment.
> > 
> > Policy that is being aimed here:
> > 1) guest_memfd will pin the pages backing guest memory for all users.
> > 2) kvm_gmem_get_pfn users will get a locked folio with elevated
> > refcount when asking for the pfn/page from guest_memfd. Users will
> > drop the refcount and release the folio lock when they are done
> > using/installing (e.g. in KVM EPT/IOMMU PT entries) it. This folio
> > lock is supposed to be held for short durations.
> > 3) Users can assume the pfn is around until they are notified by
> > guest_memfd on truncation or memory conversion.
> > 
> > Step 3 above is already followed by KVM EPT setup logic for CoCo VMs.
> > TDX VMs especially need to have secure EPT entries always mapped (once
> > faulted-in) while the guest memory ranges are private.
> 
> 'faulted-in' doesn't work for device DMAs (w/o IOPF).
> 
> and above is based on the assumption that CoCo VM will always
> map/pin the private memory pages until a conversion happens.
> 
> Conversion is initiated by the guest so ideally the guest is responsible 
> for not leaving any in-fly DMAs to the page which is being converted.
> From this angle it is fine for IOMMUFD to receive a notification from
> guest memfd when such a conversion happens.

Right, I think the expectation is if a guest has active DMA on a page
it is changing between shared/private there is no expectation that the
DMA will succeed. So we don't need page fault, we just need to allow
it to safely fail.

IMHO we should try to do as best we can here, and the ideal interface
would be a notifier to switch the shared/private pages in some portion
of the guestmemfd. With the idea that iommufd could perhaps do it
atomically.

When the notifier returns then the old pages are fenced off at the
HW.  

But this would have to be a sleepable notifier that can do memory
allocation.

It is actually pretty complicated and we will need a reliable cut
operation to make this work on AMD v1.

Jason