RE: [RFC PATCH 12/21] KVM: IOMMUFD: MEMFD: Map private pages

"Tian, Kevin" <kevin.tian@xxxxxxxxx> · Mon, 23 Sep 2024 23:52:19 +0000

> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Tuesday, September 24, 2024 12:03 AM
> 
> On Mon, Sep 23, 2024 at 08:24:40AM +0000, Tian, Kevin wrote:
> > > From: Vishal Annapurve <vannapurve@xxxxxxxxxx>
> > > Sent: Monday, September 23, 2024 2:34 PM
> > >
> > > On Mon, Sep 23, 2024 at 7:36 AM Tian, Kevin <kevin.tian@xxxxxxxxx> wrote:
> > > >
> > > > > From: Vishal Annapurve <vannapurve@xxxxxxxxxx>
> > > > > Sent: Saturday, September 21, 2024 5:11 AM
> > > > >
> > > > > On Sun, Sep 15, 2024 at 11:08 PM Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
> > > > > >
> > > > > > On Fri, Aug 23, 2024 at 11:21:26PM +1000, Alexey Kardashevskiy wrote:
> > > > > > > IOMMUFD calls get_user_pages() for every mapping which will allocate
> > > > > > > shared memory instead of using private memory managed by the KVM and
> > > > > > > MEMFD.
> > > > > >
> > > > > > Please check this series, it is much more how I would expect this to
> > > > > > work. Use the guest memfd directly and forget about kvm in the iommufd
> > > > > > code:
> > > > > >
> > > > > > https://lore.kernel.org/r/1726319158-283074-1-git-send-email-steven.sistare@xxxxxxxxxx
> > > > > >
> > > > > > I would imagine you'd detect the guest memfd when accepting the FD and
> > > > > > then having some different path in the pinning logic to pin and get
> > > > > > the physical ranges out.
> > > > >
> > > > > According to the discussion at KVM microconference around hugepage
> > > > > support for guest_memfd [1], it's imperative that guest private memory
> > > > > is not long term pinned. Ideal way to implement this integration would
> > > > > be to support a notifier that can be invoked by guest_memfd when
> > > > > memory ranges get truncated so that IOMMU can unmap the corresponding
> > > > > ranges. Such a notifier should also get called during memory
> > > > > conversion, it would be interesting to discuss how conversion flow
> > > > > would work in this case.
> > > > >
> > > > > [1] https://lpc.events/event/18/contributions/1764/ (checkout the
> > > > > slide 12 from attached presentation)
> > > > >
> > > >
> > > > Most devices don't support I/O page fault hence can only DMA to long
> > > > term pinned buffers. The notifier might be helpful for in-kernel conversion
> > > > but as a basic requirement there needs a way for IOMMUFD to call into
> > > > guest memfd to request long term pinning for a given range. That is
> > > > how I interpreted "different path" in Jason's comment.
> > >
> > > Policy that is being aimed here:
> > > 1) guest_memfd will pin the pages backing guest memory for all users.
> > > 2) kvm_gmem_get_pfn users will get a locked folio with elevated
> > > refcount when asking for the pfn/page from guest_memfd. Users will
> > > drop the refcount and release the folio lock when they are done
> > > using/installing (e.g. in KVM EPT/IOMMU PT entries) it. This folio
> > > lock is supposed to be held for short durations.
> > > 3) Users can assume the pfn is around until they are notified by
> > > guest_memfd on truncation or memory conversion.
> > >
> > > Step 3 above is already followed by KVM EPT setup logic for CoCo VMs.
> > > TDX VMs especially need to have secure EPT entries always mapped (once
> > > faulted-in) while the guest memory ranges are private.
> >
> > 'faulted-in' doesn't work for device DMAs (w/o IOPF).
> >
> > and above is based on the assumption that CoCo VM will always
> > map/pin the private memory pages until a conversion happens.
> >
> > Conversion is initiated by the guest so ideally the guest is responsible
> > for not leaving any in-flight DMAs to the page which is being converted.
> > From this angle it is fine for IOMMUFD to receive a notification from
> > guest memfd when such a conversion happens.
> 
> Right, I think the expectation is if a guest has active DMA on a page
> it is changing between shared/private there is no expectation that the
> DMA will succeed. So we don't need page fault, we just need to allow
> it to safely fail.
> 
> IMHO we should try to do as best we can here, and the ideal interface
> would be a notifier to switch the shared/private pages in some portion
> of the guestmemfd. With the idea that iommufd could perhaps do it
> atomically.
> 

Yes, atomic replacement is necessary here, as there might be in-flight
DMAs to pages adjacent to the one being converted in the same
1G chunk. Unmap/remap could potentially break those DMAs, since the
adjacent pages would be left unmapped until the remap completes.
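
To illustrate the difference, here is a rough userspace sketch, not
iommufd code. It models one 1G range as a flat array of page-table slots,
which is a simplification (a real IOMMU would more likely hold a single
huge entry that has to be split). Every name in it (toy_iopt,
toy_replace_atomic, toy_replace_unmap_remap, TOY_NR_PTES) is made up for
illustration. Each function returns how many slots a concurrent device
walk would have seen as non-present while the conversion was in progress.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NR_PTES     8      /* hypothetical: stand-in for the slots of one 1G range */
#define TOY_PTE_PRESENT 1ull   /* hypothetical "present" bit */

struct toy_iopt {
	_Atomic uint64_t pte[TOY_NR_PTES];
};

static uint64_t toy_mk_pte(uint64_t pfn)
{
	return (pfn << 12) | TOY_PTE_PRESENT;
}

/* Map every slot to consecutive pfns starting at base_pfn. */
static void toy_map_all(struct toy_iopt *pt, uint64_t base_pfn)
{
	for (size_t i = 0; i < TOY_NR_PTES; i++)
		atomic_store(&pt->pte[i], toy_mk_pte(base_pfn + i));
}

/* Number of slots a device access would fault on right now. */
static size_t toy_count_not_present(struct toy_iopt *pt)
{
	size_t n = 0;

	for (size_t i = 0; i < TOY_NR_PTES; i++)
		if (!(atomic_load(&pt->pte[i]) & TOY_PTE_PRESENT))
			n++;
	return n;
}

/*
 * Atomic replacement: only the converted slot changes, in one exchange.
 * Neighbouring slots stay present the whole time.
 */
static size_t toy_replace_atomic(struct toy_iopt *pt, size_t idx,
				 uint64_t new_pfn)
{
	assert(idx < TOY_NR_PTES);
	atomic_exchange(&pt->pte[idx], toy_mk_pte(new_pfn));
	return toy_count_not_present(pt);
}

/*
 * Unmap/remap: tear down the whole range, then rebuild it with the new
 * pfn in place. The returned count is taken in the window between the
 * two steps, where DMA to any slot in the range would fail.
 */
static size_t toy_replace_unmap_remap(struct toy_iopt *pt, size_t idx,
				      uint64_t base_pfn, uint64_t new_pfn)
{
	size_t window;

	assert(idx < TOY_NR_PTES);
	for (size_t i = 0; i < TOY_NR_PTES; i++)
		atomic_store(&pt->pte[i], 0);
	window = toy_count_not_present(pt);

	toy_map_all(pt, base_pfn);
	atomic_store(&pt->pte[idx], toy_mk_pte(new_pfn));
	return window;
}
```

In this model the atomic path never exposes a non-present neighbour,
while the unmap/remap path exposes all of them for the duration of the
window. That window is what in-flight DMA to the adjacent pages would hit.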