Re: [RFC PATCH 01/12] dma-buf: Introduce dma_buf_get_pfn_unlocked() kAPI

Jason Gunthorpe <jgg@xxxxxxxxxx> · Tue, 14 Jan 2025 13:31:03 -0400

On Tue, Jan 14, 2025 at 03:44:04PM +0100, Simona Vetter wrote:

> E.g. if a compositor gets a dma-buf it assumes that by just binding that
> it will not risk gpu context destruction (unless you're out of memory and
> everything is on fire anyway, and it's ok to die). But if a nasty client
> app supplies a revocable dma-buf, then it can shot down the higher
> priviledged compositor gpu workload with precision. Which is not great, so
> maybe existing dynamic gpu importers should reject revocable dma-buf.
> That's at least what I had in mind as a potential issue.

I see, so it is not that they can't handle a non-present fault it is
just that the non-present effectively turns into a crash of the
context and you want to avoid the crash. It makes sense to me to
negotiate this as part of the API.

> > This is similar to the structure BIO has, and it composes nicely with
> > a future pin_user_pages() and memfd_pin_folios().
> 
> Since you mention pin here, I think that's another aspect of the revocable
> vs dynamic question. Dynamic buffers are expected to sometimes just move
> around for no reason, and importers must be able to cope.

Yes, and we have importers that can tolerate dynamic and those that
can't. Though those that can't tolerate it can often implement revoke.

I view your list as a cascade:
 1) Fully pinned can never be changed so long as the attach is present
 2) Fully pinned, but can be revoked. Revoked is a fatal condition and
    the importer is allowed to experience an error
 3) Fully dynamic and always present. Support for move, and
    restartable fault, is required

Today in RDMA we ask the exporter if it is 1 or 3 and allow different
things. I've seen the GPU side start to offer 1 more often as it has
significant performance wins.

> For recovable exporters/importers I'd expect that movement is not
> happening, meaning it's pinned until the single terminal revocation. And
> maybe I read the kvm stuff wrong, but it reads more like the latter to me
> when crawling through the pfn code.

kvm should be fully faultable and it should be able handle move. It
handles move today using the mmu notifiers after all.

kvm would need to interact with the dmabuf reservations on its page
fault path.

iommufd cannot be faultable and it would only support revoke. For VFIO
revoke would not be fully terminal as VFIO can unrevoke too
(sigh).  If we make revoke special I'd like to eventually include
unrevoke for this reason.

> Once we have the lifetime rules nailed then there's the other issue of how
> to describe the memory, and my take for that is that once the dma-api has
> a clear answer we'll just blindly adopt that one and done.

This is what I hope, we are not there yet, first Leon's series needs
to get merged then we can start on making the DMA API P2P safe without
any struct page. From there it should be clear what direction things
go in.

DMABUF would return pfns annotated with whatever matches the DMA API,
and the importer would be able to inspect the PFNs to learn
information like their P2Pness, CPU mappability or whatever.

I'm pushing for the extra struct, and Christoph has been thinking
about searching a maple tree on the PFN. We need to see what works best.

> And currently with either dynamic attachments and dma_addr_t or through
> fishing the pfn from the cpu pagetables there's some very clearly defined
> lifetime and locking rules (which kvm might get wrong, I've seen some
> discussions fly by where it wasn't doing a perfect job with reflecting pte
> changes, but that was about access attributes iirc). 

Wouldn't surprise me, mmu notifiers are very complex all around. We've
had bugs already where the mm doesn't signal the notifiers at the
right points.

> If we add something
> new, we need clear rules and not just "here's the kvm code that uses it".
> That's how we've done dma-buf at first, and it was a terrible mess of
> mismatched expecations.

Yes, that would be wrong. It should be self defined within dmabuf and
kvm should adopt to it, move semantics and all.

My general desire is to move all of RDMA's MR process away from
scatterlist and work using only the new DMA API. This will save *huge*
amounts of memory in common workloads and be the basis for non-struct
page DMA support, including P2P.

Jason