Re: [RFC] Experimental DMA-BUF Device Heaps

On 8/23/20 1:46 PM, Laurent Pinchart wrote:
Hi James,

On Sun, Aug 23, 2020 at 01:04:43PM -0700, James Jones wrote:
On 8/20/20 1:15 AM, Ezequiel Garcia wrote:
On Mon, 2020-08-17 at 20:49 -0700, James Jones wrote:
On 8/17/20 8:18 AM, Brian Starkey wrote:
On Sun, Aug 16, 2020 at 02:22:46PM -0300, Ezequiel Garcia wrote:
This heap is basically a wrapper around the DMA API's dma_alloc_attrs(),
which will allocate memory suitable for the given device.

The implementation is mostly a port of the Contiguous Videobuf2
memory allocator (see videobuf2/videobuf2-dma-contig.c)
over to the DMA-BUF Heap interface.

The intention of this allocator is to provide applications
with a more system-agnostic API: the only thing the application
needs to know is which device to get the buffer for.

Whether the buffer is backed by CMA, IOMMU or a DMA Pool
is unknown to the application.

I'm not really expecting this patch to be correct or even
a good idea, but just submitting it to start a discussion on DMA-BUF
heap discovery and negotiation.
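
For context, here's a rough, untested sketch of what allocating from such a
heap could look like from userspace through the existing dma-heap UAPI; the
per-device path name is only this RFC's hypothetical naming:

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/dma-heap.h>

/* Returns a dma-buf fd; whether the backing is CMA, IOMMU-mapped
 * pages or a DMA pool stays invisible to the application. */
static int alloc_from_heap(const char *heap_path, size_t len)
{
        struct dma_heap_allocation_data data;
        int heap_fd, ret;

        heap_fd = open(heap_path, O_RDWR | O_CLOEXEC);
        if (heap_fd < 0)
                return -1;

        memset(&data, 0, sizeof(data));
        data.len = len;
        data.fd_flags = O_RDWR | O_CLOEXEC;

        ret = ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &data);
        close(heap_fd);

        return ret < 0 ? -1 : (int)data.fd;
}

/* e.g. alloc_from_heap("/dev/dma_heap/video0", 4 << 20); the path is
 * made up, whatever naming scheme this RFC settles on would go there. */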


My initial reaction is that dmabuf heaps are meant for allocating buffers
that will be shared across devices, which doesn't fit very well with
having per-device heaps.

For single-device allocations, would using the buffer allocation
functionality of that device's native API be better in most
cases? (Some other possibly relevant discussion at [1])

I can see that this can save some boilerplate for devices that want
to expose private chunks of memory, but might it also lead to 100
aliases for the system's generic coherent memory pool?

I wonder if a set of helpers to allow devices to expose whatever they
want with minimal effort would be better.

I'm rather interested in where this goes, as I was toying with using
some sort of heap ID as a basis for a "device-local" constraint in the
memory constraints proposals Simon and I will be discussing at XDC this
year.  It would be rather elegant if there were one type of heap ID used
universally throughout the kernel that could provide a unique handle for
the shared system memory heap(s), as well as accelerator-local heaps on
fancy NICs, GPUs, NN accelerators, capture devices, etc. so apps could
negotiate a location among themselves.  This patch seems to be a step
towards that in a way, but I agree it would be counterproductive if a
bunch of devices that were using the same underlying system memory ended
up each getting their own heap ID just because they used some SW
framework that worked that way.

Would appreciate it if you could send along a pointer to your BoF if it
happens!

Here it is:

https://linuxplumbersconf.org/event/7/contributions/818/

It would be great to see you there and discuss this,
given I was hoping we could also talk about how to meet
the expectations of a userspace allocator library.

Thanks!  I hadn't registered for LPC and it looks like it's sold out,
but I'll try to watch the live stream.

This is very interesting, in that it looks like we're both trying to
solve roughly the same set of problems but approaching it from different
angles.  From what I gather, your approach is that a "heap" encompasses
all the allocation constraints a device may have.

The approach Simon Ser and I are tossing around so far is somewhat
different, but may potentially leverage dma-buf heaps a bit as well.

Our approach looks more like what I described at XDC a few years ago,
where memory constraints for a given device's usage of an image are
exposed up to applications, which can then somehow perform boolean
intersection/union operations on them to arrive at a common set of
constraints that describe something compatible with all the devices &
usages desired (or fail to do so, and fall back to copying things around
presumably).  I believe this is more flexible than your initial proposal
in that devices often support multiple usages (e.g., different formats,
different proprietary layouts represented by format modifiers, etc.),
and it avoids adding a combinatorial number of heaps to manage that.
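
To make the intersection idea a bit more concrete, here's a purely
hypothetical sketch; none of these types or fields exist in any current
API, they just illustrate the shape of the operation:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constraints one device imposes on one usage of an image. */
struct image_constraints {
        uint32_t pitch_align;      /* required pitch alignment, bytes (power of two) */
        uint32_t offset_align;     /* required plane offset alignment, bytes */
        uint64_t modifier_mask;    /* bitmask of acceptable layouts/modifiers */
        bool     need_contiguous;  /* device has no IOMMU */
};

/* Merge the requirements of two devices; false means no common ground,
 * and the application falls back to copying. */
static bool constraints_intersect(const struct image_constraints *a,
                                  const struct image_constraints *b,
                                  struct image_constraints *out)
{
        /* With power-of-two alignments, the stricter one satisfies both. */
        out->pitch_align = a->pitch_align > b->pitch_align ?
                           a->pitch_align : b->pitch_align;
        out->offset_align = a->offset_align > b->offset_align ?
                            a->offset_align : b->offset_align;
        out->modifier_mask = a->modifier_mask & b->modifier_mask;
        out->need_contiguous = a->need_contiguous || b->need_contiguous;

        return out->modifier_mask != 0;
}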

In my view, heaps are more like blobs of memory that can be allocated
from in various ways to satisfy constraints.  I realize heaps
mean something specific in the dma-buf heap design (specifically,
something closer to an association between an "allocation mechanism" and
"physical memory"), but I hope we don't have massive heap/allocator
mechanism proliferation due to constraints alone.  Perhaps some
constraints, such as contiguous memory or device-local memory, are
properly expressed as a specific heap, but consider the proliferation
implied by even that simple pair of examples: How do you express
contiguous device-local memory?  Do you need to spawn two heaps on the
underlying device-local memory, one for contiguous allocations and one
for non-contiguous allocations?  Seems excessive.
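
Just to illustrate the alternative I'd prefer, if contiguity were
expressed as a per-allocation flag on a single device-local heap, the
heap count wouldn't need to double.  No such flag exists in the
dma-heap UAPI today; the one below is made up:

#include <fcntl.h>
#include <stdbool.h>
#include <sys/ioctl.h>
#include <linux/dma-heap.h>

#define HYPOTHETICAL_HEAP_FLAG_CONTIGUOUS (1ULL << 0)   /* made up */

/* One device-local heap fd, two kinds of allocations. */
static int alloc_device_local(int heap_fd, __u64 len, bool contig)
{
        struct dma_heap_allocation_data data = {
                .len = len,
                .fd_flags = O_RDWR | O_CLOEXEC,
                .heap_flags = contig ? HYPOTHETICAL_HEAP_FLAG_CONTIGUOUS : 0,
        };

        if (ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &data) < 0)
                return -1;

        return data.fd;
}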

Of course, our approach also has downsides and is still being worked on.
For example, it works best in an ideal world where all the allocators
available understand all the constraints that exist.

Shouldn't allocators be decoupled from constraints? In my imagination I
see devices exposing constraints, and allocators exposing parameters,
with a userspace library to reconcile the constraints and produce
allocator parameters from them.

Perhaps another level of abstraction would help. I'll have to think about that.
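
Something along these lines, maybe; a hypothetical sketch of that split,
reusing the made-up image_constraints type from the earlier sketch.
Nothing here is a real API:

#include <stdbool.h>
#include <stdint.h>

/* Allocator-facing parameters a userspace library would produce. */
struct allocator_params {
        uint64_t size;           /* bytes to allocate */
        uint64_t align;          /* starting-address alignment */
        bool     contiguous;     /* physically contiguous required */
        const char *heap_path;   /* which heap/allocator to use */
};

static void reconcile(const struct image_constraints *merged,
                      uint64_t size, struct allocator_params *out)
{
        out->size = size;
        out->align = merged->offset_align;
        out->contiguous = merged->need_contiguous;
        /* A real library would pick the allocator here, e.g. a CMA heap
         * when contiguity is required, the system heap otherwise. */
        out->heap_path = merged->need_contiguous ?
                         "/dev/dma_heap/reserved" /* CMA heap name varies */ :
                         "/dev/dma_heap/system";
}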

However, as far as I can tell, it wouldn't remove the need to somehow
communicate a lot of constraints from multiple engines/devices/etc. to
the allocator (most likely a single allocator; I'd be interested to know
if anyone has a design that effectively uses multiple allocators to
satisfy a single allocation request, but I haven't come up with a good
one).  Either the constraints are used directly as the allocator
parameters, or there's a translation/second level of abstraction, but
either way much of the information needs to make it to the allocator, or
at least determine which allocator to use.  Simple things like pitch and
offset alignment can be handled without help from a kernel-level
allocator, but others, such as cache coherency, physical memory bank
placement, or device-local memory, will need to make it all the way down
to the kernel somehow, I believe.
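
As a trivial example of the purely userspace part, again using the
hypothetical constraint type from the earlier sketch:

#include <stdint.h>

static uint32_t align_up(uint32_t v, uint32_t a)
{
        return (v + a - 1) & ~(a - 1);   /* a must be a power of two */
}

/* Laying out one plane needs only the merged alignment constraints;
 * no kernel allocator is involved at this point. */
static void layout_plane(uint32_t width, uint32_t height, uint32_t cpp,
                         const struct image_constraints *c,
                         uint32_t *pitch, uint32_t *plane_size)
{
        *pitch = align_up(width * cpp, c->pitch_align);
        *plane_size = align_up(*pitch * height, c->offset_align);
}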

Thanks,
-James

Dealing with a
reality where there are probably a handful of allocators, another
handful of userspace libraries and APIs, and still more applications
trying to make use of all this is one of the larger remaining challenges
of the design.

We'll present our work at XDC 2020.  Hope you can check that out as well!

1. https://lore.kernel.org/dri-devel/57062477-30e7-a3de-6723-a50d03a402c4@xxxxxxxx/

Given Plumbers is just a couple of weeks from now, I've submitted
a BoF proposal, as perhaps it would make sense to discuss this live?

Not-signed-off-by: Ezequiel Garcia <ezequiel@xxxxxxxxxxxxx>



