Re: DMA-heap driver hints

James Jones <jajones@xxxxxxxxxx> · Wed, 25 Jan 2023 15:20:21 -0800

On 1/24/23 15:14, T.J. Mercier wrote:
On Mon, Jan 23, 2023 at 11:49 PM Christian König
<christian.koenig@xxxxxxx> wrote:

Am 24.01.23 um 04:56 schrieb James Jones:
On 1/23/23 08:58, Laurent Pinchart wrote:
Hi Christian,

On Mon, Jan 23, 2023 at 05:29:18PM +0100, Christian König wrote:
Am 23.01.23 um 14:55 schrieb Laurent Pinchart:
Hi Christian,

CC'ing James as I think this is related to his work on the unix device
memory allocator ([1]).

Thank you for including me.

Sorry for not having you in initially. I wasn't aware of your previous
work in this area.

No worries. I've embarrassingly made no progress here since the last XDC 
talk, so I wouldn't expect everyone to know or remember.

[1]
https://lore.kernel.org/dri-devel/8b555674-1c5b-c791-4547-2ea7c16aee6c@xxxxxxxxxx/

On Mon, Jan 23, 2023 at 01:37:54PM +0100, Christian König wrote:
Hi guys,

this is just an RFC! The last time we discussed the DMA-buf coherency
problem [1] we concluded that DMA-heap first needs a better way to
communicate to userspace which heap to use for a certain device.

As far as I know userspace currently just hard codes that information
which is certainly not desirable considering that we should have this
inside the kernel as well.

So what those two patches here do is to first add some
dma_heap_create_device_link() and dma_heap_remove_device_link()
function and then demonstrating the functionality with uvcvideo
driver.

The preferred DMA-heap is represented with a symlink in sysfs between
the device and the virtual DMA-heap device node.

I'll start with a few high-level comments/questions:

- Instead of tying drivers to heaps, have you considered a system
where
     a driver would expose constraints, and a heap would then be
selected
     based on those constraints ? A tight coupling between heaps and
     drivers means downstream patches to drivers in order to use
     vendor-specific heaps, that sounds painful.

I was wondering the same thing as well, but came to the conclusion that
just the other way around is the less painful approach.

  From a kernel point of view, sure, it's simpler and thus less painful.
  From the point of view of solving the whole issue, I'm not sure :-)

The problem is that there are so many driver specific constrains that I
don't even know where to start from.

That's where I was hoping James would have some feedback for us, based
on the work he did on the Unix device memory allocator. If that's not
the case, we can brainstorm this from scratch.

Simon Ser's and my presentation from XDC 2020 focused entirely on
this. The idea was not to try to enumerate every constraint up front,
but rather to develop an extensible mechanism that would be flexible
enough to encapsulate many disparate types of constraints and perform
set operations on them (merging sets was the only operation we tried
to solve). Simon implemented a prototype header-only library to
implement the mechanism:

https://gitlab.freedesktop.org/emersion/drm-constraints

The links to the presentation and talk are below, along with notes
from the follow-up workshop.

https://lpc.events/event/9/contributions/615/attachments/704/1301/XDC_2020__Allocation_Constraints.pdf

https://www.youtube.com/watch?v=HZEClOP5TIk
https://paste.sr.ht/~emersion/c43b30be08bab1882f1b107402074462bba3b64a

Note one of the hard parts of this was figuring out how to express a
device or heap within the constraint structs. One of the better ideas
proposed back then was something like heap IDs, where dma heaps would
each have one,

We already have that. Each dma_heap has it's own unique name.

Cool.

and devices could register their own heaps (or even just themselves?)
with the heap subsystem and be assigned a locally-unique ID that
userspace could pass around.

I was more considering that we expose some kind of flag noting that a
certain device needs its buffer allocated from that device to utilize
all use cases.

This sounds similar to what you're proposing. Perhaps a reasonable
identifier is a device (major, minor) pair. Such a constraint could be
expressed as a symlink for easy visualization/discoverability from
userspace, but might be easier to serialize over the wire as the
(major, minor) pair. I'm not clear which direction is better to
express this either: As a link from heap->device, or device->heap.

     A constraint-based system would also, I think, be easier to extend
     with additional constraints in the future.

- I assume some drivers will be able to support multiple heaps. How do
     you envision this being implemented ?

I don't really see an use case for this.

One use case I know of here is same-vendor GPU local memory on
different GPUs. NVIDIA GPUs have certain things they can only do on
local memory, certain things they can do on all memory, and certain
things they can only do on memory local to another NVIDIA GPU,
especially when there exists an NVLink interface between the two. So
they'd ideally express different constraints for heap representing
each of those.

I strongly think that exposing this complexity is overkill. We have
pretty much the same on AMD GPUs with XGMI, but this is so vendor
specific that I'm pretty sure we shouldn't have that in a common framework.

I disagree with this. That works for cases where all the devices in 
question for a given operation are owned by one vendor, but falls apart 
as soon as you want to try to allocate memory that works with some USB 
camera as well. Then you've no way to express that, or to even know what 
component to ask.

We should concentrate on solving the problem at hand and not trying to
come up with something to complex to be implementable by everybody.
Extensibility is the key here not getting everything handled in the
initial code drop.

I agree with the part about solving the problems at hand, and don't see 
any harm in adding a hint as you propose as an intermediate step.

However, I think it's worth keeping the harder problems in mind, if only 
to guide our interim solutions and to help us better understand what 
kind of extensibility might be needed to enable future solutions.

Furthermore, while I know many disagree and it has its own risks, I 
think it's often worth solving problems no one has brought to you yet. 
When solutions are developed in as general a manner as possible, rather 
than focusing on a particular use case, it enables others to innovate 
and build as-yet unimagined things without being blocked by a subpar 
interface in some level of the stack where they have no expertise. I've 
been bitten time and again by taking a shortcut because I thought it 
would never affect anyone, only to have someone file a bug a few years 
later explaining why they need the exact thing I left out.

The same thing is often true of memory on remote devices that are at
various points in a PCIe topology. We've had situations where we could
only get enough bandwidth between two PCIe devices when they were less
than some number of hops away on the PCI tree. We hard-coded logic to
detect that in our userspace drivers, but we could instead expose it
as a constraint on heaps that would express which devices can
accomplish certain operations as pairs.

Similarly to the last one, I would assume (But haven't yet run into in
my personal experience) similar limitations arise when you have a NUMA
memory configuration, if you had a heap representing each NUMA node or
something, some might have more coherency than others, or might have
different bandwidth limitations that you could express through
something like device tree, etc. This is more speculative, but seems
like a generalization of the above two cases.

We do have some drivers which say: for this use case you can use
whatever you want, but for that use case you need to use specific
memory
(scan out on GPUs for example works like this).

But those specific use cases are exactly that, very specific. And
exposing all the constrains for them inside a kernel UAPI is a futile
effort (at least for the GPU scan out case). In those situations it's
just better to have the allocator in userspace deal with device
specific
stuff.

While the exact constraints will certainly be device-specific, is that
also true of the type of constraints, or the existence of constraints in
the first place ? To give an example, with a video decoder producing
frames that are then rendered by a GPU, the tiling format that would
produce the best result is device-specific, but the fact that the
decoder can produce a tiled format that would work better for the GPU,
or a non-tiled format for other consumers, is a very common constraint.
I don't think we'll be able to do away completely with the
device-specific code in userspace, but I think we should be able to
expose constraints in a generic-enough way that many simple use cases
will be covered by generic code.

Yes, agreed, the design we proposed took pains to allow
vendor-specific constraints via a general mechanism. We supported both
vendor-specific types of constraints, and vendor-specific values for
general constraints. Some code repository would act as the central
registry of constraint types, similar to the Linux kernel's
drm_fourcc.h for modifiers, or the Khronos github repository for
Vulkan vendor IDs. If the definition needs to be used by the kernel,
the kernel is the logical repository for that role IMHO.

In our 2020 discussion, there was some debate over whether the kernel
should expose and/or consume constraints directly, or whether it's
sufficient to expose lower-level mechanisms from the kernel and keep
the translation of constraints to the correct mechanism in userspace.
There are pros/cons to both. I don't know that we bottomed out on that
part of the discussion, and it could be the right answer is some
combination of the two, as suggested below.

What I want to do is to separate the problem. The kernel only provides
the information where to allocate from, figuring out the details like
how many bytes, which format, plane layout etc.. is still the job of
userspace.

Even with UVC, where to allocate memory from will depend on the use
case. If the consumer is a device that doesn't support non-contiguous
DMA, the system heap won't work.

Actually, could you explain why UVC works better with the system heap ?
I'm looking at videobuf2 as an importer, and it doesn't call the dmabuf
as far as I can tell, so cache management provided by the exporter seems
to be bypassed in any case.

What we do have is compatibility between heaps. E.g. a CMA heap is
usually compatible with the system heap or might even be a subset of
another CMA heap. But I wanted to add that as next step to the heaps
framework itself.

- Devices could have different constraints based on particular
     configurations. For instance, a device may require specific memory
     layout for multi-planar YUV formats only (as in allocating the
Y and C
     planes of NV12 from different memory banks). A dynamic API may
thus be
     needed (but may also be very painful to use from userspace).

Uff, good to know. But I'm not sure how to expose stuff like that.

Let's see if James has anything to share with us :-) With a bit of luck
we won't have to start from scratch.

Well, this is the hard example we keep using as a measure of success
for whatever we come up with. I don't know that someone ever sat down
and tried to express this in the mechanism Simon and I proposed in
2020, but allowing the expression of something that complex was
certainly our goal. How to resolve it down to an allocation mechanism,
I believe, was further than we got, but we weren't that well versed in
DMA heaps back then, or at least I wasn't.

What's still missing is certainly matching userspace for this since I
wanted to discuss the initial kernel approach first.

https://git.libcamera.org/libcamera/libcamera.git/ would be a good
place
to prototype userspace support :-)

Thanks for the pointer and the review,

By the way, side question, does anyone know what the status of dma heaps
support is in major distributions ? On my Gentoo box,
/dev/dma_heap/system is 0600 root:root. That's easy to change for a
developer, but not friendly to end-users. If we want to move forward
with dma heaps as standard multimedia allocators (and I would really
like to see that happening), we have to make sure they can be used.

We seem to have reached a world where display (primary nodes) are
carefully guarded, and some mildly trusted group of folks (video) can
access render nodes, but then there's some separate group generally
for camera/video/v4l and whatnot from what I've seen (I don't survey
this stuff that often. I live in my developer bubble). I'm curious
whether the right direction is a broader group that encompasses all of
render nodes, multimedia, and heaps, or if a more segmented design is
right. The latter is probably the better design from first principles,
but might lead to headaches if the permissions diverge.

The main argument is that this memory is not properly accounted, but
this also counts for things like memory file descriptors returned by
memfd_create().

I have proposed multiple times now that we extend the OOM handling to
take memory allocated through file descriptors into account as well, but
I can't find the time to fully upstream this.

T.J. Mercier is working on some memcg based tracking which sounds like
it might resolve this problem as well.

Gosh I hope so. How Android currently does this is with heavy use of
sepolicy to control access to the individual heaps, sometimes even at
a per-application/service level:

raven:/dev/dma_heap # ls -lZ
total 0
cr--r--r-- 1 system audio    u:object_r:dmabuf_heap_device:s0
       248,  15 2023-01-23 16:14 aaudio_capture_heap
cr--r--r-- 1 system audio    u:object_r:dmabuf_heap_device:s0
       248,  14 2023-01-23 16:14 aaudio_playback_heap
cr--r--r-- 1 system graphics u:object_r:faceauth_heap_device:s0
       248,   3 2023-01-23 16:14 faceauth_tpu-secure
cr--r--r-- 1 system graphics u:object_r:faceauth_heap_device:s0
       248,   4 2023-01-23 16:14 faimg-secure
cr--r--r-- 1 system graphics u:object_r:faceauth_heap_device:s0
       248,   7 2023-01-23 16:14 famodel-secure
cr--r--r-- 1 system graphics u:object_r:faceauth_heap_device:s0
       248,   6 2023-01-23 16:14 faprev-secure
cr--r--r-- 1 system graphics u:object_r:faceauth_heap_device:s0
       248,   5 2023-01-23 16:14 farawimg-secure
cr--r--r-- 1 system graphics u:object_r:sensor_direct_heap_device:s0
       248,  13 2023-01-23 16:14 sensor_direct_heap
cr--r--r-- 1 system system   u:object_r:dmabuf_system_heap_device:s0
       248,   9 2023-01-23 16:14 system
cr--r--r-- 1 system system   u:object_r:dmabuf_system_heap_device:s0
       248,  10 2023-01-23 16:14 system-uncached
cr--r--r-- 1 system graphics u:object_r:dmabuf_heap_device:s0
       248,   8 2023-01-23 16:14 tui-secure
cr--r--r-- 1 system drmrpc
u:object_r:dmabuf_system_secure_heap_device:s0  248,   1 2023-01-23
16:14 vframe-secure
cr--r--r-- 1 system drmrpc   u:object_r:dmabuf_heap_device:s0
       248,  11 2023-01-23 16:14 video_system
cr--r--r-- 1 system drmrpc   u:object_r:dmabuf_heap_device:s0
       248,  12 2023-01-23 16:14 video_system-uncached
cr--r--r-- 1 system graphics u:object_r:vscaler_heap_device:s0
       248,   2 2023-01-23 16:14 vscaler-secure
cr--r--r-- 1 system drmrpc
u:object_r:dmabuf_system_secure_heap_device:s0  248,   0 2023-01-23
16:14 vstream-secure

I hope we can get to a point where we don't actually need to protect
anything but the unicorns. One of my earlier attempts involved a
counter for each heap that would allow you to limit each one
individually, but that's not what's proposed now.

Glad someone's minding the details here. This sounds interesting.

Thanks,
-James

Regards,
Christian.

Thanks,
-James

Please take a look and comment.

Thanks,
Christian.

[1]
https://lore.kernel.org/all/11a6f97c-e45f-f24b-8a73-48d5a388a2cc@xxxxxxxxx/T/