On Mon, Jan 23, 2023 at 11:49 PM Christian König <christian.koenig@xxxxxxx> wrote:
>
> Am 24.01.23 um 04:56 schrieb James Jones:
> > On 1/23/23 08:58, Laurent Pinchart wrote:
> >> Hi Christian,
> >>
> >> On Mon, Jan 23, 2023 at 05:29:18PM +0100, Christian König wrote:
> >>> Am 23.01.23 um 14:55 schrieb Laurent Pinchart:
> >>>> Hi Christian,
> >>>>
> >>>> CC'ing James as I think this is related to his work on the unix device memory allocator ([1]).
> >
> > Thank you for including me.
>
> Sorry for not including you initially. I wasn't aware of your previous work in this area.
>
> >>>> [1] https://lore.kernel.org/dri-devel/8b555674-1c5b-c791-4547-2ea7c16aee6c@xxxxxxxxxx/
> >>>>
> >>>> On Mon, Jan 23, 2023 at 01:37:54PM +0100, Christian König wrote:
> >>>>> Hi guys,
> >>>>>
> >>>>> this is just an RFC! The last time we discussed the DMA-buf coherency problem [1] we concluded that DMA-heap first needs a better way to communicate to userspace which heap to use for a certain device.
> >>>>>
> >>>>> As far as I know userspace currently just hard-codes that information, which is certainly not desirable considering that we should have this inside the kernel as well.
> >>>>>
> >>>>> So what those two patches here do is first add dma_heap_create_device_link() and dma_heap_remove_device_link() functions and then demonstrate the functionality with the uvcvideo driver.
> >>>>>
> >>>>> The preferred DMA-heap is represented with a symlink in sysfs between the device and the virtual DMA-heap device node.
> >>>>
> >>>> I'll start with a few high-level comments/questions:
> >>>>
> >>>> - Instead of tying drivers to heaps, have you considered a system where a driver would expose constraints, and a heap would then be selected based on those constraints? A tight coupling between heaps and drivers means downstream patches to drivers in order to use vendor-specific heaps, which sounds painful.
> >>>
> >>> I was wondering the same thing as well, but came to the conclusion that just the other way around is the less painful approach.
> >>
> >> From a kernel point of view, sure, it's simpler and thus less painful. From the point of view of solving the whole issue, I'm not sure :-)
> >>
> >>> The problem is that there are so many driver-specific constraints that I don't even know where to start from.
> >>
> >> That's where I was hoping James would have some feedback for us, based on the work he did on the Unix device memory allocator. If that's not the case, we can brainstorm this from scratch.
> >
> > Simon Ser's and my presentation from XDC 2020 focused entirely on this. The idea was not to try to enumerate every constraint up front, but rather to develop an extensible mechanism that would be flexible enough to encapsulate many disparate types of constraints and perform set operations on them (merging sets was the only operation we tried to solve). Simon implemented a prototype header-only library to implement the mechanism:
> >
> > https://gitlab.freedesktop.org/emersion/drm-constraints
> >
> > The links to the presentation and talk are below, along with notes from the follow-up workshop.
> >
> > https://lpc.events/event/9/contributions/615/attachments/704/1301/XDC_2020__Allocation_Constraints.pdf
> > https://www.youtube.com/watch?v=HZEClOP5TIk
> > https://paste.sr.ht/~emersion/c43b30be08bab1882f1b107402074462bba3b64a
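To make the "merge the sets" part concrete, the rough shape of the idea is something like the sketch below. To be clear, this is made up for illustration only, it is not the actual drm-constraints API: the type names, the fixed-size set, and the three example constraint types are all invented here.

/* Hypothetical sketch of the set-merge idea (not the drm-constraints API):
 * each device advertises a set of (type, value) constraints, and an
 * allocator merges the sets of every device that will touch the buffer
 * before picking a heap that satisfies the result. */
#include <stdbool.h>
#include <stdint.h>

enum constraint_type {
        CONSTRAINT_CONTIGUOUS,   /* needs physically contiguous memory (bool) */
        CONSTRAINT_ADDRESS_BITS, /* maximum usable DMA address width */
        CONSTRAINT_ALIGNMENT,    /* minimum start-address alignment */
        /* vendor-specific types would come from a central registry,
         * much like drm_fourcc.h modifiers */
};

struct constraint {
        enum constraint_type type;
        uint64_t value;
};

struct constraint_set {
        struct constraint entries[16];
        unsigned int count;
};

/* Merge b into a: keep the stricter value where both sides know the
 * constraint, carry over constraints only one side has, and fail if
 * the merged set would overflow. */
static bool constraint_set_merge(struct constraint_set *a,
                                 const struct constraint_set *b)
{
        for (unsigned int i = 0; i < b->count; i++) {
                const struct constraint *c = &b->entries[i];
                unsigned int j;

                for (j = 0; j < a->count; j++) {
                        if (a->entries[j].type != c->type)
                                continue;

                        switch (c->type) {
                        case CONSTRAINT_CONTIGUOUS:
                                a->entries[j].value |= c->value;    /* either side needs it */
                                break;
                        case CONSTRAINT_ADDRESS_BITS:
                                if (c->value < a->entries[j].value) /* narrower wins */
                                        a->entries[j].value = c->value;
                                break;
                        case CONSTRAINT_ALIGNMENT:
                                if (c->value > a->entries[j].value) /* larger wins */
                                        a->entries[j].value = c->value;
                                break;
                        }
                        break;
                }

                if (j == a->count) {
                        if (a->count == 16)
                                return false;
                        a->entries[a->count++] = *c;
                }
        }
        return true;
}

Whether something of this shape lives in the kernel or purely in userspace is exactly the open question discussed further down.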
> >
> > Note one of the hard parts of this was figuring out how to express a device or heap within the constraint structs. One of the better ideas proposed back then was something like heap IDs, where dma heaps would each have one,
>
> We already have that. Each dma_heap has its own unique name.
>
> > and devices could register their own heaps (or even just themselves?) with the heap subsystem and be assigned a locally-unique ID that userspace could pass around.
>
> I was more thinking of exposing some kind of flag noting that a certain device needs its buffers allocated from that device to cover all use cases.
>
> > This sounds similar to what you're proposing. Perhaps a reasonable identifier is a device (major, minor) pair. Such a constraint could be expressed as a symlink for easy visualization/discoverability from userspace, but might be easier to serialize over the wire as the (major, minor) pair. I'm not clear which direction is better to express this either: as a link from heap->device, or device->heap.
> >
> >>>> A constraint-based system would also, I think, be easier to extend with additional constraints in the future.
> >>>>
> >>>> - I assume some drivers will be able to support multiple heaps. How do you envision this being implemented?
> >>>
> >>> I don't really see a use case for this.
> >
> > One use case I know of here is same-vendor GPU local memory on different GPUs. NVIDIA GPUs have certain things they can only do on local memory, certain things they can do on all memory, and certain things they can only do on memory local to another NVIDIA GPU, especially when there exists an NVLink interface between the two. So they'd ideally express different constraints for heaps representing each of those.
>
> I strongly think that exposing this complexity is overkill. We have pretty much the same on AMD GPUs with XGMI, but this is so vendor-specific that I'm pretty sure we shouldn't have that in a common framework.
>
> We should concentrate on solving the problem at hand and not try to come up with something too complex to be implementable by everybody. Extensibility is the key here, not getting everything handled in the initial code drop.
>
> > The same thing is often true of memory on remote devices that are at various points in a PCIe topology. We've had situations where we could only get enough bandwidth between two PCIe devices when they were less than some number of hops away on the PCI tree. We hard-coded logic to detect that in our userspace drivers, but we could instead expose it as a constraint on heaps that would express which devices can accomplish certain operations as pairs.
> >
> > Similarly to the last one, I would assume (but haven't yet run into it in my personal experience) similar limitations arise when you have a NUMA memory configuration: if you had a heap representing each NUMA node or something, some might have more coherency than others, or different bandwidth limitations that you could express through something like device tree, etc. This is more speculative, but seems like a generalization of the above two cases.
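Coming back to the naming question above: whichever direction the link ends up going, the consumer side could stay pretty dumb. A minimal sketch, assuming the device->heap direction and that the link shows up under the device's sysfs node and points at the heap's sysfs node — the "dma_heap" link name and the example sysfs path are placeholders, not necessarily what the patches actually create:

#include <libgen.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

/* Resolve a device's preferred heap to its /dev node, e.g. for
 * "/sys/class/video4linux/video0/device". Returns 0 on success. */
static int preferred_heap_node(const char *sysfs_dev, char *out, size_t len)
{
        char link[PATH_MAX], target[PATH_MAX];
        ssize_t n;

        /* Placeholder link name; use whatever the final patches expose. */
        snprintf(link, sizeof(link), "%s/dma_heap", sysfs_dev);

        n = readlink(link, target, sizeof(target) - 1);
        if (n < 0)
                return -1;
        target[n] = '\0';

        /* Assuming the link target's basename is the unique heap name,
         * it matches the node under /dev/dma_heap/. */
        snprintf(out, len, "/dev/dma_heap/%s", basename(target));
        return 0;
}

A (major, minor) pair serializes more easily across a protocol, but for local discovery the unique heap name already seems sufficient.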
> >
> >>> We do have some drivers which say: for this use case you can use whatever you want, but for that use case you need to use specific memory (scan out on GPUs for example works like this).
> >>>
> >>> But those specific use cases are exactly that, very specific. And exposing all the constraints for them inside a kernel UAPI is a futile effort (at least for the GPU scan out case). In those situations it's just better to have the allocator in userspace deal with device-specific stuff.
> >>
> >> While the exact constraints will certainly be device-specific, is that also true of the type of constraints, or the existence of constraints in the first place? To give an example, with a video decoder producing frames that are then rendered by a GPU, the tiling format that would produce the best result is device-specific, but the fact that the decoder can produce a tiled format that would work better for the GPU, or a non-tiled format for other consumers, is a very common constraint. I don't think we'll be able to do away completely with the device-specific code in userspace, but I think we should be able to expose constraints in a generic-enough way that many simple use cases will be covered by generic code.
> >
> > Yes, agreed, the design we proposed took pains to allow vendor-specific constraints via a general mechanism. We supported both vendor-specific types of constraints and vendor-specific values for general constraints. Some code repository would act as the central registry of constraint types, similar to the Linux kernel's drm_fourcc.h for modifiers, or the Khronos GitHub repository for Vulkan vendor IDs. If the definition needs to be used by the kernel, the kernel is the logical repository for that role IMHO.
> >
> > In our 2020 discussion, there was some debate over whether the kernel should expose and/or consume constraints directly, or whether it's sufficient to expose lower-level mechanisms from the kernel and keep the translation of constraints to the correct mechanism in userspace. There are pros/cons to both. I don't know that we bottomed out on that part of the discussion, and it could be that the right answer is some combination of the two, as suggested below.
> >
> >>> What I want to do is to separate the problem. The kernel only provides the information where to allocate from; figuring out the details like how many bytes, which format, plane layout, etc. is still the job of userspace.
> >>
> >> Even with UVC, where to allocate memory from will depend on the use case. If the consumer is a device that doesn't support non-contiguous DMA, the system heap won't work.
> >>
> >> Actually, could you explain why UVC works better with the system heap? I'm looking at videobuf2 as an importer, and it doesn't call the dma-buf cache management ops as far as I can tell, so cache management provided by the exporter seems to be bypassed in any case.
> >>
> >>> What we do have is compatibility between heaps. E.g. a CMA heap is usually compatible with the system heap or might even be a subset of another CMA heap. But I wanted to add that as a next step to the heaps framework itself.
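For the "details are still the job of userspace" part: once userspace knows which heap node to use, the allocation itself is just the existing dma-heap ioctl. A minimal sketch (error handling trimmed), e.g. fed with the path returned by the lookup sketched earlier:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/dma-heap.h>

/* Allocate @len bytes from the given heap node and hand back a dma-buf
 * fd, or -1 on failure. */
static int heap_alloc(const char *heap_path, size_t len)
{
        struct dma_heap_allocation_data data = {
                .len = len,
                .fd_flags = O_RDWR | O_CLOEXEC,
        };
        int heap_fd, ret;

        heap_fd = open(heap_path, O_RDONLY | O_CLOEXEC);
        if (heap_fd < 0)
                return -1;

        ret = ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &data);
        close(heap_fd);

        return ret < 0 ? -1 : (int)data.fd;
}

Which is to say the kernel really only needs to answer the "which heap" question; everything from there on already exists.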
> >>>> - Devices could have different constraints based on particular configurations. For instance, a device may require specific memory layout for multi-planar YUV formats only (as in allocating the Y and C planes of NV12 from different memory banks). A dynamic API may thus be needed (but may also be very painful to use from userspace).
> >>>
> >>> Uff, good to know. But I'm not sure how to expose stuff like that.
> >>
> >> Let's see if James has anything to share with us :-) With a bit of luck we won't have to start from scratch.
> >
> > Well, this is the hard example we keep using as a measure of success for whatever we come up with. I don't know that someone ever sat down and tried to express this in the mechanism Simon and I proposed in 2020, but allowing the expression of something that complex was certainly our goal. How to resolve it down to an allocation mechanism, I believe, was further than we got, but we weren't that well versed in DMA heaps back then, or at least I wasn't.
> >
> >>>>> What's still missing is certainly matching userspace for this, since I wanted to discuss the initial kernel approach first.
> >>>>
> >>>> https://git.libcamera.org/libcamera/libcamera.git/ would be a good place to prototype userspace support :-)
> >>>
> >>> Thanks for the pointer and the review,
> >>
> >> By the way, side question, does anyone know what the status of dma heaps support is in major distributions? On my Gentoo box, /dev/dma_heap/system is 0600 root:root. That's easy to change for a developer, but not friendly to end-users. If we want to move forward with dma heaps as standard multimedia allocators (and I would really like to see that happening), we have to make sure they can be used.
> >
> > We seem to have reached a world where display (primary) nodes are carefully guarded, some mildly trusted group of folks (video) can access render nodes, but then there's some separate group generally for camera/video/v4l and whatnot from what I've seen (I don't survey this stuff that often; I live in my developer bubble). I'm curious whether the right direction is a broader group that encompasses all of render nodes, multimedia, and heaps, or if a more segmented design is right. The latter is probably the better design from first principles, but might lead to headaches if the permissions diverge.
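On the distro side, the mechanical part of fixing the 0600 root:root default is presumably just a udev rule, something along the lines of

SUBSYSTEM=="dma_heap", GROUP="video", MODE="0660"

with the group and mode being exactly the policy question raised above. The harder parts are agreeing on who should get access, and the accounting issue below.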
>
> The main argument is that this memory is not properly accounted, but the same is true for things like memory file descriptors returned by memfd_create().
>
> I have proposed multiple times now that we extend the OOM handling to take memory allocated through file descriptors into account as well, but I can't find the time to fully upstream this.
>
> T.J. Mercier is working on some memcg-based tracking which sounds like it might resolve this problem as well.

Gosh I hope so.

How Android currently does this is with heavy use of sepolicy to control access to the individual heaps, sometimes even at a per-application/service level:

raven:/dev/dma_heap # ls -lZ
total 0
cr--r--r-- 1 system audio    u:object_r:dmabuf_heap_device:s0               248, 15 2023-01-23 16:14 aaudio_capture_heap
cr--r--r-- 1 system audio    u:object_r:dmabuf_heap_device:s0               248, 14 2023-01-23 16:14 aaudio_playback_heap
cr--r--r-- 1 system graphics u:object_r:faceauth_heap_device:s0             248,  3 2023-01-23 16:14 faceauth_tpu-secure
cr--r--r-- 1 system graphics u:object_r:faceauth_heap_device:s0             248,  4 2023-01-23 16:14 faimg-secure
cr--r--r-- 1 system graphics u:object_r:faceauth_heap_device:s0             248,  7 2023-01-23 16:14 famodel-secure
cr--r--r-- 1 system graphics u:object_r:faceauth_heap_device:s0             248,  6 2023-01-23 16:14 faprev-secure
cr--r--r-- 1 system graphics u:object_r:faceauth_heap_device:s0             248,  5 2023-01-23 16:14 farawimg-secure
cr--r--r-- 1 system graphics u:object_r:sensor_direct_heap_device:s0        248, 13 2023-01-23 16:14 sensor_direct_heap
cr--r--r-- 1 system system   u:object_r:dmabuf_system_heap_device:s0        248,  9 2023-01-23 16:14 system
cr--r--r-- 1 system system   u:object_r:dmabuf_system_heap_device:s0        248, 10 2023-01-23 16:14 system-uncached
cr--r--r-- 1 system graphics u:object_r:dmabuf_heap_device:s0               248,  8 2023-01-23 16:14 tui-secure
cr--r--r-- 1 system drmrpc   u:object_r:dmabuf_system_secure_heap_device:s0 248,  1 2023-01-23 16:14 vframe-secure
cr--r--r-- 1 system drmrpc   u:object_r:dmabuf_heap_device:s0               248, 11 2023-01-23 16:14 video_system
cr--r--r-- 1 system drmrpc   u:object_r:dmabuf_heap_device:s0               248, 12 2023-01-23 16:14 video_system-uncached
cr--r--r-- 1 system graphics u:object_r:vscaler_heap_device:s0              248,  2 2023-01-23 16:14 vscaler-secure
cr--r--r-- 1 system drmrpc   u:object_r:dmabuf_system_secure_heap_device:s0 248,  0 2023-01-23 16:14 vstream-secure

I hope we can get to a point where we don't actually need to protect anything but the unicorns. One of my earlier attempts involved a counter for each heap that would allow you to limit each one individually, but that's not what's proposed now.

> Regards,
> Christian.
>
> >
> > Thanks,
> > -James
> >
> >>>>> Please take a look and comment.
> >>>>>
> >>>>> Thanks,
> >>>>> Christian.
> >>>>>
> >>>>> [1] https://lore.kernel.org/all/11a6f97c-e45f-f24b-8a73-48d5a388a2cc@xxxxxxxxx/T/