On Mon, Jan 23, 2023 at 11:49 PM Christian König <christian.koenig@xxxxxxx> wrote:
>
> Am 24.01.23 um 04:56 schrieb James Jones:
> > On 1/23/23 08:58, Laurent Pinchart wrote:
> >> Hi Christian,
> >>
> >> On Mon, Jan 23, 2023 at 05:29:18PM +0100, Christian König wrote:
> >>> Am 23.01.23 um 14:55 schrieb Laurent Pinchart:
> >>>> Hi Christian,
> >>>>
> >>>> CC'ing James as I think this is related to his work on the unix device memory allocator ([1]).
> >
> > Thank you for including me.
>
> Sorry for not including you initially. I wasn't aware of your previous work in this area.
>
> >>>> [1] https://lore.kernel.org/dri-devel/8b555674-1c5b-c791-4547-2ea7c16aee6c@xxxxxxxxxx/
> >>>>
> >>>> On Mon, Jan 23, 2023 at 01:37:54PM +0100, Christian König wrote:
> >>>>> Hi guys,
> >>>>>
> >>>>> this is just an RFC! The last time we discussed the DMA-buf coherency problem [1] we concluded that DMA-heap first needs a better way to communicate to userspace which heap to use for a certain device.
> >>>>>
> >>>>> As far as I know userspace currently just hard-codes that information, which is certainly not desirable considering that we should have this inside the kernel as well.
> >>>>>
> >>>>> So what those two patches here do is first add dma_heap_create_device_link() and dma_heap_remove_device_link() functions and then demonstrate the functionality with the uvcvideo driver.
> >>>>>
> >>>>> The preferred DMA-heap is represented with a symlink in sysfs between the device and the virtual DMA-heap device node.
> >>>>
> >>>> I'll start with a few high-level comments/questions:
> >>>>
> >>>> - Instead of tying drivers to heaps, have you considered a system where a driver would expose constraints, and a heap would then be selected based on those constraints? A tight coupling between heaps and drivers means downstream patches to drivers in order to use vendor-specific heaps, which sounds painful.
> >>>
> >>> I was wondering the same thing as well, but came to the conclusion that just the other way around is the less painful approach.
> >>
> >> From a kernel point of view, sure, it's simpler and thus less painful. From the point of view of solving the whole issue, I'm not sure :-)
> >>
> >>> The problem is that there are so many driver-specific constraints that I don't even know where to start from.
> >>
> >> That's where I was hoping James would have some feedback for us, based on the work he did on the Unix device memory allocator. If that's not the case, we can brainstorm this from scratch.
> >
> > Simon Ser's and my presentation from XDC 2020 focused entirely on this. The idea was not to try to enumerate every constraint up front, but rather to develop an extensible mechanism that would be flexible enough to encapsulate many disparate types of constraints and perform set operations on them (merging sets was the only operation we tried to solve). Simon implemented a prototype header-only library to implement the mechanism:
> >
> > https://gitlab.freedesktop.org/emersion/drm-constraints
> >
> > The links to the presentation and talk are below, along with notes from the follow-up workshop.
> >
> > https://lpc.events/event/9/contributions/615/attachments/704/1301/XDC_2020__Allocation_Constraints.pdf
> > https://www.youtube.com/watch?v=HZEClOP5TIk
> > https://paste.sr.ht/~emersion/c43b30be08bab1882f1b107402074462bba3b64a
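To make the "merge the sets" part concrete, the rough shape of the idea is something like the sketch below. To be clear, this is made up for illustration only, it is not the actual drm-constraints API: the type names, the fixed-size set, and the three example constraint types are all invented here.

/* Hypothetical sketch of the set-merge idea (not the drm-constraints API):
 * each device advertises a set of (type, value) constraints, and an
 * allocator merges the sets of every device that will touch the buffer
 * before picking a heap that satisfies the result. */
#include <stdbool.h>
#include <stdint.h>

enum constraint_type {
        CONSTRAINT_CONTIGUOUS,   /* needs physically contiguous memory (bool) */
        CONSTRAINT_ADDRESS_BITS, /* maximum usable DMA address width */
        CONSTRAINT_ALIGNMENT,    /* minimum start-address alignment */
        /* vendor-specific types would come from a central registry,
         * much like drm_fourcc.h modifiers */
};

struct constraint {
        enum constraint_type type;
        uint64_t value;
};

struct constraint_set {
        struct constraint entries[16];
        unsigned int count;
};

/* Merge b into a: keep the stricter value where both sides know the
 * constraint, carry over constraints only one side has, and fail if
 * the merged set would overflow. */
static bool constraint_set_merge(struct constraint_set *a,
                                 const struct constraint_set *b)
{
        for (unsigned int i = 0; i < b->count; i++) {
                const struct constraint *c = &b->entries[i];
                unsigned int j;

                for (j = 0; j < a->count; j++) {
                        if (a->entries[j].type != c->type)
                                continue;

                        switch (c->type) {
                        case CONSTRAINT_CONTIGUOUS:
                                a->entries[j].value |= c->value;    /* either side needs it */
                                break;
                        case CONSTRAINT_ADDRESS_BITS:
                                if (c->value < a->entries[j].value) /* narrower wins */
                                        a->entries[j].value = c->value;
                                break;
                        case CONSTRAINT_ALIGNMENT:
                                if (c->value > a->entries[j].value) /* larger wins */
                                        a->entries[j].value = c->value;
                                break;
                        }
                        break;
                }

                if (j == a->count) {
                        if (a->count == 16)
                                return false;
                        a->entries[a->count++] = *c;
                }
        }
        return true;
}

Whether something of this shape lives in the kernel or purely in userspace is exactly the open question discussed further down.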
> >
> > Note one of the hard parts of this was figuring out how to express a device or heap within the constraint structs. One of the better ideas proposed back then was something like heap IDs, where dma heaps would each have one,
>
> We already have that. Each dma_heap has its own unique name.
>
> > and devices could register their own heaps (or even just themselves?) with the heap subsystem and be assigned a locally-unique ID that userspace could pass around.
>
> I was more thinking of exposing some kind of flag noting that a certain device needs its buffers allocated from that device to cover all use cases.
>
> > This sounds similar to what you're proposing. Perhaps a reasonable identifier is a device (major, minor) pair. Such a constraint could be expressed as a symlink for easy visualization/discoverability from userspace, but might be easier to serialize over the wire as the (major, minor) pair. I'm not clear which direction is better to express this either: as a link from heap->device, or device->heap.
> >
> >>>> A constraint-based system would also, I think, be easier to extend with additional constraints in the future.
> >>>>
> >>>> - I assume some drivers will be able to support multiple heaps. How do you envision this being implemented?
> >>>
> >>> I don't really see a use case for this.
> >
> > One use case I know of here is same-vendor GPU local memory on different GPUs. NVIDIA GPUs have certain things they can only do on local memory, certain things they can do on all memory, and certain things they can only do on memory local to another NVIDIA GPU, especially when there exists an NVLink interface between the two. So they'd ideally express different constraints for heaps representing each of those.
>
> I strongly think that exposing this complexity is overkill. We have pretty much the same on AMD GPUs with XGMI, but this is so vendor-specific that I'm pretty sure we shouldn't have that in a common framework.
>
> We should concentrate on solving the problem at hand and not try to come up with something too complex to be implementable by everybody. Extensibility is the key here, not getting everything handled in the initial code drop.
>
> > The same thing is often true of memory on remote devices that are at various points in a PCIe topology. We've had situations where we could only get enough bandwidth between two PCIe devices when they were less than some number of hops away on the PCI tree. We hard-coded logic to detect that in our userspace drivers, but we could instead expose it as a constraint on heaps that would express which devices can accomplish certain operations as pairs.
> >
> > Similarly to the last one, I would assume (but haven't yet run into it in my personal experience) similar limitations arise when you have a NUMA memory configuration: if you had a heap representing each NUMA node or something, some might have more coherency than others, or different bandwidth limitations that you could express through something like device tree, etc. This is more speculative, but seems like a generalization of the above two cases.
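Coming back to the naming question above: whichever direction the link ends up going, the consumer side could stay pretty dumb. A minimal sketch, assuming the device->heap direction and that the link shows up under the device's sysfs node and points at the heap's sysfs node — the "dma_heap" link name and the example sysfs path are placeholders, not necessarily what the patches actually create:

#include <libgen.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

/* Resolve a device's preferred heap to its /dev node, e.g. for
 * "/sys/class/video4linux/video0/device". Returns 0 on success. */
static int preferred_heap_node(const char *sysfs_dev, char *out, size_t len)
{
        char link[PATH_MAX], target[PATH_MAX];
        ssize_t n;

        /* Placeholder link name; use whatever the final patches expose. */
        snprintf(link, sizeof(link), "%s/dma_heap", sysfs_dev);

        n = readlink(link, target, sizeof(target) - 1);
        if (n < 0)
                return -1;
        target[n] = '\0';

        /* Assuming the link target's basename is the unique heap name,
         * it matches the node under /dev/dma_heap/. */
        snprintf(out, len, "/dev/dma_heap/%s", basename(target));
        return 0;
}

A (major, minor) pair serializes more easily across a protocol, but for local discovery the unique heap name already seems sufficient.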
> >
> >>> We do have some drivers which say: for this use case you can use whatever you want, but for that use case you need to use specific memory (scan out on GPUs for example works like this).
> >>>
> >>> But those specific use cases are exactly that, very specific. And exposing all the constraints for them inside a kernel UAPI is a futile effort (at least for the GPU scan out case). In those situations it's just better to have the allocator in userspace deal with device-specific stuff.
> >>
> >> While the exact constraints will certainly be device-specific, is that also true of the type of constraints, or the existence of constraints in the first place? To give an example, with a video decoder producing frames that are then rendered by a GPU, the tiling format that would produce the best result is device-specific, but the fact that the decoder can produce a tiled format that would work better for the GPU, or a non-tiled format for other consumers, is a very common constraint. I don't think we'll be able to do away completely with the device-specific code in userspace, but I think we should be able to expose constraints in a generic-enough way that many simple use cases will be covered by generic code.
> >
> > Yes, agreed, the design we proposed took pains to allow vendor-specific constraints via a general mechanism. We supported both vendor-specific types of constraints and vendor-specific values for general constraints. Some code repository would act as the central registry of constraint types, similar to the Linux kernel's drm_fourcc.h for modifiers, or the Khronos GitHub repository for Vulkan vendor IDs. If the definition needs to be used by the kernel, the kernel is the logical repository for that role IMHO.
> >
> > In our 2020 discussion, there was some debate over whether the kernel should expose and/or consume constraints directly, or whether it's sufficient to expose lower-level mechanisms from the kernel and keep the translation of constraints to the correct mechanism in userspace. There are pros/cons to both. I don't know that we bottomed out on that part of the discussion, and it could be that the right answer is some combination of the two, as suggested below.
> >
> >>> What I want to do is to separate the problem. The kernel only provides the information where to allocate from; figuring out the details like how many bytes, which format, plane layout, etc. is still the job of userspace.
> >>
> >> Even with UVC, where to allocate memory from will depend on the use case. If the consumer is a device that doesn't support non-contiguous DMA, the system heap won't work.
> >>
> >> Actually, could you explain why UVC works better with the system heap? I'm looking at videobuf2 as an importer, and it doesn't call the dma-buf cache management ops as far as I can tell, so cache management provided by the exporter seems to be bypassed in any case.
> >>
> >>> What we do have is compatibility between heaps. E.g. a CMA heap is usually compatible with the system heap or might even be a subset of another CMA heap. But I wanted to add that as a next step to the heaps framework itself.
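For the "details are still the job of userspace" part: once userspace knows which heap node to use, the allocation itself is just the existing dma-heap ioctl. A minimal sketch (error handling trimmed), e.g. fed with the path returned by the lookup sketched earlier:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/dma-heap.h>

/* Allocate @len bytes from the given heap node and hand back a dma-buf
 * fd, or -1 on failure. */
static int heap_alloc(const char *heap_path, size_t len)
{
        struct dma_heap_allocation_data data = {
                .len = len,
                .fd_flags = O_RDWR | O_CLOEXEC,
        };
        int heap_fd, ret;

        heap_fd = open(heap_path, O_RDONLY | O_CLOEXEC);
        if (heap_fd < 0)
                return -1;

        ret = ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &data);
        close(heap_fd);

        return ret < 0 ? -1 : (int)data.fd;
}

Which is to say the kernel really only needs to answer the "which heap" question; everything from there on already exists.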
> >>>> - Devices could have different constraints based on particular configurations. For instance, a device may require specific memory layout for multi-planar YUV formats only (as in allocating the Y and C planes of NV12 from different memory banks). A dynamic API may thus be needed (but may also be very painful to use from userspace).
> >>>
> >>> Uff, good to know. But I'm not sure how to expose stuff like that.
> >>
> >> Let's see if James has anything to share with us :-) With a bit of luck we won't have to start from scratch.
> >
> > Well, this is the hard example we keep using as a measure of success for whatever we come up with. I don't know that someone ever sat down and tried to express this in the mechanism Simon and I proposed in 2020, but allowing the expression of something that complex was certainly our goal. How to resolve it down to an allocation mechanism, I believe, was further than we got, but we weren't that well versed in DMA heaps back then, or at least I wasn't.
> >
> >>>>> What's still missing is certainly matching userspace for this, since I wanted to discuss the initial kernel approach first.
> >>>>
> >>>> https://git.libcamera.org/libcamera/libcamera.git/ would be a good place to prototype userspace support :-)
> >>>
> >>> Thanks for the pointer and the review,
> >>
> >> By the way, side question, does anyone know what the status of dma heaps support is in major distributions? On my Gentoo box, /dev/dma_heap/system is 0600 root:root. That's easy to change for a developer, but not friendly to end-users. If we want to move forward with dma heaps as standard multimedia allocators (and I would really like to see that happening), we have to make sure they can be used.
> >
> > We seem to have reached a world where display (primary) nodes are carefully guarded, some mildly trusted group of folks (video) can access render nodes, but then there's some separate group generally for camera/video/v4l and whatnot from what I've seen (I don't survey this stuff that often; I live in my developer bubble). I'm curious whether the right direction is a broader group that encompasses all of render nodes, multimedia, and heaps, or if a more segmented design is right. The latter is probably the better design from first principles, but might lead to headaches if the permissions diverge.
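On the distro side, the mechanical part of fixing the 0600 root:root default is presumably just a udev rule, something along the lines of

SUBSYSTEM=="dma_heap", GROUP="video", MODE="0660"

with the group and mode being exactly the policy question raised above. The harder parts are agreeing on who should get access, and the accounting issue below.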
>
> The main argument is that this memory is not properly accounted, but the same is true for things like memory file descriptors returned by memfd_create().
>
> I have proposed multiple times now that we extend the OOM handling to take memory allocated through file descriptors into account as well, but I can't find the time to fully upstream this.
>
> T.J. Mercier is working on some memcg-based tracking which sounds like it might resolve this problem as well.

Gosh I hope so.

How Android currently does this is with heavy use of sepolicy to control access to the individual heaps, sometimes even at a per-application/service level:

raven:/dev/dma_heap # ls -lZ
total 0
cr--r--r-- 1 system audio    u:object_r:dmabuf_heap_device:s0               248, 15 2023-01-23 16:14 aaudio_capture_heap
cr--r--r-- 1 system audio    u:object_r:dmabuf_heap_device:s0               248, 14 2023-01-23 16:14 aaudio_playback_heap
cr--r--r-- 1 system graphics u:object_r:faceauth_heap_device:s0             248,  3 2023-01-23 16:14 faceauth_tpu-secure
cr--r--r-- 1 system graphics u:object_r:faceauth_heap_device:s0             248,  4 2023-01-23 16:14 faimg-secure
cr--r--r-- 1 system graphics u:object_r:faceauth_heap_device:s0             248,  7 2023-01-23 16:14 famodel-secure
cr--r--r-- 1 system graphics u:object_r:faceauth_heap_device:s0             248,  6 2023-01-23 16:14 faprev-secure
cr--r--r-- 1 system graphics u:object_r:faceauth_heap_device:s0             248,  5 2023-01-23 16:14 farawimg-secure
cr--r--r-- 1 system graphics u:object_r:sensor_direct_heap_device:s0        248, 13 2023-01-23 16:14 sensor_direct_heap
cr--r--r-- 1 system system   u:object_r:dmabuf_system_heap_device:s0        248,  9 2023-01-23 16:14 system
cr--r--r-- 1 system system   u:object_r:dmabuf_system_heap_device:s0        248, 10 2023-01-23 16:14 system-uncached
cr--r--r-- 1 system graphics u:object_r:dmabuf_heap_device:s0               248,  8 2023-01-23 16:14 tui-secure
cr--r--r-- 1 system drmrpc   u:object_r:dmabuf_system_secure_heap_device:s0 248,  1 2023-01-23 16:14 vframe-secure
cr--r--r-- 1 system drmrpc   u:object_r:dmabuf_heap_device:s0               248, 11 2023-01-23 16:14 video_system
cr--r--r-- 1 system drmrpc   u:object_r:dmabuf_heap_device:s0               248, 12 2023-01-23 16:14 video_system-uncached
cr--r--r-- 1 system graphics u:object_r:vscaler_heap_device:s0              248,  2 2023-01-23 16:14 vscaler-secure
cr--r--r-- 1 system drmrpc   u:object_r:dmabuf_system_secure_heap_device:s0 248,  0 2023-01-23 16:14 vstream-secure

I hope we can get to a point where we don't actually need to protect anything but the unicorns. One of my earlier attempts involved a counter for each heap that would allow you to limit each one individually, but that's not what's proposed now.

> Regards,
> Christian.
>
> >
> > Thanks,
> > -James
> >
> >>>>> Please take a look and comment.
> >>>>>
> >>>>> Thanks,
> >>>>> Christian.
> >>>>>
> >>>>> [1] https://lore.kernel.org/all/11a6f97c-e45f-f24b-8a73-48d5a388a2cc@xxxxxxxxx/T/