Re: Try to address the DMA-buf coherency problem

Christian König <ckoenig.leichtzumerken@xxxxxxxxx> · Tue, 1 Nov 2022 18:40:43 +0100

On 28.10.22 at 20:47, Daniel Stone wrote:
Hi Christian,

On Fri, 28 Oct 2022 at 18:50, Christian König
<ckoenig.leichtzumerken@xxxxxxxxx> wrote:
On 28.10.22 at 17:46, Nicolas Dufresne wrote:
Though, it's not generically possible to reverse these roles. If you want to do
so, you end up having to do what Android (gralloc) and ChromeOS (minigbm) do,
because you will have to allocate DRM buffers that know about importer-specific
requirements. See link [1] for what it looks like for RK3399, with Motion Vector
size calculation copied from the kernel driver into a userspace lib (arguably
that was available from V4L2 sizeimage, but this is technically difficult to
communicate within the software layers). If you could let the decoder export
(with proper cache management) the non-generic code would not be needed.
Yeah, but I can also reverse the argument:

Getting the parameters for V4L right so that we can share the image is
tricky, but getting the parameters so that the stuff is actually
directly displayable by GPUs is even trickier.

Essentially you need to look at both sides and work out a common ground
for e.g. alignment, pitch, width/height, padding, etc.
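
A minimal sketch of what finding that common ground could look like for a
linear buffer. The `BufferConstraints` fields and the example values are
hypothetical, they are not taken from any real driver, and real negotiation
also has to cover modifiers, placement and more:

```python
from dataclasses import dataclass
from math import lcm


@dataclass(frozen=True)
class BufferConstraints:
    """Hypothetical per-device layout requirements (illustrative only)."""
    pitch_align: int   # row pitch must be a multiple of this many bytes
    height_align: int  # allocated height must be a multiple of this many rows
    min_padding: int   # extra bytes required after the last row


def round_up(value: int, multiple: int) -> int:
    """Round value up to the next multiple of `multiple`."""
    return -(-value // multiple) * multiple


def common_layout(width: int, height: int, bytes_per_pixel: int,
                  a: BufferConstraints, b: BufferConstraints) -> dict:
    """Return a linear layout that satisfies both devices at once."""
    pitch = round_up(width * bytes_per_pixel, lcm(a.pitch_align, b.pitch_align))
    rows = round_up(height, lcm(a.height_align, b.height_align))
    padding = max(a.min_padding, b.min_padding)
    return {"pitch": pitch, "rows": rows, "size": pitch * rows + padding}


# Assumed example values for a decoder and a display engine.
decoder = BufferConstraints(pitch_align=64, height_align=16, min_padding=0)
display = BufferConstraints(pitch_align=256, height_align=1, min_padding=4096)
layout = common_layout(1920, 1080, 4, decoder, display)
```

The point is only that neither side's requirements alone are enough: the
result has to be computed from both.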

Deciding which side to allocate from is just one step in this
process. For example, most dGPUs can't display directly from system
memory altogether, but it is possible to allocate the DMA-buf through
the GPU driver and then write into device memory with P2P PCI transfers.

So as far as I can see switching importer and exporter roles and even
having performant extra fallbacks should be a standard feature of userspace.

Another case where reversing the roles is difficult is the case where you need to
multiplex the streams (let's use a camera to illustrate) and share that with
multiple processes. In these use cases, the DRM importers are volatile, so which
one do you abuse to do the allocation from? In a multimedia server like PipeWire, you
are not really aware if the camera will be used by DRM or not, and if something
"special" is needed in terms of role inversion. It is relatively easy to deal
with matching modifiers, but using downstream (display/gpu) as an exporter is
always difficult (and requires some level of abuse and guessing).
Oh, very good point! Yeah we do have use cases for this where an input
buffer is both displayed as well as encoded.
This is the main issue, yeah.

For a standard media player, they would try to allocate through V4L2
and decode through that into locally-allocated buffers. All they know
is that there's a Wayland server at the other end of a socket
somewhere which will want to import the FD. The server does give you
some hints along the way: it will tell you that importing into a
particular GPU target device is necessary as the ultimate fallback,
and importing into a particular KMS device is preferable as the
optimal path to hit an overlay.

So let's say that the V4L2 client does what you're proposing: it
allocates a buffer chain, schedules a decode into that buffer, and
passes it along to the server to import. The server fails to import
the buffer into the GPU, and tells the client this. The client then
... well, it doesn't know that it needs to allocate within the GPU
instead, but it knows that doing so might be one thing which would
make the request succeed.

But the client is just a video player. It doesn't understand how to
allocate BOs for Panfrost or AMD or etnaviv. So without a universal
allocator (again ...), 'just allocate on the GPU' isn't a useful
response to the client.
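
The retry flow described above can be sketched as follows. This is only an
illustration: `pick_allocator`, the allocator list and the `placement` field
are hypothetical names, and the sketch assumes the client somehow already has
a list of allocators to try, which is exactly the part that is missing today:

```python
def pick_allocator(allocators, server_can_import):
    """Try each allocator in order until the server accepts the buffer.

    allocators: list of (name, allocate) pairs, allocate() returns a buffer.
    server_can_import: callable modelling the compositor's import attempt.
    Returns (name, buffer) on success, or None if every attempt failed.
    """
    for name, allocate in allocators:
        buf = allocate()
        if server_can_import(buf):
            return name, buf
    return None


# Hypothetical allocators; a real client has no generic way to build the
# second one without a universal allocator.
def alloc_from_v4l2():
    return {"placement": "system"}


def alloc_from_gpu():
    return {"placement": "vram"}


def dgpu_server_import(buf):
    # Assumed policy: this server can only scan out from device memory.
    return buf["placement"] == "vram"


result = pick_allocator(
    [("v4l2", alloc_from_v4l2), ("gpu", alloc_from_gpu)],
    dgpu_server_import,
)
```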

Well exactly that's the point I'm raising: The client *must* understand 
that!

See we need to be able to handle all restrictions here, coherency of the 
data is just one of them.

For example the much more important question is the location of the data 
and for this allocating from the V4L2 device is in most cases just not 
going to fly.

The more common case is that you need to allocate from the GPU and then 
import that into the V4L2 device. The background is that all dGPUs I 
know of need the data inside local memory (VRAM) to be able to scan out 
from it.

I fully understand your point about APIs like Vulkan not sensibly
allowing bracketing, and that's fine. On the other hand, a lot of
extant usecases (camera/codec -> GPU/display, GPU -> codec, etc) on
Arm just cannot fulfill complete coherency. On a lot of these
platforms, despite what you might think about the CPU/GPU
capabilities, the bottleneck is _always_ memory bandwidth, so
mandating extra copies is an absolute non-starter, and would instantly
cripple billions of devices. Lucas has been pretty gentle, but to be
more clear, this is not an option and won't be for at least the next
decade.

Well x86 pretty much has the same restrictions.

For example, the scanout buffer is usually in local memory because 
you often scan out at up to 120Hz while your recording is only 30fps and 
most of the time lower resolution.

Pumping all that data 120 times a second over the PCIe bus would just not 
be doable in a lot of use cases.
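
As a rough back-of-the-envelope illustration, assuming a 3840x2160 scanout
buffer at 4 bytes per pixel and a 1920x1080 NV12 capture at 1.5 bytes per
pixel (these resolutions and formats are assumptions for the example, not
figures from this thread):

```python
def stream_bandwidth(width, height, bytes_per_pixel, rate_hz):
    """Bytes per second needed to move every frame across a bus once."""
    return width * height * bytes_per_pixel * rate_hz


# Assumed display: 4K, 32 bits per pixel, refreshed at 120 Hz.
scanout = stream_bandwidth(3840, 2160, 4, 120)

# Assumed capture: 1080p NV12 (12 bits per pixel) at 30 fps.
capture = stream_bandwidth(1920, 1080, 1.5, 30)

ratio = scanout / capture

print(f"scanout: {scanout / 1e9:.2f} GB/s")
print(f"capture: {capture / 1e6:.2f} MB/s")
print(f"scanout needs about {ratio:.0f}x the bandwidth of the capture")
```

Under those assumptions, moving the captured frame into local memory once is
far cheaper than scanning out over the bus on every refresh.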

So we obviously need a third way at this point, because 'all devices
must always be coherent' vs. 'cache must be an unknown' can't work.
How about this as a suggestion: we have some unused flags in the PRIME
ioctls. Can we add a flag for 'import must be coherent'?

That's pretty much exactly what my patch set does. It just keeps 
userspace out of the way and says that creating the initial connection 
between the devices fails if they can't talk directly with each other.

Maybe we should move that into userspace so that the involved components 
know beforehand that a certain approach won't work?

That flag wouldn't be set for the existing ecosystem
Lucas/Nicolas/myself are talking about, where we have explicit
handover points and users are fully able to perform cache maintenance.
For newer APIs where it's not possible to properly express that
bracketing, they would always set that flag (unless we add an API
carve-out where the client promises to do whatever is required to
maintain that).
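
A toy model of the proposed semantics, to make the two cases concrete. The
flag name `IMPORT_MUST_BE_COHERENT`, the `Device` type and the single
`coherent` bit per device are all hypothetical simplifications; no such flag
exists in the PRIME uAPI today:

```python
from dataclasses import dataclass

# Hypothetical flag bit, not an existing uAPI definition.
IMPORT_MUST_BE_COHERENT = 1 << 0


class NotCoherentError(Exception):
    """Raised when a coherent import was requested but is not possible."""


@dataclass(frozen=True)
class Device:
    name: str
    coherent: bool  # simplified: device DMA is coherent with CPU caches


def import_buffer(exporter: Device, importer: Device, flags: int = 0) -> str:
    """Model the import: fail early only if the caller asked for coherency."""
    if exporter.coherent and importer.coherent:
        return "coherent"
    if flags & IMPORT_MUST_BE_COHERENT:
        raise NotCoherentError(
            f"{exporter.name} -> {importer.name} cannot share coherently")
    # Existing ecosystem: userspace brackets access with cache maintenance.
    return "explicit-sync"


codec = Device("codec", coherent=False)
gpu = Device("gpu", coherent=True)

legacy_mode = import_buffer(codec, gpu)  # flag unset, explicit handover
try:
    import_buffer(codec, gpu, IMPORT_MUST_BE_COHERENT)
    strict_failed = False
except NotCoherentError:
    strict_failed = True
```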

Would that be viable?

No, as I said. Explicit handover points are just an absolute no-go. We 
just have way too many use cases which don't work with that idea.

As I said, we made the same mistake with the DMA API and, more than 20 
years later, are still running into problems because of that.

Just try to run any dGPU under a Xen hypervisor with memory 
fragmentation for a very good example of why this is such a bad idea.

Regards,
Christian.

Cheers,
Daniel