Hi

On 11.09.23 at 10:38, Pekka Paalanen wrote:

[...]
>> In the kernel, we reduce the copying to the changed parts, if we have
>> damage information from userspace. IDK Mesa's software renderer, but
>> it could certainly apply a similar optimization.
>
> I have already assumed that everything uses damage information to
> optimize the regions to copy. It's still two copies instead of one.
> Actually, it's slightly more than two copies. Due to damage tracking
> and the driver needing to flip between two VRAM buffers, it is always
> copying current + previous damage, not only current damage.
All dumb buffers are in system memory. From there, we always copy into the same region of VRAM. So from the kernel's perspective, it's really just the latest damage.
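As a rough sketch of what "damage information from userspace" looks like in practice: the changed rectangles can be sent through the legacy dirty-FB ioctl as below, or attached to the plane's FB_DAMAGE_CLIPS property in an atomic commit. This is an illustration only; the helper name is made up, while drmModeDirtyFB() and drmModeClip come from libdrm.

/*
 * Sketch only: report the damaged rectangle to the kernel so the driver
 * can limit its sysram-to-VRAM copy (and any format conversion) to that
 * region. flush_damage() is a made-up helper name.
 */
#include <stdint.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

static int flush_damage(int drm_fd, uint32_t fb_id,
                        uint16_t x1, uint16_t y1, uint16_t x2, uint16_t y2)
{
        drmModeClip clip = {
                .x1 = x1, .y1 = y1,     /* top-left, inclusive */
                .x2 = x2, .y2 = y2,     /* bottom-right, exclusive */
        };

        /* Legacy path; atomic userspace would attach the same rectangles
         * as a blob to the plane's FB_DAMAGE_CLIPS property instead. */
        return drmModeDirtyFB(drm_fd, fb_id, &clip, 1);
}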
>>> I suspect the intermediate buffers (dumb buffers in this case) need
>>> to be full frame size rather than a scanline, too. I'm not sure why
>>> the driver would need any scratch buffers beyond a couple dozen bytes
>>> while doing a software conversion, just enough to have the lowest
>>> common multiple of 3 bytes and optimal bus write width.
>>
>> For the conversion in the kernel we allocate enough temporary memory
>> to hold the part of each scanline that changed. Then we go over the
>> dirty scanlines, convert each into the output format, store it in that
>> temporary buffer, and copy the temporary buffer to VRAM. The buffer
>> can be reused for the scanlines of a single conversion. But the nature
>> of the framebuffer (buffers are possibly shared, with concurrent
>> access from multiple planes) makes it hard to keep that temporary
>> memory around. Hence it's freed after each conversion. The code is
>> at [1].
>
> Yes, that's the conversion in the kernel. However, before the kernel
> can run that conversion, userspace must have already allocated
> full-sized dumb buffers to convert its full-sized internal framebuffer
> into. Since KMS is nowadays used with double-buffering, there must
> always be two dumb buffers, which means updating a dumb buffer needs
> to copy not just the current damage but also the previous damage.
> Userspace has no way to know that single-buffering would be equally
> good in this case for this particular driver.
I think that this would be a good optimization. If we let userspace know that a certain GEM BO is a shadow buffer, the compositor could avoid that second dumb buffer entirely. We have plenty of DRM drivers that would benefit from this (grep the DRM code for 'shadow_plane').
It would also allow a format conversion in userspace with little memory overhead compared to the two dumb buffers of the current code.
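For illustration of the idea only: no such interface exists today, and the existing DRM_CAP_DUMB_PREFER_SHADOW answers a different question (whether userspace should keep its own shadow copy). A compositor-side check for a hypothetical "the driver shadows the dumb buffer" hint could look roughly like this; the capability name and value below are invented.

/*
 * Hypothetical sketch: DRM_CAP_DUMB_IS_SHADOWED is NOT real uAPI. It only
 * illustrates the proposal above, i.e. letting userspace discover that the
 * driver copies the dumb buffer into its own shadow/VRAM buffer anyway, so
 * a single dumb buffer (and a single copy of the latest damage) is enough.
 */
#include <stdbool.h>
#include <stdint.h>
#include <xf86drm.h>

#define DRM_CAP_DUMB_IS_SHADOWED 0x7fff /* invented value, not real uAPI */

static bool can_single_buffer_dumb(int drm_fd)
{
        uint64_t shadowed = 0;

        if (drmGetCap(drm_fd, DRM_CAP_DUMB_IS_SHADOWED, &shadowed) != 0)
                return false;   /* unknown: keep double-buffering */

        return shadowed != 0;
}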
> If userspace gave its internal framebuffer to the kernel without doing
> the conversion to RGB888, then that internal buffer would be the dumb
> buffer, and there would be no need to allocate a second full-size
> buffer (third, because of double-buffering towards KMS). The driver
> allocating even full scanlines would be a lot less memory touched if
> userspace didn't convert to RGB888, and if the driver used a tailored
> conversion function (it literally needs to handle only a single
> combination of input and output conditions), it wouldn't need even
> that, nor to allocate and free on every conversion. I understand you
> do not want to pay the cost of having such special-case code, and
> that's ok. It would just be even further optimization after getting
> rid of a static full-sized buffer allocation.
>
>> I assume that a userspace software renderer could do a much better
>> job at keeping the temporary memory allocated.
>
> I'm not sure why the kernel can't keep the temporary scanline buffer
> allocated with the CRTC. It only needs to be reallocated if it's too
> small. Sure, people probably don't want to spend time on such code.
I wanted to make that happen at some point, but it's a bit of a hassle. Either we pass a temporary buffer to the kernel's format-conversion code, or we store it in one of the involved data structures, preferably drm_framebuffer. The former solution complicates the callers and is prone to memory leaks; the latter is complicated because a drm_framebuffer can be used concurrently.
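To make the per-scanline conversion and the "reallocate only if too small" idea concrete, here is a rough, self-contained sketch. It is not the kernel's code (the real helpers live in drivers/gpu/drm/drm_format_helper.c, see [1]); all names are made up, plain realloc() stands in for krealloc(), and the concurrency/ownership question raised above is exactly what a real cache would still have to solve with locking.

/*
 * Rough sketch, under the assumptions stated above: convert only the
 * damaged columns [x1, x2) of the damaged lines [y1, y2) from XRGB8888 to
 * RGB888, one scanline at a time, through a cached line buffer that is
 * grown only when a wider damage rectangle comes along.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct line_cache {
        uint8_t *buf;
        size_t len;
};

static int copy_damage_xrgb8888_to_rgb888(struct line_cache *cache,
                                          uint8_t *dst, size_t dst_pitch,
                                          const uint32_t *src,
                                          size_t src_pitch_px,
                                          unsigned int x1, unsigned int y1,
                                          unsigned int x2, unsigned int y2)
{
        size_t width = x2 - x1;
        size_t need = width * 3;        /* RGB888 is 3 bytes per pixel */

        if (need > cache->len) {
                uint8_t *buf = realloc(cache->buf, need);

                if (!buf)
                        return -1;      /* old buffer stays valid */
                cache->buf = buf;
                cache->len = need;
        }

        for (unsigned int y = y1; y < y2; y++) {
                const uint32_t *sp = src + (size_t)y * src_pitch_px + x1;
                uint8_t *lp = cache->buf;

                for (size_t x = 0; x < width; x++) {
                        uint32_t px = sp[x];

                        *lp++ = px & 0xff;              /* B */
                        *lp++ = (px >> 8) & 0xff;       /* G */
                        *lp++ = (px >> 16) & 0xff;      /* R */
                }
                /* One contiguous write per line into the (slow) target. */
                memcpy(dst + (size_t)y * dst_pitch + (size_t)x1 * 3,
                       cache->buf, need);
        }

        return 0;
}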
> All the above discussion is based on the assumption that dumb buffers
> are allocated in sysram. It's fine to say you don't want to optimize,
> but I want to be clear on the exact trade-off. The trade-offs vary
> greatly depending on each particular use case, which is why I wouldn't
> make a hard rule of no in-kernel conversions, just a rule of thumb,
> since it's *usually* not useful or is actively hurting. Here we are
> talking about XRGB8888, which is already exempt from the rule of thumb,
> with two more special conditions: dumb buffers in sysram, and the
> performance traits of RGB888 on the specific hardware.
I don't like this optimization from an architectural POV, but I also haven't seen that it makes a difference in practice. As I said in another email, the current XRGB-based code is not so slow as to be unusable, even for a full-screen update. And since we're mostly moving mouse cursors around the screen, screen updates are small most of the time.
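For completeness, a sketch of the one hint userspace already has for this trade-off: DRM_CAP_DUMB_PREFER_SHADOW says whether reading back from dumb buffers is expected to be slow (typically VRAM-backed), i.e. whether the compositor should render into its own shadow copy and only blit damage into the dumb buffer. It does not answer the single- versus double-buffering question discussed above. The helper name is made up; drmGetCap() and the capability are real.

/*
 * Sketch: query the existing capability. A non-zero value means userspace
 * should keep its own shadow and treat dumb buffers as write-only scanout
 * memory (streaming writes, never read back).
 */
#include <stdbool.h>
#include <stdint.h>
#include <xf86drm.h>

static bool should_use_shadow_buffer(int drm_fd)
{
        uint64_t prefer_shadow = 0;

        if (drmGetCap(drm_fd, DRM_CAP_DUMB_PREFER_SHADOW, &prefer_shadow) != 0)
                return true;    /* unknown: assume dumb-buffer reads are slow */

        return prefer_shadow != 0;
}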
Best regards
Thomas
[1] https://elixir.bootlin.com/linux/v6.5/source/drivers/gpu/drm/drm_format_helper.c#L88
--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Frankenstrasse 146, 90461 Nuernberg, Germany
GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman
HRB 36809 (AG Nuernberg)