Hi

On 11.09.23 at 10:38, Pekka Paalanen wrote:

[...]
>> In the kernel, we reduce the copying to the changed parts, if we have
>> damage information from userspace. IDK Mesa's software renderer, but
>> it could certainly apply a similar optimization.
>
> I have already assumed that everything uses damage information to
> optimize the regions to copy. It's still two copies instead of one.
> Actually, it's slightly more than two copies. Due to damage tracking
> and the driver needing to flip between two VRAM buffers, it is always
> copying current + previous damage, not only current damage.
All dumb buffers are in system memory. From there, we always copy into the same region of VRAM. So from the kernel's perspective, it's really just the latest damage.
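As a rough sketch of what "damage information from userspace" looks like in practice: the changed rectangles can be sent through the legacy dirty-FB ioctl as below, or attached to the plane's FB_DAMAGE_CLIPS property in an atomic commit. This is an illustration only; the helper name is made up, while drmModeDirtyFB() and drmModeClip come from libdrm.

/*
 * Sketch only: report the damaged rectangle to the kernel so the driver
 * can limit its sysram-to-VRAM copy (and any format conversion) to that
 * region. flush_damage() is a made-up helper name.
 */
#include <stdint.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

static int flush_damage(int drm_fd, uint32_t fb_id,
                        uint16_t x1, uint16_t y1, uint16_t x2, uint16_t y2)
{
        drmModeClip clip = {
                .x1 = x1, .y1 = y1,     /* top-left, inclusive */
                .x2 = x2, .y2 = y2,     /* bottom-right, exclusive */
        };

        /* Legacy path; atomic userspace would attach the same rectangles
         * as a blob to the plane's FB_DAMAGE_CLIPS property instead. */
        return drmModeDirtyFB(drm_fd, fb_id, &clip, 1);
}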
>>> I suspect the intermediate buffers (dumb buffers in this case) need
>>> to be full frame size rather than a scanline, too. I'm not sure why
>>> the driver would need any scratch buffers beyond a couple dozen bytes
>>> while doing a software conversion, just enough to have the lowest
>>> common multiple of 3 bytes and optimal bus write width.
>>
>> For the conversion in the kernel we allocate enough temporary memory
>> to hold the part of each scanline that changed. Then we go over the
>> dirty scanlines, convert each into the output format, store it in that
>> temporary buffer, and copy the temporary buffer to VRAM. The buffer
>> can be reused for the scanlines of a single conversion. But the nature
>> of the framebuffer (buffers are possibly shared, with concurrent
>> access from multiple planes) makes it hard to keep that temporary
>> memory around. Hence it's freed after each conversion. The code is
>> at [1].
>
> Yes, that's the conversion in the kernel. However, before the kernel
> can run that conversion, userspace must have already allocated
> full-sized dumb buffers to convert its full-sized internal framebuffer
> into. Since KMS is nowadays used with double-buffering, there must
> always be two dumb buffers, which means updating a dumb buffer needs
> to copy not just the current damage but also the previous damage.
> Userspace has no way to know that single-buffering would be equally
> good in this case for this particular driver.
I think that this would be a good optimization. If we let userspace know that a certain GEM BO is a shadow buffer, the compositor could avoid that second dumb buffer entirely. We have plenty of DRM drivers that would benefit from this (grep the DRM code for 'shadow_plane').
It would also allow a format conversion in userspace with little memory overhead compared to the two dumb buffers of the current code.
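For illustration of the idea only: no such interface exists today, and the existing DRM_CAP_DUMB_PREFER_SHADOW answers a different question (whether userspace should keep its own shadow copy). A compositor-side check for a hypothetical "the driver shadows the dumb buffer" hint could look roughly like this; the capability name and value below are invented.

/*
 * Hypothetical sketch: DRM_CAP_DUMB_IS_SHADOWED is NOT real uAPI. It only
 * illustrates the proposal above, i.e. letting userspace discover that the
 * driver copies the dumb buffer into its own shadow/VRAM buffer anyway, so
 * a single dumb buffer (and a single copy of the latest damage) is enough.
 */
#include <stdbool.h>
#include <stdint.h>
#include <xf86drm.h>

#define DRM_CAP_DUMB_IS_SHADOWED 0x7fff /* invented value, not real uAPI */

static bool can_single_buffer_dumb(int drm_fd)
{
        uint64_t shadowed = 0;

        if (drmGetCap(drm_fd, DRM_CAP_DUMB_IS_SHADOWED, &shadowed) != 0)
                return false;   /* unknown: keep double-buffering */

        return shadowed != 0;
}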
> If userspace gave its internal framebuffer to the kernel without doing
> the conversion to RGB888, then that internal buffer would be the dumb
> buffer, and there would be no need to allocate a second full-size
> buffer (third, because of double-buffering towards KMS). The driver
> allocating even full scanlines would be a lot less memory touched if
> userspace didn't convert to RGB888, and if the driver used a tailored
> conversion function (it literally needs to handle only a single
> combination of input and output conditions), it wouldn't need even
> that, nor to allocate and free on every conversion. I understand you
> do not want to pay the cost of having such special-case code, and
> that's ok. It would just be even further optimization after getting
> rid of a static full-sized buffer allocation.
>
>> I assume that a userspace software renderer could do a much better
>> job at keeping the temporary memory allocated.
>
> I'm not sure why the kernel can't keep the temporary scanline buffer
> allocated with the CRTC. It only needs to be reallocated if it's too
> small. Sure, people probably don't want to spend time on such code.
I wanted to make that happen at some point, but it's a bit of a hassle. Either we pass a temporary buffer to the kernel's format-conversion code, or we store it in one of the involved data structures, preferably drm_framebuffer. The former solution complicates the callers and is prone to memory leaks; the latter is complicated because a drm_framebuffer can be used concurrently.
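To make the per-scanline conversion and the "reallocate only if too small" idea concrete, here is a rough, self-contained sketch. It is not the kernel's code (the real helpers live in drivers/gpu/drm/drm_format_helper.c, see [1]); all names are made up, plain realloc() stands in for krealloc(), and the concurrency/ownership question raised above is exactly what a real cache would still have to solve with locking.

/*
 * Rough sketch, under the assumptions stated above: convert only the
 * damaged columns [x1, x2) of the damaged lines [y1, y2) from XRGB8888 to
 * RGB888, one scanline at a time, through a cached line buffer that is
 * grown only when a wider damage rectangle comes along.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct line_cache {
        uint8_t *buf;
        size_t len;
};

static int copy_damage_xrgb8888_to_rgb888(struct line_cache *cache,
                                          uint8_t *dst, size_t dst_pitch,
                                          const uint32_t *src,
                                          size_t src_pitch_px,
                                          unsigned int x1, unsigned int y1,
                                          unsigned int x2, unsigned int y2)
{
        size_t width = x2 - x1;
        size_t need = width * 3;        /* RGB888 is 3 bytes per pixel */

        if (need > cache->len) {
                uint8_t *buf = realloc(cache->buf, need);

                if (!buf)
                        return -1;      /* old buffer stays valid */
                cache->buf = buf;
                cache->len = need;
        }

        for (unsigned int y = y1; y < y2; y++) {
                const uint32_t *sp = src + (size_t)y * src_pitch_px + x1;
                uint8_t *lp = cache->buf;

                for (size_t x = 0; x < width; x++) {
                        uint32_t px = sp[x];

                        *lp++ = px & 0xff;              /* B */
                        *lp++ = (px >> 8) & 0xff;       /* G */
                        *lp++ = (px >> 16) & 0xff;      /* R */
                }
                /* One contiguous write per line into the (slow) target. */
                memcpy(dst + (size_t)y * dst_pitch + (size_t)x1 * 3,
                       cache->buf, need);
        }

        return 0;
}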
> All the above discussion is based on the assumption that dumb buffers
> are allocated in sysram. It's fine to say you don't want to optimize,
> but I want to be clear on the exact trade-off. The trade-offs vary
> greatly depending on each particular use case, which is why I wouldn't
> make a hard rule of no in-kernel conversions, just a rule of thumb,
> since it's *usually* not useful or is actively hurting. Here we are
> talking about XRGB8888, which is already exempt from the rule of thumb,
> with two more special conditions: dumb buffers in sysram, and the
> performance traits of RGB888 on the specific hardware.
I don't like this optimization from an architectural POV, but I also haven't seen that it makes a difference in practice. As I said in another email, the current XRGB-based code is not so slow as to be unusable, even for a full-screen update. And since we're mostly moving mouse cursors around the screen, screen updates are small most of the time.
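For completeness, a sketch of the one hint userspace already has for this trade-off: DRM_CAP_DUMB_PREFER_SHADOW says whether reading back from dumb buffers is expected to be slow (typically VRAM-backed), i.e. whether the compositor should render into its own shadow copy and only blit damage into the dumb buffer. It does not answer the single- versus double-buffering question discussed above. The helper name is made up; drmGetCap() and the capability are real.

/*
 * Sketch: query the existing capability. A non-zero value means userspace
 * should keep its own shadow and treat dumb buffers as write-only scanout
 * memory (streaming writes, never read back).
 */
#include <stdbool.h>
#include <stdint.h>
#include <xf86drm.h>

static bool should_use_shadow_buffer(int drm_fd)
{
        uint64_t prefer_shadow = 0;

        if (drmGetCap(drm_fd, DRM_CAP_DUMB_PREFER_SHADOW, &prefer_shadow) != 0)
                return true;    /* unknown: assume dumb-buffer reads are slow */

        return prefer_shadow != 0;
}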
Best regards
Thomas
[1] https://elixir.bootlin.com/linux/v6.5/source/drivers/gpu/drm/drm_format_helper.c#L88
--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Frankenstrasse 146, 90461 Nuernberg, Germany
GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman
HRB 36809 (AG Nuernberg)