On 08/02/24 06:39, Pekka Paalanen wrote: > On Wed, 7 Feb 2024 16:49:56 +0100 > Louis Chauvet <louis.chauvet@xxxxxxxxxxx> wrote: > >> Hello Pekka, Arthur, Maxime, > > Hi all > >>>>>>>>>>> Change the composition algorithm to iterate over pixels instead of lines. >>>>>>>>>>> It allows a simpler management of rotation and pixel access for complex formats. >>>>>>>>>>> >>>>>>>>>>> This new algorithm allows read_pixel function to have access to x/y >>>>>>>>>>> coordinates and make it possible to read the correct thing in a block >>>>>>>>>>> when block_w and block_h are not 1. >>>>>>>>>>> The iteration pixel-by-pixel in the same method also allows a simpler >>>>>>>>>>> management of rotation with drm_rect_* helpers. This way it's not needed >>>>>>>>>>> anymore to have misterious switch-case distributed in multiple places. >>>>>>>>>> >>>>>>>>>> Hi, >>>>>>>>>> >>>>>>>>>> there was a very good reason to write this code using lines: >>>>>>>>>> performance. Before lines, it was indeed operating on individual pixels. >>>>>>>>>> >>>>>>>>>> Please, include performance measurements before and after this series >>>>>>>>>> to quantify the impact on the previously already supported pixel >>>>>>>>>> formats, particularly the 32-bit-per-pixel RGB variants. >>>>>>>>>> >>>>>>>>>> VKMS will be used more and more in CI for userspace projects, and >>>>>>>>>> performance actually matters there. >>>>>>>>>> >>>>>>>>>> I'm worrying that this performance degradation here is significant. I >>>>>>>>>> believe it is possible to keep blending with lines, if you add new line >>>>>>>>>> getters for reading from rotated, sub-sampled etc. images. That way you >>>>>>>>>> don't have to regress the most common formats' performance. >> >> I tested, and yes, it's significant for most of the tests. None of them >> timed out on my machine, but I agree that I have to improve this. Do you >> know which tests are the more "heavy"? > > I don't, but considering that various userspace projects (e.g. Wayland > compositors) want to use VKMS more and more in their own CI, looking > only at IGT is not enough. Every second saved per run is a tiny bit of > data center energy saved, or developers waiting less for results. > > I do have some expectations that for each KMS property, Wayland > compositors tend to use the "normal" property value more than any other > value. So if you test different pixel formats, you probably set > rotation to normal, since it's completely orthogonal in userspace. And > then you would test different rotations with just one pixel format. > > At least I would personally leave it to IGT to test all the possible > combinations of pixel formats + rotations + odd sizes + odd positions. > Wayland compositor CI wants to test the compositor internals, not VKMS > internals. > >>>>>>>>> While I understand performance is important and should be taken into >>>>>>>>> account seriously, I cannot understand how broken testing could be >>>>>>>>> considered better. Fast but inaccurate will always be significantly >>>>>>>>> less attractive to my eyes. >>>>>>>> >>>>>>>> AFAIK, neither the cover letter nor the commit log claimed it was fixing >>>>>>>> something broken, just that it was "better" (according to what >>>>>>>> criteria?). >> >> Sorry Maxime for this little missunderstanding, I will improve the commit >> message and cover letter for the v2. >> >>>>>>> Today's RGB implementation is only optimized in the line-by-line case >>>>>>> when there is no rotation. The logic is bit convoluted and may possibly >>>>>>> be slightly clarified with a per-format read_line() implementation, >>>>>>> at a very light performance cost. Such an improvement would definitely >>>>>>> benefit to the clarity of the code, especially when transformations >>>>>>> (especially the rotations) come into play because they would be clearly >>>>>>> handled differently instead of being "hidden" in the optimized logic. >>>>>>> Performances would not change much as this path is not optimized today >>>>>>> anyway (the pixel-oriented logic is already used in the rotation case). >> >> [...] >> >>>>>> I think it would, if I understand what you mean. Ever since I proposed >>>>>> a line-by-line algorithm to improve the performance, I was thinking of >>>>>> per-format read_line() functions that would be selected outside of any >>>>>> loops. >> >> [...] >> >>>>>> I haven't looked at VKMS in a long time, and I am disappointed to find >>>>>> that vkms_compose_row() is calling plane->pixel_read() pixel-by-pixel. >>>>>> The reading vfunc should be called with many pixels at a time when the >>>>>> source FB layout allows it. The whole point of the line-based functions >>>>>> was that they repeat the innermost loop in every function body to make >>>>>> the per-pixel overhead as small as possible. The VKMS implementations >>>>>> benchmarked before and after the original line-based algorithm showed >>>>>> that calling a function pointer per-pixel is relatively very expensive. >>>>>> Or maybe it was a switch-case. >> >> [...] >> >>>>> But, I agree with Miquel that the rotation logic is easier to implement >>>>> in a pixel-based way. So going pixel-by-pixel only when rotation occurs >>>>> would be great. >>>> >>>> Yes, and I think that can very well be done in the line-based framework >>>> still that existed in the old days before any rotation support was >>>> added. Essentially a plug-in line-getter function that then calls a >>>> format-specific line-getter pixel-by-pixel while applying the rotation. >>>> It would be simple, it would leave unrotated performance unharmed (use >>>> format-specific line-getter directly with lines), but it might be >>>> somewhat less performant for rotated KMS planes. I suspect that might >>>> be a good compromise. >>>> >>>> Format-specific line-getters could also be parameterized by >>>> pixel-to-pixel offset in bytes. Then they could directly traverse FB >>>> rows forward and backward, and even FB columns. It may or may not have >>>> a penalty compared to the original line-getters, so it would have to >>>> be benchmarked. >>> >>> Oh, actually, since the byte offset depends on format, it might be >>> better to parametrize by direction and compute the offset in the >>> format-specific line-getter before the loop. >>> >> >> I'm currently working on this implementation. The algorithm would look >> like: >> >> void blend(...) { >> for(int y = 0; y < height; y++) { >> for(int plane = 0; plane < nb_planes; plane++) { >> if(planes[plane].read_line && planes[plane].rotation == DRM_ROTATION_0) { > > I would try to drop the rotation check here completely. Instead, when > choosing the function pointer to call here, outside of *all* loops, you > would check the rotation property. If rotation is a no-op, pick the > read_line function directly. If rotation/reflection is needed, pick a > rotation function that will then call read_line function pixel-by-pixel. > > So planes[plane] would have two vfuncs, one with a plain read_line that > assumes normal orientation and can return a line of arbitrary length > from arbitrary x,y position, and another vfunc that this loop here will > call which is either some rotation handling function or just the same > function as the first vfunc. > > The two function pointers might well need different signatures, meaning > you need a simple wrapper for the rotation=normal case too. > > I believe that could result in cleaner code. > >> [...] /* Small common logic to manage REFLECT_X/Y and translations */ >> planes[plane].read_line(....); >> } else { >> [...] /* v1 of my patch, pixel by pixel read */ >> } >> } >> } >> } >> >> where read_line is: >> void read_line(frame_info *src, int y, int x_start, int x_stop, pixel_argb16 *dts[]) >> - y is the line to read (so the caller need to compute the correct offset) >> - x_start/x_stop are the start and stop column, but they may be not >> ordered properly (i.e DRM_REFLECT_X will make x_start greater than >> x_stop) >> - src/dst are source and destination buffers > > This sounds ok. An alternative would be something like > > enum direction { > RIGHT, > LEFT, > UP, > DOWN, > }; > > void read_line(frame_info *src, int start_x, int start_y, enum direction dir, > int count_pixels, pixel_argb16 *dst); > > Based on dir, before the inner loop this function would compute the > byte offset between the pixels to be read. If the format is multiplanar > YUV, it can compute the offset per plane. And the starting pointers per > pixel plane, of course, and one end pointer for the loop stop condition > maybe from dst. > > This might make all the other directions than RIGHT much faster than > calling read_line one pixel at a time to achieve the same. > > Would need to benchmark if this is significantly slower than your > suggestion for dir=RIGHT, though. If it's roughly the same, then it > would probably be worth it. > > >> This way: >> - It's simple to read for the general case (usage of drm_rect_* instead of >> manually rewriting the logic) >> - Each pixel format can be quickly implemented with "pixel-by-pixel" >> methods >> - The performances should be good if no rotation is implied for some >> formats >> >> I also created some helpers for conversions between formats to avoid code >> duplication between pixel and line algorithms (and also between argb and >> xrgb variants). >> >> The only flaw with this, is that most of the read_line functions will >> look like: >> >> void read_line(...) { >> int increment = x_start < x_stop ? 1: -1; >> while(x_start != x_stop) { >> out += 1; >> [...] /* color conversion */ >> x_start += increment; >> } >> } >> >> But as Pekka explained, it's probably the most efficient way to do it. > > Yes, I expect them to look roughly like that. It's necessary for moving > as much of the setup computations and real function calls out of the > inner-most loop as possible. The middle (over KMS planes) and outer > (over y) loops are less sensitive to wasted cycles on redundant > computations. > >> Is there a way to save the output of vkms to a folder (something like >> "one image per frame")? It's a bit complex to debug graphics without >> seeing them. >> >> I have something which (I think) should work, but some tests are failing >> and I don't find why in my code (I don't find the reason why the they are >> failing and the hexdump I added to debug seems normal). >> >> I think my issue is a simple mistake in the "translation managment" or >> maybe just a wrong offset, but I don't see my error in the code. I think a >> quick look on the final image would help me a lot. > > I don't know anything about the kernel unit testing frameworks, maybe > they could help? > > But if you drive the test through UAPI from userspace, use a writeback > connector to fetch the resulting image. I'm sure IGT has code doing > that somewhere, although many IGT tests rely on CRC instead because CRC > is more widely available in hardware. > > Arthur's new benchmark seems to be using writeback, you just need to > make it save to file: > https://lore.kernel.org/all/20240207-bench-v1-1-7135ad426860@xxxxxxxxxx/ Hi, I just found that the IGT's test kms_writeback has a flag '--dump' that does that, maybe you can use it or copy the logic to the benchmark. Best Regards, ~Arthur Grillo > > > Thanks, > pq