On Wed, 10 Nov 2021 13:56:54 -0300 Igor Torrente <igormtorrente@xxxxxxxxx> wrote: > On Tue, Nov 9, 2021 at 8:40 AM Pekka Paalanen <ppaalanen@xxxxxxxxx> wrote: > > > > Hi Igor, > > > > again, that is a really nice speed-up. Unfortunately, I find the code > > rather messy and hard to follow. I hope my comments below help with > > re-designing it to be easier to understand. > > > > > > On Tue, 26 Oct 2021 08:34:06 -0300 > > Igor Torrente <igormtorrente@xxxxxxxxx> wrote: > > > > > Currently the blend function only accepts XRGB_8888 and ARGB_8888 > > > as a color input. > > > > > > This patch refactors all the functions related to the plane composition > > > to overcome this limitation. > > > > > > Now the blend function receives a struct `vkms_pixel_composition_functions` > > > containing two handlers. > > > > > > One will generate a buffer of each line of the frame with the pixels > > > converted to ARGB16161616. And the other will take this line buffer, > > > do some computation on it, and store the pixels in the destination. > > > > > > Both the handlers have the same signature. They receive a pointer to > > > the pixels that will be processed(`pixels_addr`), the number of pixels > > > that will be treated(`length`), and the intermediate buffer of the size > > > of a frame line (`line_buffer`). > > > > > > The first function has been totally described previously. > > > > What does this sentence mean? > > In the sentence "One will generate...", I give an overview of the two types of > handlers. And the overview of the first handler describes the full behavior of > it. > > But it doesn't look clear enough, I will improve it in the future. > > > > > > > > > The second is more interesting, as it has to perform two roles depending > > > on where it is called in the code. > > > > > > The first is to convert(if necessary) the data received in the > > > `line_buffer` and write in the memory pointed by `pixels_addr`. > > > > > > The second role is to perform the `alpha_blend`. So, it takes the pixels > > > in the `line_buffer` and `pixels_addr`, executes the blend, and stores > > > the result back to the `pixels_addr`. > > > > > > The per-line implementation was chosen for performance reasons. > > > The per-pixel functions were having performance issues due to indirect > > > function call overhead. > > > > > > The per-line code trades off memory for execution time. The `line_buffer` > > > allows us to diminish the number of function calls. > > > > > > Results in the IGT test `kms_cursor_crc`: > > > > > > | Frametime | > > > |:---------------:|:---------:|:----------:|:--------:| > > > | implmentation | Current | Per-pixel | Per-line | > > > | frametime range | 8~22 ms | 32~56 ms | 6~19 ms | > > > | Average | 10.0 ms | 35.8 ms | 8.6 ms | > > > > > > Reported-by: kernel test robot <lkp@xxxxxxxxx> > > > Signed-off-by: Igor Torrente <igormtorrente@xxxxxxxxx> > > > --- > > > V2: Improves the performance drastically, by perfoming the operations > > > per-line and not per-pixel(Pekka Paalanen). > > > Minor improvements(Pekka Paalanen). > > > --- > > > drivers/gpu/drm/vkms/vkms_composer.c | 321 ++++++++++++++++----------- > > > drivers/gpu/drm/vkms/vkms_formats.h | 155 +++++++++++++ > > > 2 files changed, 342 insertions(+), 134 deletions(-) > > > create mode 100644 drivers/gpu/drm/vkms/vkms_formats.h > > > > > > diff --git a/drivers/gpu/drm/vkms/vkms_composer.c b/drivers/gpu/drm/vkms/vkms_composer.c > > > index 383ca657ddf7..69fe3a89bdc9 100644 > > > --- a/drivers/gpu/drm/vkms/vkms_composer.c > > > +++ b/drivers/gpu/drm/vkms/vkms_composer.c > > > @@ -9,18 +9,26 @@ > > > #include <drm/drm_vblank.h> > > > > > > #include "vkms_drv.h" > > > - > > > -static u32 get_pixel_from_buffer(int x, int y, const u8 *buffer, > > > - const struct vkms_composer *composer) > > > -{ > > > - u32 pixel; > > > - int src_offset = composer->offset + (y * composer->pitch) > > > - + (x * composer->cpp); > > > - > > > - pixel = *(u32 *)&buffer[src_offset]; > > > - > > > - return pixel; > > > -} > > > +#include "vkms_formats.h" > > > + > > > +#define get_output_vkms_composer(buffer_pointer, composer) \ > > > + ((struct vkms_composer) { \ > > > + .fb = &(struct drm_framebuffer) { \ > > > + .format = &(struct drm_format_info) { \ > > > + .format = DRM_FORMAT_ARGB16161616, \ > > > + }, \ > > > > Is that really how one can initialize a drm_format_info? Does that > > struct not have a lot more fields? Shouldn't you call a function to > > look up the proper struct with all fields populated? > > I did this macro to just fill the necessary fields, and add more of them > as necessary. > > I was implementing something very similar to the algorithm that > you described below. So this macro will not exist in the next version. > > > > > > + }, \ > > > + .map[0].vaddr = (buffer_pointer), \ > > > + .src = (composer)->src, \ > > > + .dst = (composer)->dst, \ > > > + .cpp = sizeof(u64), \ > > > + .pitch = drm_rect_width(&(composer)->dst) * sizeof(u64) \ > > > + }) > > > > Why is this a macro rather than a function? > > I don't have a good answer for that. I'm just more used to these kinds of > initializations using macro instead of function. > > > > > > + > > > +struct vkms_pixel_composition_functions { > > > + void (*get_src_line)(void *pixels_addr, int length, u64 *line_buffer); > > > + void (*set_output_line)(void *pixels_addr, int length, u64 *line_buffer); > > > > I would be a little more comfortable if instead of u64 *line_buffer you > > would have something like > > > > struct line_buffer { > > u16 *row; > > size_t nelem; > > } > > > > so that the functions to be plugged into these function pointers could > > assert that you do not accidentally overflow the array (which would > > imply a code bug in kernel). > > > > One could perhaps go even for: > > > > struct line_pixel { > > u16 r, g, b, a; > > }; > > > > struct line_buffer { > > struct line_pixel *row; > > size_t npixels; > > }; > > If we decide to follow this representation, would it be possible > to calculate the crc in the similar way that is being done currently? > > Something like that: > > crc = crc32_le(crc, line_buffer.row, w * sizeof(line_pixel)); Hi Igor, yes. I think the CRC calculated does not need to be reproducible in userspace, so you can very well compute it from the internal intermediate representation. It also does not need to be portable between architectures, AFAIU. > I mean, If the compiler can decide to put a padding somewhere, it > would mess with the crc value. Right? Padding could mess it up, yes. However, I think in kernel it is a convention to define structs (especially UAPI structs but this is not one) such that there is no implicit padding. So there must be some recommended practises on how to achieve and ensure that. The size of struct line_pixel as defined above is 8 bytes which is a "very round" number, and every field has the same type, so there won't be gaps between fields either. So I think the struct should already be fine and have no padding, but how to make sure it is, I'm not sure what you would do in kernel land. In userspace I would put a static assert to ensure that sizeof(struct line_pixel) = 8. That would be enough, because sizeof counts not just internal implicit padding but also the needed size extension for alignment in an array of those. The accumulated size of the fields individually is 8 bytes, so if the struct size is 8, there cannot be padding. > > > > Because as I mention further down, there is no need for the line buffer > > to use an existing DRM pixel format at all. > > > > All this is fine for me. I will change that to the next patch version. > > > > +}; > > > > > > /** > > > * compute_crc - Compute CRC value on output frame > > > @@ -31,179 +39,222 @@ static u32 get_pixel_from_buffer(int x, int y, const u8 *buffer, > > > * returns CRC value computed using crc32 on the visible portion of > > > * the final framebuffer at vaddr_out > > > */ > > > -static uint32_t compute_crc(const u8 *vaddr, > > > +static uint32_t compute_crc(const __le64 *vaddr, > > > const struct vkms_composer *composer) > > > { > > > - int x, y; > > > - u32 crc = 0, pixel = 0; > > > - int x_src = composer->src.x1 >> 16; > > > - int y_src = composer->src.y1 >> 16; > > > - int h_src = drm_rect_height(&composer->src) >> 16; > > > - int w_src = drm_rect_width(&composer->src) >> 16; > > > - > > > - for (y = y_src; y < y_src + h_src; ++y) { > > > - for (x = x_src; x < x_src + w_src; ++x) { > > > - pixel = get_pixel_from_buffer(x, y, vaddr, composer); > > > - crc = crc32_le(crc, (void *)&pixel, sizeof(u32)); > > > - } > > > - } > > > + int h = drm_rect_height(&composer->dst); > > > + int w = drm_rect_width(&composer->dst); > > > > > > - return crc; > > > + return crc32_le(0, (void *)vaddr, w * h * sizeof(u64)); > > > } > > > > > > -static u8 blend_channel(u8 src, u8 dst, u8 alpha) > > > +static __le16 blend_channel(u16 src, u16 dst, u16 alpha) > > > > This function is doing the OVER operation (Porter-Duff classification) > > assuming pre-multiplied alpha. I think the function name should reflect > > that. At the very least it should somehow note pre-multiplied alpha, > > because KMS property "pixel blend mode" can change that. > > The closest that it has is a comment in the alpha_blend function. > > But, aside from that, `pre_mul_channel_blend` look good to you? That would be fine, or just 'blend_premult'. Later it could get two siblings, blend_none and blend_coverage, to match "pixel blend mode" property. > > > > 'alpha' should be named 'src_alpha'. > > > > > { > > > - u32 pre_blend; > > > - u8 new_color; > > > + u64 pre_blend; > > > > I'm not quite sure if u32 would suffice... max value for src is > > 0xffff * src_alpha / 0xffff = src_alpha. Max value for dst is 0xffff. > > I didn't understand this division. What does the second 0xffff represent? src_alpha is u16, so the divisor is the normalising factor. Channel value and src_alpha are u16 which means they are essentially 0.16 fixed point format. If you multiply the two together as u16, the result would be a 0.32 fixed point format in u32. To get back to 0.16 format, you divide by 0xffff. Actually, this should be obvious, I was just thinking about it too complicated. Since src is pre-multiplied, it follows that src <= src_alpha. If you think in real numbers [0.0, 1.0], it should be easy to see. If src > src_alpha, then it would mean that the original straight color value was out of range (greater than 1.0). > > > > > So we have at max > > > > src_alpha * 0xffff + 0xffff * (0xffff - src_alpha) > > > > Each multiplication independently will fit in u32. > > > > Rearranging we get > > > > src_alpha * 0xffff + 0xffff * 0xffff - 0xffff * src_alpha > > > > which equals > > > > 0xffff * 0xffff > > > > which fits in u32 and does not depend on src_alpha. > > > > So unless I made a mistake, looks like u32 should be enough. On 32-bit > > CPUs it should have speed benefits compared to u64. > > > > > + u16 new_color; > > > > > > - pre_blend = (src * 255 + dst * (255 - alpha)); > > > + pre_blend = (src * 0xffff + dst * (0xffff - alpha)); > > > > 'pre_blend' means "before blending" so maybe a better name here as the > > blending is already done. > > > > I don't have a good name right now, but I will think of something. > > > > > > > - /* Faster div by 255 */ > > > - new_color = ((pre_blend + ((pre_blend + 257) >> 8)) >> 8); > > > + new_color = DIV_ROUND_UP(pre_blend, 0xffff); > > > > > > - return new_color; > > > + return cpu_to_le16(new_color); > > > > What's the thing with cpu_to_le16 here? > > > > I think the temporary line buffers could just be using the cpu-native > > u16 type. There is no DRM format code for that, but we don't need one > > either. This format is not for interoperation with anything else, it's > > just internal here, and the main goals with it are precision and speed. > > > > As such, the temporary line buffers could be simply u16 arrays, so you > > don't need to consider the channel packing into a u64. > > > > This wouldn't cause a problem to calculate the crc in BE machines? I don't think so, because userspace cannot expect CRC values to be portable between machines, drivers or display chips. https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#display-crc-support Thanks to Simon Ser for finding that piece of doc. > > > } > > > > > > > > > > > From here on, I will be removing the diff minus lines from the quoted > > code, because these functions are completely new. > > > > > /** > > > * alpha_blend - alpha blending equation > > > > This is specifically the pre-multiplied alpha blending, so reflect that > > in the function name. > > > > OK, I will use `pre_mul_alpha_blend`. Or something similar. > > > > + * @src_composer: source framebuffer's metadata > > > + * @dst_composer: destination framebuffer's metadata > > > + * @y: The y coodinate(heigth) of the line that will be processed > > > + * @line_buffer: The line with the pixels from src_compositor > > > * > > > * blend pixels using premultiplied blend formula. The current DRM assumption > > > * is that pixel color values have been already pre-multiplied with the alpha > > > * channel values. See more drm_plane_create_blend_mode_property(). Also, this > > > * formula assumes a completely opaque background. > > > + * > > > + * For performance reasons this function also fetches the pixels from the > > > + * destination of the frame line y. > > > + * We use the information that one of the source pixels are in the output > > > + * buffer to fetch it here instead of separate function. And because the > > > + * output format is ARGB16161616, we know that they don't need to be > > > + * converted. > > > + * This save us a indirect function call for each line. > > > > I think this paragraph should be obvious from the type of 'line_buffer' > > parameter and that you are blending src into dst. > > > > > */ > > > +static void alpha_blend(void *pixels_addr, int length, u64 *line_buffer) > > > { > > > + __le16 *output_pixel = pixels_addr; > > > > Aren't you supposed to be writing into line_buffer, not into src? > > > > There is something very strange with the logic here. > > > > In fact, the function signature of the blending function is unexpected. > > A blending function should operate on two line_buffers, not what looks > > like arbitrary buffer pixels. > > > > I think you should forget the old code and design these from scratch. > > You would have three different kinds of functions: > > > > - loading: fetch a row from an image and convert into a line buffer > > - blending: take two line buffers and blend them into one of the line > > buffers > > - storing: convert a line buffer and write it into an image row > > > > I would not coerce these three different operations into less than > > three function pointer types. > > > > To actually run a blending operation between source and destination > > images, you would need four function pointers: > > - loader for source (by pixel format) > > - loader for destination (by pixel format) > > - blender (by chosen blending operation) > > - storing for destination (by pixel format) > > > > Function parameter types should make it obvious whether something is an > > image or row in arbitrary format, or a line buffer in the special > > internal format. > > > > Then the algorithm would work roughly like this: > > > > for each plane: > > for each row: > > load source into lb1 > > load destination into lb2 > > blend lb1 into lb2 > > store lb2 into destination > > > > This is not optimal, you see how destination is repeatedly loaded and > > stored for each plane. So you could swap the loops: > > > > allocate lb1, lb2 with destination width > > for each destination row: > > load destination into lb2 > > > > for each plane: > > load source into lb1 > > blend lb1 into lb2 > > > > store lb2 into destination > > I'm doing something very similar right now, based on comments from the > previous emails. It looks very similar to your pseudocode. > > And this solves several weirdnesses of my code that you commented > throughout this review. > > But I made a decision that I would like to hear your thoughts about it. > > Using your variables, instead of storing the lb2 in the destination, > I'm using it to calculate the CRC in the middle of the compositing loop. > And if necessary, storing/converting the lb2 into the wb buffer. > > So the pseudocode looks like that: > > allocate lb1, lb2 with destination width > for each destination row: > load destination into lb2 > > for each plane: > load source into lb1 > blend lb1 into lb2 > > compute crc of lb2 > > if wb pending > convert and store ib2 to wb buffer > > return crc > > With that we avoid the allocation of the full image buffer. Yes, exactly. Sounds good. > > > > Inside the loop over plane, you need to check if the plane overlaps the > > current destination row at all. If not, continue on the next plane. If > > yes, load source into lb1 and compute the offset into lb2 where it > > needs to be blended. > > Thanks for this tip, this is an optimization that, currently, my code doesn't > have. > > > > > Since we don't support scaling yet, lb1 length will never exceed > > destination width, because there is no need to load plane buffer pixels > > we would not be writing out. > > > > Also "load destination into lb2" could be replaced with just "clear > > lb2" is the old destination contents are to be discarded. Then you also > > don't need the function pointer for "loader for destination". > > > > I think you already had all these ideas, just the execution in code got > > really messy somehow. > > > > > + int i; > > > > > > + for (i = 0; i < length; i++) { > > > + u16 src1_a = line_buffer[i] >> 48; > > > + u16 src1_r = (line_buffer[i] >> 32) & 0xffff; > > > + u16 src1_g = (line_buffer[i] >> 16) & 0xffff; > > > + u16 src1_b = line_buffer[i] & 0xffff; > > > > If you used native u16 array for line buffers, all this arithmetic > > would be unnecessary. > > > > > > > > + u16 src2_r = le16_to_cpu(output_pixel[2]); > > > + u16 src2_g = le16_to_cpu(output_pixel[1]); > > > + u16 src2_b = le16_to_cpu(output_pixel[0]); > > > + > > > + output_pixel[0] = blend_channel(src1_b, src2_b, src1_a); > > > + output_pixel[1] = blend_channel(src1_g, src2_g, src1_a); > > > + output_pixel[2] = blend_channel(src1_r, src2_r, src1_a); > > > + output_pixel[3] = 0xffff; > > > + > > > + output_pixel += 4; > > > + } > > > } > > > > > > /** > > > * @src_composer: source framebuffer's metadata > > > + * @dst_composer: destiny framebuffer's metadata > > > + * @funcs: A struct containing all the composition functions(get_src_line, > > > + * and set_output_pixel) > > > + * @line_buffer: The line with the pixels from src_compositor > > > * > > > + * Using the pixel_blend function passed as parameter, this function blends > > > + * all pixels from src plane into a output buffer (with a blend function > > > + * passed as parameter). > > > + * Information of the output buffer is in the dst_composer parameter > > > + * and the source plane in the src_composer. > > > + * The get_src_line will use the src_composer to get the respective line, > > > + * convert, and return it as ARGB_16161616. > > > + * And finally, the blend function will receive the dst_composer, dst_composer, > > > + * the line y coodinate, and the line buffer. Blend all pixels, and store the > > > + * result in the output. > > > * > > > * TODO: completely clear the primary plane (a = 0xff) before starting to blend > > > * pixel color values > > > */ > > > +static void blend(struct vkms_composer *src_composer, > > > struct vkms_composer *dst_composer, > > > + struct vkms_pixel_composition_functions *funcs, > > > + u64 *line_buffer) > > > { > > > + int i, i_dst; > > > > > > int x_src = src_composer->src.x1 >> 16; > > > int y_src = src_composer->src.y1 >> 16; > > > > > > int x_dst = src_composer->dst.x1; > > > int y_dst = src_composer->dst.y1; > > > + > > > int h_dst = drm_rect_height(&src_composer->dst); > > > + int length = drm_rect_width(&src_composer->dst); > > > > > > int y_limit = y_src + h_dst; > > > + > > > + u8 *src_pixels = packed_pixels_addr(src_composer, x_src, y_src); > > > + u8 *dst_pixels = packed_pixels_addr(dst_composer, x_dst, y_dst); > > > + > > > + int src_next_line_offset = src_composer->pitch; > > > + int dst_next_line_offset = dst_composer->pitch; > > > + > > > + for (i = y_src, i_dst = y_dst; i < y_limit; ++i, i_dst++) { > > > + funcs->get_src_line(src_pixels, length, line_buffer); > > > + funcs->set_output_line(dst_pixels, length, line_buffer); > > > + src_pixels += src_next_line_offset; > > > + dst_pixels += dst_next_line_offset; > > > } > > > } > > > > > > +static void ((*get_line_fmt_transform_function(u32 format)) > > > + (void *pixels_addr, int length, u64 *line_buffer)) > > > { > > > + if (format == DRM_FORMAT_ARGB8888) > > > + return &ARGB8888_to_ARGB16161616; > > > + else if (format == DRM_FORMAT_ARGB16161616) > > > + return &get_ARGB16161616; > > > + else > > > + return &XRGB8888_to_ARGB16161616; > > > +} > > > > > > +static void ((*get_output_line_function(u32 format)) > > > + (void *pixels_addr, int length, u64 *line_buffer)) > > > +{ > > > + if (format == DRM_FORMAT_ARGB8888) > > > + return &convert_to_ARGB8888; > > > + else if (format == DRM_FORMAT_ARGB16161616) > > > + return &convert_to_ARGB16161616; > > > + else > > > + return &convert_to_XRGB8888; > > > +} > > > > > > +static void compose_plane(struct vkms_composer *src_composer, > > > + struct vkms_composer *dst_composer, > > > > I'm confused by the vkms_composer concept. If there is a separate thing > > for source and destination and they are used together, then I don't > > think that thing is a "composer" but some kind of... image structure? > > I didn't create this struct, but I think this is exactly what it represents. > > > "Composer" is what compose_active_planes() does. > > Do you think this struct needs a rename? In the long run, yes. > > > > > + struct vkms_pixel_composition_functions *funcs, > > > + u64 *line_buffer) > > > +{ > > > + u32 src_format = src_composer->fb->format->format; > > > > > > + funcs->get_src_line = get_line_fmt_transform_function(src_format); > > > > > > + blend(src_composer, dst_composer, funcs, line_buffer); > > > > This function is confusing. You get 'funcs' as argument, but you > > overwrite one field and then trust that the other field was already set > > by the caller. The policy of how 'funcs' argument here works is too > > complicated to me. > > > > If you need just one function pointer as argument, then do exactly > > that, and construct the vfunc struct inside this function. > > I think this will be totally solved with the code redesign. I think so too. ... > > > diff --git a/drivers/gpu/drm/vkms/vkms_formats.h b/drivers/gpu/drm/vkms/vkms_formats.h > > > new file mode 100644 > > > index 000000000000..5b850fce69f3 > > > --- /dev/null > > > +++ b/drivers/gpu/drm/vkms/vkms_formats.h > > > @@ -0,0 +1,155 @@ > > > +/* SPDX-License-Identifier: GPL-2.0+ */ > > > + > > > +#ifndef _VKMS_FORMATS_H_ > > > +#define _VKMS_FORMATS_H_ > > > + > > > +#include <drm/drm_rect.h> > > > + > > > +#define pixel_offset(composer, x, y) \ > > > + ((composer)->offset + ((y) * (composer)->pitch) + ((x) * (composer)->cpp)) > > > > Why macro instead of a static inline function? > > Again, I don't have a good answer for that :( I would recommend to use a static inline function always when possible, and macros only when an inline function cannot work. The reason is that an inline function has types in its signature so you get some type safety, and it cannot accidentally mess up other variables in the call sites. A function also cannot "secretly" use variables from the call site like a macro can, so the reader can be sure that the function call will not access anything not listed in the parameters. > > > + > > > +/* > > > + * packed_pixels_addr - Get the pointer to pixel of a given pair of coordinates > > > + * > > > + * @composer: Buffer metadata > > > + * @x: The x(width) coordinate of the 2D buffer > > > + * @y: The y(Heigth) coordinate of the 2D buffer > > > + * > > > + * Takes the information stored in the composer, a pair of coordinates, and > > > + * returns the address of the first color channel. > > > + * This function assumes the channels are packed together, i.e. a color channel > > > + * comes immediately after another. And therefore, this function doesn't work > > > + * for YUV with chroma subsampling (e.g. YUV420 and NV21). > > > + */ > > > +static void *packed_pixels_addr(struct vkms_composer *composer, int x, int y) > > > > Is it normal in the kernel to have non-inline functions in headers? > > > > Actually this file does not look like a header at all, it should > > probably be a .c file and not #included. > > Oops. This should not be that way. I will fix it. While you do that, I wonder if it makes sense to put the functions like get_line_fmt_transform_function() in this file as well, so you only need to expose the getters, and the implementations can remain static functions. Thanks, pq
Attachment:
pgpw0HhDvtl_f.pgp
Description: OpenPGP digital signature