Hi Pekka, On Thu, Nov 11, 2021 at 11:37 AM Pekka Paalanen <ppaalanen@xxxxxxxxx> wrote: > > On Thu, 11 Nov 2021 11:07:21 -0300 > Igor Torrente <igormtorrente@xxxxxxxxx> wrote: > > > Hi Pekka, > > > > On Thu, Nov 11, 2021 at 6:33 AM Pekka Paalanen <ppaalanen@xxxxxxxxx> wrote: > > > > > > On Wed, 10 Nov 2021 13:56:54 -0300 > > > Igor Torrente <igormtorrente@xxxxxxxxx> wrote: > > > > > > > On Tue, Nov 9, 2021 at 8:40 AM Pekka Paalanen <ppaalanen@xxxxxxxxx> wrote: > > > > > > > > > > Hi Igor, > > > > > > > > > > again, that is a really nice speed-up. Unfortunately, I find the code > > > > > rather messy and hard to follow. I hope my comments below help with > > > > > re-designing it to be easier to understand. > > > > > > > > > > > > > > > On Tue, 26 Oct 2021 08:34:06 -0300 > > > > > Igor Torrente <igormtorrente@xxxxxxxxx> wrote: > > > > > > > > > > > Currently the blend function only accepts XRGB_8888 and ARGB_8888 > > > > > > as a color input. > > > > > > > > > > > > This patch refactors all the functions related to the plane composition > > > > > > to overcome this limitation. > > > > > > > > > > > > Now the blend function receives a struct `vkms_pixel_composition_functions` > > > > > > containing two handlers. > > > > > > > > > > > > One will generate a buffer of each line of the frame with the pixels > > > > > > converted to ARGB16161616. And the other will take this line buffer, > > > > > > do some computation on it, and store the pixels in the destination. > > > > > > > > > > > > Both the handlers have the same signature. They receive a pointer to > > > > > > the pixels that will be processed(`pixels_addr`), the number of pixels > > > > > > that will be treated(`length`), and the intermediate buffer of the size > > > > > > of a frame line (`line_buffer`). > > > > > > > > > > > > The first function has been totally described previously. > > > > > > > > > > What does this sentence mean? > > > > > > > > In the sentence "One will generate...", I give an overview of the two types of > > > > handlers. And the overview of the first handler describes the full behavior of > > > > it. > > > > > > > > But it doesn't look clear enough, I will improve it in the future. > > > > > > > > > > > > > > > > > > > > > The second is more interesting, as it has to perform two roles depending > > > > > > on where it is called in the code. > > > > > > > > > > > > The first is to convert(if necessary) the data received in the > > > > > > `line_buffer` and write in the memory pointed by `pixels_addr`. > > > > > > > > > > > > The second role is to perform the `alpha_blend`. So, it takes the pixels > > > > > > in the `line_buffer` and `pixels_addr`, executes the blend, and stores > > > > > > the result back to the `pixels_addr`. > > > > > > > > > > > > The per-line implementation was chosen for performance reasons. > > > > > > The per-pixel functions were having performance issues due to indirect > > > > > > function call overhead. > > > > > > > > > > > > The per-line code trades off memory for execution time. The `line_buffer` > > > > > > allows us to diminish the number of function calls. > > > > > > > > > > > > Results in the IGT test `kms_cursor_crc`: > > > > > > > > > > > > | Frametime | > > > > > > |:---------------:|:---------:|:----------:|:--------:| > > > > > > | implmentation | Current | Per-pixel | Per-line | > > > > > > | frametime range | 8~22 ms | 32~56 ms | 6~19 ms | > > > > > > | Average | 10.0 ms | 35.8 ms | 8.6 ms | > > > > > > > > > > > > Reported-by: kernel test robot <lkp@xxxxxxxxx> > > > > > > Signed-off-by: Igor Torrente <igormtorrente@xxxxxxxxx> > > > > > > --- > > > > > > V2: Improves the performance drastically, by perfoming the operations > > > > > > per-line and not per-pixel(Pekka Paalanen). > > > > > > Minor improvements(Pekka Paalanen). > > > > > > --- > > > > > > drivers/gpu/drm/vkms/vkms_composer.c | 321 ++++++++++++++++----------- > > > > > > drivers/gpu/drm/vkms/vkms_formats.h | 155 +++++++++++++ > > > > > > 2 files changed, 342 insertions(+), 134 deletions(-) > > > > > > create mode 100644 drivers/gpu/drm/vkms/vkms_formats.h > > > > > > > > > > > > diff --git a/drivers/gpu/drm/vkms/vkms_composer.c b/drivers/gpu/drm/vkms/vkms_composer.c > > > > > > index 383ca657ddf7..69fe3a89bdc9 100644 > > > > > > --- a/drivers/gpu/drm/vkms/vkms_composer.c > > > > > > +++ b/drivers/gpu/drm/vkms/vkms_composer.c > > ... > > > > > > > +struct vkms_pixel_composition_functions { > > > > > > + void (*get_src_line)(void *pixels_addr, int length, u64 *line_buffer); > > > > > > + void (*set_output_line)(void *pixels_addr, int length, u64 *line_buffer); > > > > > > > > > > I would be a little more comfortable if instead of u64 *line_buffer you > > > > > would have something like > > > > > > > > > > struct line_buffer { > > > > > u16 *row; > > > > > size_t nelem; > > > > > } > > > > > > > > > > so that the functions to be plugged into these function pointers could > > > > > assert that you do not accidentally overflow the array (which would > > > > > imply a code bug in kernel). > > > > > > > > > > One could perhaps go even for: > > > > > > > > > > struct line_pixel { > > > > > u16 r, g, b, a; > > > > > }; > > > > > > > > > > struct line_buffer { > > > > > struct line_pixel *row; > > > > > size_t npixels; > > > > > }; > > > > > > > > If we decide to follow this representation, would it be possible > > > > to calculate the crc in the similar way that is being done currently? > > > > > > > > Something like that: > > > > > > > > crc = crc32_le(crc, line_buffer.row, w * sizeof(line_pixel)); > > > > > > Hi Igor, > > > > > > yes. I think the CRC calculated does not need to be reproducible in > > > userspace, so you can very well compute it from the internal > > > intermediate representation. It also does not need to be portable > > > between architectures, AFAIU. > > > > Great! This will make things easier. > > > > > > > > > I mean, If the compiler can decide to put a padding somewhere, it > > > > would mess with the crc value. Right? > > > > > > Padding could mess it up, yes. However, I think in kernel it is a > > > convention to define structs (especially UAPI structs but this is not > > > one) such that there is no implicit padding. So there must be some > > > recommended practises on how to achieve and ensure that. > > > > > > The size of struct line_pixel as defined above is 8 bytes which is a > > > "very round" number, and every field has the same type, so there won't > > > be gaps between fields either. So I think the struct should already be > > > fine and have no padding, but how to make sure it is, I'm not sure what > > > you would do in kernel land. > > > > > > In userspace I would put a static assert to ensure that > > > sizeof(struct line_pixel) = 8. That would be enough, because sizeof > > > counts not just internal implicit padding but also the needed size > > > extension for alignment in an array of those. The accumulated size of > > > the fields individually is 8 bytes, so if the struct size is 8, there > > > cannot be padding. > > > > > > > Apparently the kernel uses a compiler extension in a macro to do this > > kind of struct packing. > > > > include/linux/compiler_attributes.h > > 265:#define __packed __attribute__((__packed__)) > > Hi Igor, > > we do not actually want to force packing, though. > > If there would be padding without packing, then packing may incur a > noticeable speed penalty in accessing the fields. We don't want to risk > that. I understand... > > So I think it's better to just assert that no padding exists instead. > There would be something quite strange going on if there was padding in > this case, but better safe than sorry, because debugging that would be > awful. > OK. I will do that and also test some alternatives. > > Thanks, > pq Thanks, --- Igor M. A. Torrente