On Wednesday, 15 January 2025 at 16:03 +0100, Paul Kocialkowski wrote:
> Would be glad to not have to work on the GStreamer side and focus on
> kernel work instead. So far we can already aim to support:
> - Hantro H1
> - Hantro H2/VC8000E
> - Allwinner Video Engine

And Rockchip VEPUs, which have an open-source software implementation in
libMPP. Most of us have access to reference software for the Hantro
variants; I suppose you have reverse-engineered the Allwinner one?

P.S. There are also Imagination stateless codecs, but I have only seen
them on older TI boards.

> > If you'd like to take a bite, this is a good thread to discuss
> > forward. Until the summer, I planned to reach out to Paul, who made
> > this great presentation [1] at FOSDEM last year, and start moving the
> > RFC into using these ideas. One of the biggest discussions is rate
> > control: it is clear to me that modern HW integrates RC offloading,
> > through some HW-specific knobs or even firmware offloading, and this
> > is what Paul has been putting some thought into.
>
> In terms of RC offloading, what I've seen in the Hantro H1 is a
> checkpoint mechanism that allows making per-slice QP adjustments around
> the global picture QP to fit the bill in terms of size. This can be a
> desirable thing if the use case is to stick to a given bitrate
> strictly.
>
> There are also the regions of interest that are supported by many
> (most?) encoders and allow region-based QP changes (typically as an
> offset). The number of available slots is hardware-specific.

Checkpoints seem unique to Hantro, and they have a lot of limitations,
as a checkpoint is a raster set of blocks. It won't perform well with an
important object in the middle of the scene.

> In addition the H1 provides some extra statistics such as the "average"
> resulting QP when one of these methods is used.

Wasn't that statistic MAD (mean absolute difference), which is basically
the average of the residual values?
In my copy of the VC8000E reference software, all of that has been
commented out and the x265 implementation copied over (remember that you
can pay to use their code in proprietary form, before jumping to license
violation conclusions).

> I guess my initial point about rate control was that it would be easier
> for userspace to be able to choose a rate-control strategy directly and
> to have common implementations kernel-side that would apply to all
> codecs. It also allows leveraging hardware features without userspace
> knowing about them.
>
> However the main drawback is that there will always be a need for a
> more specific/advanced use-case than what the kernel is doing (e.g.
> using an NPU), which would need userspace to have more control over the
> encoder.

Which brings us to the most modern form of advanced rate control. You
will find this in DXVA and Vulkan Video. It consists of splitting the
image into an even grid and allowing a QP difference, either as a delta
or as a quality value, for each element in the grid. The size of that
grid is limited by the HW, and you can implement ROI on top of this too.
Though, if the HW has ROI directly, we don't have much option but to
expose it as such, which is fine. A lot of stateful encoders have that
too, and the controls should be the same.

> So a more direct interface would be required to let userspace do
> rate-control. At the end of the day, I think it would make more sense
> to expose these encoders for what they are and deal with the QP and
> features directly through the uAPI and avoid any kernel-side
> rate-control. Hardware-specific features that need to be configured and
> may return stats would just have extra controls for those.
>
> So all in all we'd need a few new controls to configure the encode for
> codecs (starting with H.264) and also some to provide encode stats
> (e.g. requested QP, average QP). It feels like we could benefit from
> existing stateful encoder controls for various bitstream parameters.

Sounds like we should offer both.
As I stated earlier, modern HW resorts to firmware offloading for
performance reasons. In V4L2, this is even more true: if you read
statistics such as MAD or bitstream size on a frame-by-frame basis, then
you will never queue more than one buffer on the capture side, so the
programming latency (including the RC latency) will directly impact the
encoder throughput. With offloading, the statistics can be handled in
firmware, or at least without any context switch, which improves
throughput.

To be fair, the GStreamer implementation we did for the last RFC runs
frame by frame, using the last frame's size as the statistic. We still
reached the IP performance documented in the white paper. Like
everything else, we don't need all of this in a first uAPI, but we need
to define the minimum "required" features.

> Then userspace would be responsible for configuring each encode run
> with a target QP value, picture type and list of references. We'd need
> to also inform userspace of how many references are supported.

The H1 only has 1 reference + 1 long-term reference (of which only the
single reference was implemented). We used the default reference model,
so there was only one way to manage and pass references. There is
clearly a lot more research to be done around reference management.

Nicolas