Re: Hantro H1 Encoding Upstreaming

Hi,

On Wed 15 Jan 25, 15:14, Nicolas Dufresne wrote:
> On Wednesday 15 January 2025 at 16:03 +0100, Paul Kocialkowski wrote:
> > I would be glad not to have to work on the GStreamer side and focus on kernel
> > work instead. So far we can already aim to support:
> > - Hantro H1
> > - Hantro H2/VC8000E
> > - Allwinner Video Engine
> 
> And Rockchip VEPUs, which have an open-source software implementation in
> libMPP. Most of us have access to reference software for the Hantro variants;
> I suppose you have reverse-engineered the Allwinner one?

Ah right, I haven't looked at Rockchip's own encoder implementations for a
while. I guess that's also called RKVENC.

> P.S. There are also Imagination stateless codecs, but I have only seen them on
> older TI boards.

Oh, I didn't know Imagination also made stateless encoders. I was under the
impression that the ones used in the various Jacinto families were stateful.

> > > If you'd like to take a bite, this is a good thread to carry the discussion
> > > forward. Before the summer, I planned to reach out to Paul, who gave this
> > > great presentation [1] at FOSDEM last year, and start moving the RFC toward
> > > these ideas. One of the biggest discussions is rate control; it is clear to
> > > me that modern hardware integrates RC offloading, through some
> > > hardware-specific knobs or even firmware offloading, and this is what Paul
> > > has been putting some thought into.
> > 
> > In terms of RC offloading, what I've seen in the Hantro H1 is a checkpoint
> > mechanism that allows making per-slice QP adjustments around the global
> > picture QP to fit the bill in terms of size. This can be desirable if the
> > use case is to stick strictly to a given bitrate.
> > 
> > There are also the regions of interest that are supported by many (most?)
> > encoders and allow region-based QP changes (typically as an offset). The
> > number of available slots is hardware-specific.
> 
> Checkpoints seem unique to Hantro, and they have a lot of limitations since
> they apply to a raster set of blocks. They won't perform well with an
> important object in the middle of the scene.

Yes, I'm not saying it's particularly useful; I brought it up more as an example
that some hardware will provide such unique/custom features.

> > In addition, the H1 provides some extra statistics, such as the "average"
> > resulting QP when one of these methods is used.
> 
> Wasn't the statistic MAD (mean absolute difference), which is basically the
> average of the residual values? In my copy of the VC8000E reference software,
> all of that has been commented out and the x265 implementation copied over
> (remember that you can pay to use their code in proprietary form, before
> jumping to license-violation conclusions).

Ah yes, you're right! MAD and average QP. Again, I'm not sure how useful it
really is in practice.
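
For illustration, such statistics could be surfaced as read-only controls read
back after each frame. Here is a minimal sketch, where the two V4L2_CID_*
names and their values are invented for the example (no such controls exist
today):

    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <linux/videodev2.h>

    /* Hypothetical control IDs, made up for this sketch. */
    #define V4L2_CID_STATELESS_ENC_STAT_AVG_QP 0x990000
    #define V4L2_CID_STATELESS_ENC_STAT_MAD    0x990001

    static void read_encode_stats(int fd)
    {
        struct v4l2_ext_control ctrls[2] = {
            { .id = V4L2_CID_STATELESS_ENC_STAT_AVG_QP },
            { .id = V4L2_CID_STATELESS_ENC_STAT_MAD },
        };
        struct v4l2_ext_controls arg = {
            .which = V4L2_CTRL_WHICH_CUR_VAL,
            .count = 2,
            .controls = ctrls,
        };

        /* Read the per-frame statistics after dequeuing the capture buffer. */
        if (ioctl(fd, VIDIOC_G_EXT_CTRLS, &arg) == 0)
            printf("avg QP: %d, MAD: %d\n", ctrls[0].value, ctrls[1].value);
    }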

> > I guess my initial point about rate control was that it would be easier for
> > userspace to be able to choose a rate-control strategy directly and to have
> > common implementations kernel-side that would apply to all codecs. It also
> > allows leveraging hardware features without userspace knowing about them.
> > 
> > However, the main drawback is that there will always be a need for a more
> > specific/advanced use case than what the kernel is doing (e.g. involving an
> > NPU), which would require userspace to have more control over the encoder.
> 
> Which brings us to the most modern form of advanced rate control. You will
> find this in DXVA and Vulkan Video. It consists of splitting the image into an
> even grid and allowing delta or qualitative QP adjustments for each element in
> the grid. The size of that grid is limited by hardware, and you can implement
> ROI on top of this too. Though if the hardware has ROI directly, we don't have
> much option but to expose it as such, which is fine. A lot of stateful
> encoders have that too, and the controls should be the same.

Oh that's neat! Thanks for the insight and definitely good to have in mind.
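
To picture it, the grid could be described with something like the struct
below; this is purely a sketch with invented names, loosely mirroring the
quantization-map idea from Vulkan Video, not an existing or proposed uAPI:

    #include <linux/types.h>

    /* Hypothetical per-cell QP adjustment map; the hardware would report
     * the maximum grid dimensions it supports. All names are made up. */
    #define ENC_QP_MAP_MAX_CELLS 256

    struct hypothetical_enc_qp_map {
        __u16 cols;                          /* grid width in cells */
        __u16 rows;                          /* grid height in cells */
        __s8 qp_delta[ENC_QP_MAP_MAX_CELLS]; /* per-cell offset from picture QP */
    };

An ROI control could then be implemented on top of this by rasterizing each
region into the cells it covers.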

> > So a more direct interface would be required to let userspace do rate
> > control. At the end of the day, I think it would make more sense to expose
> > these encoders for what they are, deal with the QP and features directly
> > through the uAPI, and avoid any kernel-side rate control. Hardware-specific
> > features that need to be configured and may return stats would just get
> > extra controls.
> > 
> > So all in all, we'd need a few new controls to configure the encode for each
> > codec (starting with H.264) and also some to provide encode stats (e.g.
> > requested QP, average QP). It feels like we could benefit from existing
> > stateful encoder controls for various bitstream parameters.
> 
> Sounds like we should offer both. As I stated earlier, modern hardware resorts
> to firmware offloading for performance reasons. In V4L2, this is even more
> true: if you read statistics such as MAD or the bitstream size on a
> frame-by-frame basis, then you will never queue more than one buffer on the
> capture side, so the programming latency (including RC latency) will directly
> impact the encoder throughput. With offloading, the statistics can be handled
> in firmware, or at least without any context switch, which improves throughput.

Right, that is a very valid and central point. Indeed, we need a way to decide
on the encode parameters for the next frame pretty much as soon as the next m2m
job is started. Waiting for userspace to decide based on returned statistics
would definitely stall the encoder for a while.

On the other hand, there are cases where we cannot handle it all kernel-side
and we do need userspace interaction between the previous and the next frame.

So here is a suggestion which may sound a bit wild but seems to me like it
could actually work out: how about adding BPF support to V4L2 for implementing
the encoder strategy?

That way, the kernel and userspace would work on common ground, with everything
actually running kernel-side without starving the encoder.

I guess we'd essentially need to provide the BPF program with enough
information (maybe some of it hardware-specific) to decide on the next frame's
encode parameters.

Of course, I have no prior experience with implementing this, but again it
feels like it could be a good fit for the situation we have to deal with.
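
To make it a bit more concrete, here is a rough sketch of what such a program
could look like, assuming a made-up "v4l2_rc" program type and an invented
context layout carrying the per-frame statistics:

    #include <linux/types.h>
    #include <bpf/bpf_helpers.h>

    /* Hypothetical context the kernel would pass to the program before
     * each frame; the layout is invented for illustration. */
    struct v4l2_rc_ctx {
        __u32 last_frame_bytes; /* size of the previous encoded frame */
        __u32 target_bytes;     /* per-frame budget from the bitrate target */
        __u32 last_qp;          /* QP used for the previous frame */
    };

    SEC("v4l2_rc") /* hypothetical program type */
    int rate_control(struct v4l2_rc_ctx *ctx)
    {
        __u32 qp = ctx->last_qp;

        /* Naive proportional step: nudge QP against the size error. */
        if (ctx->last_frame_bytes > ctx->target_bytes && qp < 51)
            qp++;
        else if (ctx->last_frame_bytes < ctx->target_bytes && qp > 0)
            qp--;

        return qp; /* QP to program for the next frame */
    }

    char LICENSE[] SEC("license") = "GPL";

The same logic could then run kernel-side (no context switch, so the encoder is
never starved) while still being supplied by userspace when it wants a custom
strategy.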

> To be fair, the GStreamer implementation we did for the last RFC runs frame by
> frame, using the last frame's size as the statistic. We still reached the
> specified IP performance documented in the white paper.

That's nice but indeed suboptimal. Let's beat that white paper.
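
For reference, that frame-by-frame approach is inherently serialized: the next
frame's QP can only be set once the previous capture buffer has been dequeued
and its size read, which is exactly the latency you describe. A sketch of the
userspace loop, with compute_qp() and set_next_qp() as hypothetical helpers:

    #include <sys/ioctl.h>
    #include <linux/videodev2.h>

    /* Userspace-driven rate control: only one capture buffer is ever in
     * flight, since each frame's size feeds the next frame's QP.
     * compute_qp() and set_next_qp() are hypothetical helpers. */
    static void rc_loop(int fd, unsigned int qp, unsigned int target_bytes)
    {
        for (;;) {
            struct v4l2_plane planes[1] = { 0 };
            struct v4l2_buffer buf = {
                .type = V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE,
                .memory = V4L2_MEMORY_MMAP,
                .length = 1,
                .m.planes = planes,
            };

            if (ioctl(fd, VIDIOC_DQBUF, &buf) < 0)
                break;

            /* Feed the produced size into the next frame's QP decision. */
            qp = compute_qp(qp, planes[0].bytesused, target_bytes);
            set_next_qp(fd, qp);

            ioctl(fd, VIDIOC_QBUF, &buf); /* only now can the next job start */
        }
    }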

> Like everything else, we don't need all this in a first uAPI, but we need to
> define the minimum "required" features.
> 
> > 
> > Then userspace would be responsible for configuring each encode run with a
> > target QP value, a picture type and a list of references. We'd also need to
> > inform userspace of how many references are supported.
> 
> The H1 only has 1 reference + 1 long-term reference (of which only the single
> reference was implemented). We used the default reference model, so there was
> only one way to manage and pass references. There is clearly a lot more
> research to be done around reference management.

Yes, absolutely.
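
As a strawman for that discussion, the per-frame configuration could carry
something along these lines (all of it invented, just to frame the question):

    #include <linux/types.h>

    /* Hypothetical per-frame encode parameters, set through the Request
     * API alongside each source buffer. All names are made up. */
    struct hypothetical_enc_frame_params {
        __u8 pic_type;    /* e.g. I or P */
        __u8 qp;          /* picture QP when userspace drives rate control */
        __u8 num_refs;    /* bounded by hardware, e.g. 1 on the H1 */
        __u64 ref_ts[2];  /* timestamps of the reference capture buffers */
    };

How userspace learns the supported number of references (a read-only control,
capability flags, ...) would be part of the same discussion.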

Cheers,

Paul

-- 
Paul Kocialkowski,

Independent contractor - sys-base - https://www.sys-base.io/
Free software developer - https://www.paulk.fr/

Expert in multimedia, graphics and embedded hardware support with Linux.
