Hello again, I was away last week.

On Friday, 11 August 2023 at 22:08 +0200, Paul Kocialkowski wrote:
> Hi Nicolas,
> 
> On Thu 10 Aug 23, 10:34, Nicolas Dufresne wrote:
> > On Thursday, 10 August 2023 at 15:44 +0200, Paul Kocialkowski wrote:
> > > Hi folks,
> > > 
> > > On Tue 11 Jul 23, 19:12, Paul Kocialkowski wrote:
> > > > I am now working on an H.264 encoder driver for Allwinner platforms
> > > > (currently focusing on the V3/V3s), which already provides some usable
> > > > bitstream and will be published soon.
> > > 
> > > So I wanted to share an update on my side, since I've been making
> > > progress on the H.264 encoding work for Allwinner platforms. At this
> > > point the code supports IDR, I and P frames, with a single reference.
> > > It also supports GOP (both closed and open, with an IDR or I frame
> > > interval and explicit keyframe requests) but uses QP controls and does
> > > not yet provide rate control. I hope to be able to implement rate
> > > control before we can make a first public release of the code.
> > 
> > Just a reminder that we will review the API first; the supporting
> > implementation will just be a companion. So in this context, the sooner
> > the better for an RFC here.
> 
> I definitely want to have some proposal that is (even vaguely) agreed upon
> before proposing patches for mainline, even at the stage of an RFC.
> 
> While I already have working results at this point, the API that is used is
> very basic and just reuses controls from stateful encoders, with no extra
> additions. Various assumptions are made in the kernel and there is no real
> reference management, since the previous frame is always expected to be
> used as the only reference.

One thing we are looking at these days, which isn't currently controllable
through the stateful interface, is RTP RPSI (reference picture selection
indication). This is feedback that a remote decoder sends when a reference
picture has been decoded.
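To make that concrete, here is a rough sketch of RPSI-gated reference
selection from the application side (all names are hypothetical, this is not
an existing API): the encoder keeps encoding against an acknowledged "golden"
reference and only switches once the remote decoder confirms it has
reconstructed a newer frame.

```python
# Hedged sketch (hypothetical names): keep using an acknowledged reference
# and only promote a new one once the remote decoder reports, via RPSI,
# that it has decoded it.

class RpsiReferenceTracker:
    def __init__(self):
        self.golden = None   # last reference acknowledged by the decoder
        self.pending = None  # candidate reference awaiting RPSI feedback

    def on_frame_encoded(self, frame_id):
        # Propose the newly reconstructed frame as the next reference,
        # but keep encoding against the acknowledged one until confirmed.
        if self.pending is None:
            self.pending = frame_id

    def on_rpsi_received(self, frame_id):
        # The remote decoder reports it has decoded this reference picture.
        if frame_id == self.pending:
            self.golden = frame_id
            self.pending = None

    def reference_for_next_frame(self):
        return self.golden


tracker = RpsiReferenceTracker()
tracker.on_frame_encoded(0)   # frame 0 proposed as reference
tracker.on_rpsi_received(0)   # decoder acknowledges frame 0
assert tracker.reference_for_next_frame() == 0
tracker.on_frame_encoded(1)   # frame 1 proposed, not yet acknowledged
assert tracker.reference_for_next_frame() == 0  # still encodes against 0
```

This is the per-frame decision (which buffer to use as reference) that is
hard to express through the stateful controls but natural with requests.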
In short, even if only one reference is used, we'd like the reference to
change only once we have received the acknowledgement that the new one has
been reconstructed on the other side. I'm not super keen on having to modify
the Linux kernel specifically for this feature, especially since similar APIs
offer it at a lower level (VA, D3D12, and probably future APIs).

> We plan to make a public release at some point in the near future which
> shows these working results, but it will not be a base for our discussion
> here yet.
> 
> > > One of the main topics of concern now is how reference frames should be
> > > managed and how it should interact with kernel-side GOP management and
> > > rate control.
> > 
> > Maybe we need to have a discussion about kernel-side GOP management
> > first? While I think kernel-side rate control is unavoidable, I don't
> > think stateless encoders should have kernel-side GOP management.
> 
> I don't have strong opinions about this. The rationale for my proposal is
> that kernel-side rate control will be quite difficult to operate without
> knowledge of the period at which intra/inter frames are produced. Maybe
> there are known methods to handle this, but I have the impression that most
> rate control implementations use the GOP size as a parameter.
> 
> More generally I think an expectation behind rate control is to be able to
> decide at which time a specific frame type is produced. This is not
> possible if the decision is entirely up to userspace.

In television (and YouTube) streaming, the GOP size is just fixed, and you
deal with it. In fact, I have never seen the GOP or picture pattern being
modified by the rate control.

In general, high-end rate controls will follow an HRD specification. The rate
control will require information that represents constraints, and this is not
limited to the rate. In H.264/HEVC, the level and profile will play a role,
but you could also add the VBV size and probably more. I have never read the
HRD specification completely.
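To illustrate the kind of bookkeeping this involves, here is a toy sketch
(not an HRD implementation, just the windowed bit-budget idea) that tracks
the bits produced over a sliding window of frames and compares them against
the budget implied by the target bitrate:

```python
from collections import deque

# Toy sketch of windowed bit-budget tracking (not HRD-compliant): monitor
# the bits produced over the last `window_frames` frames and compare against
# the budget the target bitrate allows for that window.

class WindowedRateMonitor:
    def __init__(self, target_bps, fps, window_frames):
        self.window = deque(maxlen=window_frames)
        # Bit budget for a full window at the target bitrate.
        self.budget = target_bps * window_frames / fps

    def add_frame(self, frame_bits):
        self.window.append(frame_bits)

    def fullness(self):
        # > 1.0 means over budget (QP should go up);
        # < 1.0 means headroom (QP can go down).
        return sum(self.window) / self.budget


mon = WindowedRateMonitor(target_bps=1_000_000, fps=25, window_frames=25)
for _ in range(25):
    mon.add_frame(40_000)  # exactly on budget: 25 * 40000 bits per second
print(mon.fullness())  # → 1.0
```

A real rate control would map this fullness back to QP decisions and respect
the level/profile and VBV constraints mentioned above; this only shows the
windowed measurement side.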
In cable streaming notably, the RC's job is to monitor the amount of bits
over a period of time (the window). This window is defined by the streaming
hardware's buffering capabilities. The best thing at this point is to start
reading through HRD specifications and open-source rate control
implementations (notably x264).

I think overall we can live with adding hints where needed, and if the GOP
information is an appropriate hint, then we can just reuse the existing
control.

> > > Leaving GOP management to the kernel side implies having it decide
> > > which frame should be IDR, I or P (and B for encoders that can support
> > > it), while keeping the possibility to request a keyframe (IDR) and
> > > configure the GOP size. Now it seems to me that this is already a good
> > > balance between giving userspace a decent level of control while not
> > > having to specify the frame type explicitly for each frame or maintain
> > > a GOP in userspace.
> > 
> > My expectation for a stateless encoder is to have to specify the frame
> > type and the associated references if the type requires them.

Ack. For us, this is also why we would require requests (unlike stateful
encoders), as we have per-frame information to carry, and requests explicitly
attach that information to the frame.

> > > Requesting the frame type explicitly seems more fragile as many
> > > situations will be invalid (e.g. requesting a P frame at the beginning
> > > of the stream, etc.) and it generally requires userspace to know a lot
> > > about what the codec assumptions are. Also for B frames the decision
> > > would need to be consistent with the fact that a following frame (in
> > > display order) would need to be submitted earlier than the current
> > > frame and inform the kernel so that the picture order count (display
> > > order indication) can be maintained. This is not impossible or out of
> > > reach, but it brings a lot of complexity for little advantage.
> > We have had a lot more consistent results over the last decade with
> > stateless hardware codecs, in contrast to stateful ones where we end up
> > with wide variations in behaviour. This applies to Chromium, GStreamer
> > and any active users of VA encoders really. I'm strongly in favour of a
> > stateless reference API out of the Linux kernel.
> 
> Okay, I understand the lower level of control makes it possible to get much
> better results than opaque firmware-driven encoders and it would be a shame
> not to leverage this possibility with an API that is too restrictive.
> 
> However I do think it should be possible to operate the encoder without a
> lot of codec-specific supporting code from userspace. This is also why I
> like having kernel-side rate control (among other reasons).

Ack. We need a compromise here.

[...]

> > > The next topic of interest is reference management. It seems pretty
> > > clear that the decision of whether a frame should be a reference or not
> > > always needs to be taken when encoding that frame. In H.264 the
> > > nal_ref_idc slice header element indicates whether a frame is marked as
> > > reference or not. IDR frames can additionally be marked as long-term
> > > reference (if I understood correctly, the frame will stay in the
> > > reference picture list until the next IDR frame).
> > 
> > This is incorrect. Any frame can be marked as a long-term reference; it
> > does not matter what type it is. From what I recall, marking of the long
> > term in the bitstream uses an explicit index, so there are no specific
> > rules on which one gets evicted. Long-term references are of course
> > limited, as they occupy space in the DPB.
> > 
> > Also, each codec has different DPB semantics. For H.264, the DPB can run
> > in two modes. The first is a simple fifo: in this case, any frame you
> > encode and want to keep as a reference is pushed into the DPB (which has
> > a fixed size, minus the long-term references). If full, the oldest frame
> > is removed. It is not bound to IDR or GOP.
> > Though, an IDR will implicitly cause the decoder to evict everything
> > (including long-term references).
> > 
> > The second mode uses the memory management commands. This is a series of
> > instructions that the encoder can send to the decoder. The specification
> > is quite complex; it is a common source of bugs in decoders and a place
> > where stateless hardware codecs perform more consistently in general.
> > Through the commands, the encoder ensures that the decoder's DPB
> > representation stays in sync.
> 
> This is also what I understand from repeated reading of the spec, and
> thanks for the summary write-up!
> 
> My assumption was that it would be preferable to operate in the simple fifo
> mode since the memory management commands need to be added to the bitstream
> headers and require coordination from the kernel. Like you said, it seems
> complex and error-prone.
> 
> But maybe this mechanism could be used to allow any particular reference
> frame configuration, opening the way for userspace to fully decide what the
> reference buffer lists are? Also it would be good to know if such
> mechanisms are generally present in codecs or if most of them have an
> implicit reference list that cannot be modified.

Of course, the subject is much more relevant when there are encoders with
more than one reference. But you are correct: what the commands do is allow
changing, adding or removing any reference from the list (random
modification), as long as the result fits within the codec constraints (like
the DPB size notably). This is the only way one can implement temporal SVC
reference patterns, robust reference trees or RTP RPSI. Note that long-term
references also exist, and are less complex than these commands.

I think this raises a big question, and I never checked how this works with,
let's say, VA. Shall we let the driver resolve the changes into commands?
(VP8 has something similar, while VP9 and AV1 have refresh flags, which are
just trivial to compute.) I believe I'll have to investigate this further.
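As a toy model of the fifo behaviour discussed above (all names hypothetical,
and this glosses over the real H.264 marking rules), here is a sketch of a
sliding-window DPB: short-term references are evicted oldest-first, long-term
references are pinned under an explicit index, and an IDR flushes everything:

```python
# Toy model of an H.264-style DPB in sliding-window (fifo) mode, per the
# description above: short-term refs are evicted oldest-first when the DPB
# is full, long-term refs occupy pinned slots, and an IDR evicts everything.

class SlidingWindowDpb:
    def __init__(self, size):
        self.size = size
        self.short_term = []  # fifo, oldest first
        self.long_term = {}   # explicit index -> frame

    def push(self, frame, long_term_idx=None, idr=False):
        if idr:
            # An IDR implicitly evicts everything, including long-term refs.
            self.short_term.clear()
            self.long_term.clear()
        if long_term_idx is not None:
            # Long-term marking uses an explicit index: no implicit eviction
            # rule, it just replaces whatever held that index.
            self.long_term[long_term_idx] = frame
            return
        # Short-term fifo: long-term refs shrink the available space.
        while (self.short_term
               and len(self.short_term) >= self.size - len(self.long_term)):
            self.short_term.pop(0)  # evict the oldest short-term reference
        self.short_term.append(frame)

    def references(self):
        return self.short_term + list(self.long_term.values())


dpb = SlidingWindowDpb(size=3)
dpb.push("idr0", idr=True)
dpb.push("p1")
dpb.push("p2", long_term_idx=0)  # pinned as long-term, one slot consumed
dpb.push("p3")                   # only 2 short-term slots left: evicts idr0
dpb.push("p4")                   # evicts p1
```

The memory management commands (and the VP9/AV1 refresh flags) exist
precisely because this implicit fifo cannot express arbitrary list edits.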
> > [...]
> > 
> > > Additional information gathered:
> > > - It seems likely that the Allwinner Video Engine only supports one
> > >   reference frame. There's a register for specifying the rec buffer of
> > >   a second one but I have never seen the proprietary blob use it. It
> > >   might be as easy as specifying a non-zero address there, but it might
> > >   also be ignored or require some undocumented bit to use more than one
> > >   reference. I haven't made any attempt at using it yet.
> > 
> > There is something in that fact that makes me think of the Hantro H1. The
> > Hantro H1 also has a second reference, but no one ever uses it. We have
> > it on our todo list to actually give this a look.
> 
> Having looked at both register layouts, I would tend to think both designs
> are distinct. It's still unclear where Allwinner's video engine comes from:
> perhaps they made it in-house, perhaps some obscure Chinese design house
> made it for them, or it could be known hardware with a modified register
> layout.

Ack.

> I would also be interested to know if the H1 can do more than one
> reference!

From what we have in our pretty thin documentation, references are "searched"
for fuzzy matches and motion. So when you pass two references to the encoder,
the encoder will search equally in both. I suspect it does a lot more than
that, and saves some information in the auxiliary buffers that exist per
reference, but this isn't documented and I'm not specialized enough really.

From a usage perspective, all you have to do is give it access to the
reference picture data (reconstructed image and auxiliary data). The result
is compressed macroblock data that may refer to these. We don't really know
if it is used, but we do assume it is and place it in the reference list.
This is of course the normal thing to do, especially when using a reference
fifo.

In theory, you could implement multiple references with hardware that only
supports one.
A technique could be to compress the image multiple times, and keep the
"best" result for the current configuration. Though, a proper multi-pass
encoder would avoid the bandwidth overhead of compressing and writing the
temporary result.

> > > - Contrary to what I said after Andrzej's talk at EOSS, most Allwinner
> > >   platforms do not support VP8 encode (despite Allwinner's proprietary
> > >   blob having an API for it). The only platform that advertises it is
> > >   the A80, and this might actually be a VP8-only Hantro H1. It seems
> > >   that the API they developed in the library stuck around even if no
> > >   other platform can use it.
> > 
> > Thanks for letting us know. Our assumption is that a second hardware
> > design is unlikely, as Google was giving it away for free to any hardware
> > maker that wanted it.
> 
> > > Sorry for the long email again, I'm trying to be a bit more explanatory
> > > than just giving some bare conclusions that I drew on my own.
> > > 
> > > What do you think about these ideas?
> > 
> > In general, we diverge on the direction we want the interface to take.
> > What you seem to describe now is just a normal stateful encoder
> > interface, with everything needed to drive the stateless hardware
> > implemented in the Linux kernel. There is no parsing or other unsafety in
> > encoders, so I don't have a strict no-go argument against that, but for
> > me it means much more complex drivers and less flexibility. The VA model
> > has been working great for us in the past, giving us the ability to
> > implement new features, or even slightly off-spec features, while the
> > Linux kernel might not be the right place for these experimental methods.
> 
> VA seems too low-level for our case here, as it seems to expect full
> control over more or less each bitstream parameter that will be produced.
> 
> I think we have to find some middle ground that is not as limiting as
> stateful encoders but not as low-level as VA.
> > Personally, I would rather discuss around your uAPI RFC though; I think a
> > lot of other devs here would like to see what you have drafted.
> 
> Hehe, I wish I had some advanced proposal here, but my implementation is
> quite simplified compared to what we have to plan for mainline.

No worries, let's do that later then. On our side, we have a similar
limitation, since we have to have something working before we can spend more
time turning it into something upstream. So we have "something" for VP8,
we'll do "something" for H.264, and from there we should be able to iterate.
But having the opportunity to iterate over more capable hardware would
clearly help understand the bigger picture.

cheers,
Nicolas