Re: Stateless Encoding uAPI Discussion and Proposal

Nicolas Dufresne <nicolas.dufresne@xxxxxxxxxxxxx> · Tue, 22 Aug 2023 16:31:34 -0400

Hi,

[...]

> > In cable streaming notably, the RC job is to monitor the amount of bits over a
> > period of time (the window). This window is defined by the streaming hardware
> > buffering capabilities. Best at this point is to start reading through HRD
> > specifications, and open source rate control implementation (notably x264).
> > 
> > I think overall, we can live with adding hints where needed, and if the GOP
> > information is an appropriate hint, then we can just reuse the existing control.
> > 
> Why do we still care about GOP here? Hardware has no idea about GOP at
> all. Although in codecs like HEVC the NALU headers of IDR and intra
> pictures are different, there is no difference in the hardware coding
> configuration. The NALU header is usually generated by the userspace.
> 
> Whether future encoding would regard the current encoded picture as an
> IDR is decided completely by the userspace.

The discussion was around having a basic RC algorithm in the kernel driver,
possibly making use of hardware-specific features without actually exposing them
all to userspace. So assuming we do that:

Paul's concern is that, for best results, an RC algorithm could use knowledge of
keyframe placement to preserve bucket space (possibly using the last keyframe
size as a hint). Exposing the GOP structure in some form allows "prediction", so
the adaptation can look ahead at the future budget without introducing latency.
There is an alternative, which is to require ahead-of-time queuing of encode
requests. But this does introduce latency since, the way it works in V4L2 today,
we need the picture to be filled by the time we request an encode.
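
To make the "prediction" idea concrete, here is a minimal sketch (not any
driver's actual algorithm). It assumes a fixed GOP size, a keyframe at the start
of each GOP, and uses the last keyframe size as the estimate for the next one.
The function name and parameters are hypothetical.

```python
def frame_budgets(bits_per_frame, gop_size, last_keyframe_bits, num_frames):
    """Return per-frame bit budgets that reserve space for each keyframe.

    Hypothetical model: frame 0 of every GOP is a keyframe expected to cost
    about last_keyframe_bits; the rest of the GOP budget is split evenly
    across the remaining frames.
    """
    gop_bits = bits_per_frame * gop_size
    # Never let the keyframe estimate consume the whole GOP budget.
    key_bits = min(last_keyframe_bits, gop_bits - (gop_size - 1))
    inter_bits = (gop_bits - key_bits) // (gop_size - 1) if gop_size > 1 else 0
    budgets = []
    for n in range(num_frames):
        budgets.append(key_bits if n % gop_size == 0 else inter_bits)
    return budgets
```

A real RC would refresh last_keyframe_bits after every encoded keyframe and
track the actual bucket fullness; this only shows why knowing keyframe placement
lets the budget be planned without queuing pictures in advance.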

Though, if we drop the GOP structure and favour this approach, the latency could
be regained later by introducing fence-based streaming. The technique would be
for a video source (like a capture driver) to pass dmabufs that aren't filled
yet, but have a companion fence. This would allow queuing requests ahead of time,
and all we need is enough pre-allocation to accommodate the desired lookahead.
The only issue is that perhaps this violates the fundamental rule of "short term"
delivery of fences. But fences can also fail, I think, in case the capture was
stopped.
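
As a rough userspace model of that flow (not the dma-fence API), the sketch
below uses Python's concurrent.futures.Future as a stand-in for a fence. The
EncodeQueue class and its methods are hypothetical names for illustration.

```python
from collections import deque
from concurrent.futures import Future


class EncodeQueue:
    """Toy encoder queue: requests are queued before their buffers are filled."""

    def __init__(self):
        self.pending = deque()
        self.encoded = []

    def queue(self, buffer, fence):
        # The buffer may still be empty here; the fence signals when it is filled.
        self.pending.append((buffer, fence))

    def lookahead(self):
        # Number of future pictures the rate control already knows about.
        return len(self.pending)

    def process_next(self, timeout=None):
        buffer, fence = self.pending.popleft()
        try:
            fence.result(timeout=timeout)  # block until the producer signals
        except Exception:
            return False  # fence failed, e.g. the capture was stopped
        self.encoded.append(bytes(buffer))  # placeholder for the real encode
        return True
```

The lookahead depth is bounded by how many buffers are pre-allocated, and a
failed fence has to be handled as a dropped request rather than a hang.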

We can certainly move forward with this as a future solution, or simply not
implement a future-aware RC algorithm, which avoids the huge task this involves
(and possibly patents?).

[...]
> > 

> > Of course, the subject is much more relevant when there are encoders with more
> > than one reference. But you are correct, what the commands do is allow changing,
> > adding or removing any reference from the list (random modification), as long as
> > they fit in the codec constraints (like the DPB size notably). This is the only
> > way one can implement temporal SVC reference patterns, robust reference trees or
> > RTP RPSI. Note that long-term references also exist, and are less complex than
> > these commands.
> > 
> 
> If the userspace could manage the lifetime of reconstruction
> buffers (assignment, reference), we don't need a command here.

Sorry if I created confusion; the comment was specific to H.264 coding. The
commands are a compressed form for the reference lists. This information is
coded in the slice header and enabled through adaptive_ref_pic_marking_mode_flag.

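
For readers unfamiliar with that syntax, here is a minimal sketch of how the
non-IDR branch of dec_ref_pic_marking() is serialized, based on my reading of
the H.264 slice header syntax. The helper names are hypothetical, bits are kept
as a string for readability, and the caller is assumed to pass each operation's
operands in syntax order.

```python
def ue(value):
    """Exp-Golomb ue(v) code for a non-negative integer, as a bit string."""
    code = bin(value + 1)[2:]
    return "0" * (len(code) - 1) + code


def dec_ref_pic_marking_non_idr(mmco_ops):
    """Bits of dec_ref_pic_marking() for a non-IDR slice.

    mmco_ops is a list of (operation, operands...) tuples, without the
    terminating operation 0. An empty list selects sliding-window marking.
    """
    if not mmco_ops:
        return "0"  # adaptive_ref_pic_marking_mode_flag = 0
    bits = "1"      # adaptive_ref_pic_marking_mode_flag = 1
    for op, *operands in mmco_ops:
        bits += ue(op)  # memory_management_control_operation
        for operand in operands:
            bits += ue(operand)
    return bits + ue(0)  # operation 0 terminates the list
```

For example, dec_ref_pic_marking_non_idr([(1, 0)]) marks one short-term picture
as unused (operation 1 with difference_of_pic_nums_minus1 = 0). The result is
not a whole number of bytes, which is the alignment issue discussed next.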
It was suggested so far to leave H.264 slice header writing to the driver. This
is motivated by the H.264 slice header not being byte aligned in size, so it is
hard to combine with the slice_data(). Also, some hardware actually produces the
slice_header. This needs actual hardware interface analysis, because an H.264
slice header is worth nothing if it cannot instruct the decoder how to maintain
the desired reference state.

I think this aspect should probably not be generalized to all codecs, since the
packing semantics can differ widely. When the codec header is indeed byte
aligned, it can easily be separated and combined by the application, improving
application flexibility and reducing kernel API complexity.
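
A small illustration of why alignment matters, as a sketch only: the combine()
helper is hypothetical, the header is given as a bit string, and the tail
padding is simplified compared to real trailing-bits rules.

```python
def combine(header_bits, payload):
    """Prepend a header given as a bit string to a byte payload."""
    if len(header_bits) % 8 == 0:
        # Byte-aligned header: plain concatenation, the payload is untouched.
        header = int(header_bits, 2).to_bytes(len(header_bits) // 8, "big") if header_bits else b""
        return header + payload
    # Unaligned header: every payload bit has to be shifted.
    payload_bits = "".join(f"{byte:08b}" for byte in payload)
    bits = header_bits + payload_bits
    bits += "0" * (-len(bits) % 8)  # zero-pad the tail, for illustration only
    return int(bits, 2).to_bytes(len(bits) // 8, "big")
```

In the aligned case the application can splice buffers with no knowledge of the
payload; in the unaligned case whoever writes the header has to rewrite the
whole payload, which is why the driver (or hardware) is the natural place for it.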
> 
> It is just a problem of how to design another request API control 
> structure to select which buffers would be used for list0, list1.
> > I think this raises a big question, and I never checked how this worked with,
> > let's say, VA. Shall we let the driver resolve the changes into commands? (VP8
> > has something similar, while VP9 and AV1 use refresh flags, which are trivial
> > to compute.) I believe I'll have to investigate this further.
> > 
> > > > 
> > [...]

regards,
Nicolas