Re: [PATCH 08/10] media: uapi: h264: Clean slice invariants syntax elements

Nicolas Dufresne <nicolas.dufresne@xxxxxxxxxxxxx> · Mon, 27 Jul 2020 15:43:48 -0400

On Monday, 27 July 2020 at 20:10 +0200, Tomasz Figa wrote:
> On Mon, Jul 27, 2020 at 6:18 PM Ezequiel Garcia <ezequiel@xxxxxxxxxxxxx> wrote:
> > On Mon, 2020-07-27 at 16:52 +0200, Tomasz Figa wrote:
> > > On Mon, Jul 27, 2020 at 4:39 PM Ezequiel Garcia <ezequiel@xxxxxxxxxxxxx> wrote:
> > > > Hi Alexandre,
> > > > 
> > > > Thanks a lot for the review.
> > > > 
> > > > On Sat, 2020-07-25 at 23:34 +0900, Alexandre Courbot wrote:
> > > > > On Thu, Jul 16, 2020 at 5:23 AM Ezequiel Garcia <ezequiel@xxxxxxxxxxxxx> wrote:
> > > > > > The H.264 specification requires in its "Slice header semantics"
> > > > > > section that the following values shall be the same in all slice headers:
> > > > > > 
> > > > > >   pic_parameter_set_id
> > > > > >   frame_num
> > > > > >   field_pic_flag
> > > > > >   bottom_field_flag
> > > > > >   idr_pic_id
> > > > > >   pic_order_cnt_lsb
> > > > > >   delta_pic_order_cnt_bottom
> > > > > >   delta_pic_order_cnt[ 0 ]
> > > > > >   delta_pic_order_cnt[ 1 ]
> > > > > >   sp_for_switch_flag
> > > > > >   slice_group_change_cycle
> > > > > > 
> > > > > > and can therefore be moved to the per-frame decode parameters control.
> > > > > 
> > > > > I am really not a H.264 expert, so this question may not be relevant,
> > > > 
> > > > All questions are welcome. I'm more than happy to discuss this patchset.
> > > > 
> > > > > but are these values specified for every slice header in the
> > > > > bitstream, or are they specified only once per frame?
> > > > > 
> > > > > I am asking this because it would certainly make user-space code
> > > > > simpler if we could remain as close to the bitstream as possible. If
> > > > > these values are specified once per slice, then factorizing them would
> > > > > leave user-space with the burden of deciding what to do if they change
> > > > > across slices.
> > > > > 
> > > > > Note that this is a double-edged sword, because it is not necessarily
> > > > > better to leave the firmware in charge of deciding what to do in such
> > > > > a case. :) So hopefully these are only specified once per frame in the
> > > > > bitstream, in which case your proposal makes complete sense.
> > > > 
> > > > Frame-based hardware accelerators such as Hantro and Rockchip VDEC
> > > > are doing the slice header parsing themselves. Therefore, the
> > > > driver is not really parsing these fields on each slice header.
> > > > 
> > > > Currently, we are already using only the first slice in a frame,
> > > > as you can see from:
> > > > 
> > > >         if (slices[0].flags & V4L2_H264_SLICE_FLAG_FIELD_PIC)
> > > >                 reg |= G1_REG_DEC_CTRL0_PIC_FIELDMODE_E;
> > > > 
> > > > Even if these fields are transported in the slice header,
> > > > I think it makes sense for us to split them into the decode params
> > > > (per-frame) control.
> > > > 
> > > > They are really specified to be the same across all slices,
> > > > so I'd even say that if a bitstream violates this, it's likely
> > > > either a corrupted bitstream or an encoder bug.
> > > > 
> > > > OTOH, one thing this makes me realize is that the slice params control
> > > > is wrongly specified as an array.
> > > 
> > > It is _not_.
> > > 
> > 
> > We introduced the hold capture buffer specifically to support
> > this without having a slice array.
> > 
> > I don't think we have a plan to support this control properly
> > as an array.
> > 
> > If we decide to support the slice control as an array,
> > we would have to implement a mechanism to specify the array size,
> > which we currently don't have AFAIK.
> > 
> 
> That wasn't the conclusion when we discussed this last time on IRC.
> +Nicolas Dufresne
> 
> Currently the 1-slice per buffer model is quite impractical:
> 1) the maximum number of buffers is 32, which for some streams can be
> less than needed to queue a single frame,

To give more context, it seems the discussion was about being able to
use a slice decoder with a one-poll()-per-frame model. Of course this
will never be as efficient as using a frame-based decoder, but with the
current design you can keep a list of pending requests (each request is
1 slice/buffer), and simply use memory pressure to poll a midpoint and
dequeue everything up to it. For example, you have 8 pending requests
and reach your memory limit:

  [R1, R2, R3, R4, R5, R6, R7, R8]

Since requests complete in order and behave like memory fences, you can
pick R6 and poll() on that one. When R6 is ready, you can then dequeue
R1 to R6 without blocking. In this context, a limit of 16 or 32 buffers
seems fair; the optimization we can do in userspace is likely
sufficient. So I'd like to drop problem 1) from our list.
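
A minimal sketch of that strategy, in plain C with no V4L2 calls. The
names pick_poll_target() and drainable_after() are hypothetical, and the
number of requests left in flight is an assumed tuning parameter; the
only property relied on is that requests complete in order.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: given n_pending queued requests (index 0 is the
 * oldest), return the index of the request to poll() on so that
 * keep_in_flight requests stay queued behind it.
 */
static size_t pick_poll_target(size_t n_pending, size_t keep_in_flight)
{
	assert(n_pending > 0);
	if (keep_in_flight >= n_pending)
		return 0;	/* nothing to spare: wait on the oldest request */
	return n_pending - keep_in_flight - 1;
}

/*
 * Because requests complete in order, once the target has completed
 * every older request has completed too, so this many requests can be
 * dequeued without blocking.
 */
static size_t drainable_after(size_t target_index)
{
	return target_index + 1;
}
```

With 8 pending requests and 2 kept in flight, pick_poll_target(8, 2)
returns index 5, i.e. R6, and drainable_after(5) is 6 (R1 to R6).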

> 2) even more system call overhead due to the need to repeat various
> operations (e.g. qbuf/dqbuf) per-slice rather than per-frame,
> 3) no way to do hardware batching for hardware which supports queuing
> multiple slices at a time,
> 4) waste of memory - one needs to allocate all the OUTPUT buffers
> pessimistically to accommodate the biggest possible slice, while with
> all-slices-per-frame 1 buffer could be just heuristically allocated to
> be enough for the whole frame.
> 
> These need to be carefully evaluated, with some proper testing done to
> confirm whether they are really a problem or not.

2), 3) and 4) seem to match what the currently unimplemented API
proposes. You can mitigate 2) by having multiple slices per buffer. That
came with a byte offset so we can program the HW as if they were
separate slice buffers. It was limited to 16 buffers, though, which is
likely a fair compromise.

3) is about batching. In the only use case we know of, the batching
acceleration consists of programming the next operation on the
completion IRQ. I already looked with the Cedrus developers at whether
and how that was feasible, but we don't have a PoC yet. The optimization
is about removing context switches between operations, which could
otherwise prevent fully using the HW.

4) is also well covered by being able to multiplex 1 buffer with
multiple slices.

To be fair, I understand why we'd like to drop this API, as none of the
developers active here on the slice decoder (cedrus) have time to engage
in supporting this untested "optimization". It's not only about kernel
support; it also requires userspace work. I also agree that it could
be added later, as an extension. It could be done with 3 new controls
(an array of slice_params, an array of slice start offsets, and the
number of slices), or with just one: a new structure that has a
slice_params structure embedded, num_slices, and an array of
slice_start_offset. I don't have a preference myself; I'm just
illustrating that yes, we could drop the slice batching to avoid
pushing untested APIs without sacrificing our ability to decode a valid
stream.
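
To make the single-control option concrete, here is a rough sketch of
what such a structure could look like. Nothing here is an existing or
proposed uAPI: every example_* name is hypothetical, the placeholder
params struct stands in for the real slice parameters, and the cap of
16 entries is an assumption borrowed from the limit mentioned above.

```c
#include <stdbool.h>
#include <stdint.h>

#define EXAMPLE_MAX_SLICES 16	/* assumed cap, not a defined uAPI limit */

/* Placeholder standing in for the existing per-slice parameters. */
struct example_slice_params {
	uint32_t flags;
};

/* Hypothetical single-control layout: embedded params, count, offsets. */
struct example_slice_batch {
	struct example_slice_params params;
	uint32_t num_slices;
	uint32_t slice_start_offset[EXAMPLE_MAX_SLICES];
};

/* Check that offsets are strictly increasing and fall inside the buffer. */
static bool example_slice_batch_valid(const struct example_slice_batch *b,
				      uint32_t buf_size)
{
	uint32_t i;

	if (b->num_slices == 0 || b->num_slices > EXAMPLE_MAX_SLICES)
		return false;
	for (i = 0; i < b->num_slices; i++) {
		if (b->slice_start_offset[i] >= buf_size)
			return false;
		if (i > 0 &&
		    b->slice_start_offset[i] <= b->slice_start_offset[i - 1])
			return false;
	}
	return true;
}
```

The validation helper only illustrates the kind of sanity check a
driver would presumably want before programming each slice offset into
the HW; the actual checks would depend on the final control layout.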

> 
> > > > Namely, this text
> > > > should be removed:
> > > > 
> > > >        This structure is expected to be passed as an array, with one
> > > >        entry for each slice included in the bitstream buffer.
> > > > 
> > > > As the API is really not defined that way.
> > > > 
> > > > I'll remove that on next iteration.
> > > 
> > > The v4l2_ctrl_h264_slice_params struct has more data than those that
> > > are deemed to be the same across all the slices. A remarkable example
> > > are the size and start_byte_offset fields.
> > 
> > Not sure how this applies to this discussion.
> 
> These fields need to be specified for each slice in the buffer to make
> it possible to handle multiple slices per buffer.
> 
> Best regards,
> Tomasz