Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support

Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx> · Tue, 4 Jun 2019 11:38:24 +0200

On Tue, 4 Jun 2019 09:15:28 +0000
Jonas Karlman <jonas@xxxxxxxxx> wrote:

> On 2019-06-04 11:06, Thierry Reding wrote:
> > On Tue, Jun 04, 2019 at 10:49:21AM +0200, Boris Brezillon wrote:  
> >> On Tue, 4 Jun 2019 10:31:57 +0200
> >> Thierry Reding <thierry.reding@xxxxxxxxx> wrote:
> >>  
> >>>>>>> - Using flags
> >>>>>>>
> >>>>>>> The current MPEG-2 controls have lots of u8 values that can be
> >>>>>>> represented as flags. Using flags also helps with padding.
> >>>>>>> It's unlikely that we'll get more than 64 flags, so using a u64 by
> >>>>>>> default for that sounds fine (we definitely do want to keep some room
> >>>>>>> available and I don't think using 32 bits as a default is good enough).
> >>>>>>>
> >>>>>>> I think H.264/HEVC per-control flags should also be moved to u64.      
> >>>>>> There was also some concensus on this, that u64 should be good enough
> >>>>>> for anything out there, though we obviously don't know what the future
> >>>>>> will hold, so perhaps adding some way for possible extending this in the
> >>>>>> future might be good. I guess we'll get new controls for new codecs
> >>>>>> anyway, so we can punt on this until then.
> >>>>>>       
> >>>>>>> - Clear split of controls and terminology
> >>>>>>>
> >>>>>>> Some codecs have explicit NAL units that are good fits to match as
> >>>>>>> controls: e.g. slice header, pps, sps. I think we should stick to the
> >>>>>>> bitstream element names for those.
> >>>>>>>
> >>>>>>> For H.264, that would suggest the following changes:
> >>>>>>> - renaming v4l2_ctrl_h264_decode_param to v4l2_ctrl_h264_slice_header;
> >>>>>>> - killing v4l2_ctrl_h264_decode_param and having the reference lists
> >>>>>>> where they belong, which seems to be slice_header;      
> >>>>> But now here it's being described per slice. When I look at the slice
> >>>>> header, I only see list of modifications and when I look at userspace,
> >>>>> That list is simply built from DPB, the modifications list found in the
> >>>>> slice header seems to be only used to craft the l0/l1 list.    
> >>>> Yes, I think there was a misunderstanding which was then clarified
> >>>> (unfortunately it happened on IRC, so we don't have a trace of this
> >>>> discussion). The reference list should definitely be per-frame, and the
> >>>> L0/L1 slice reflists are referring to the per-frame reference list (it's
> >>>> just a sub-set of the per-frame reflist re-ordered differently).
> >>>>     
> >>>>> There is one thing that come up though, if we enable per-frame decoding
> >>>>> on top of per-slice decoder (like Cedrus), won't we force userspace to
> >>>>> always compute l0/l1 even though the HW might be handling that ?    
> >>>> That's true, the question is, what's the cost of this extra re-ordering?    
> >>> I think ultimately userspace is already forced to compute these lists
> >>> even if some hardware may be able to do it in hardware. There's going to
> >>> be other hardware that userspace wants to support that can't do it by
> >>> itself, so userspace has to at least have the code anyway. What it could
> >>> do on top of that decide not to run that code if it somehow detects that
> >>> hardware can do it already. On the other hand this means that we have to
> >>> expose a whole lot of capabilities to userspace and userspace has to go
> >>> and detect all of them in order to parameterize all of the code.
> >>>
> >>> Ultimately I suspect many applications will just choose to pass the data
> >>> all the time out of simplicity. I mean drivers that don't need it will
> >>> already ignore it (i.e. they must not break if they get the extra data)
> >>> so other than the potential runtime savings on some hardware, there are
> >>> no advantages.
> >>>
> >>> Given that other APIs don't bother exposing this level of control to
> >>> applications makes me think that it's just not worth it from a
> >>> performance point of view.  
> >> That's not exactly what Nicolas proposed. He was suggesting that we
> >> build those reflists kernel-side: V4L would provide an helper and
> >> drivers that need those lists would use it, others won't. This way we
> >> have no useless computation done, and userspace doesn't even have to
> >> bother checking the device caps to avoid this extra step.  
> > Oh yeah, that sounds much better. I suppose one notable differences to
> > other APIs is that they have to pass in buffers for all the frames in
> > the DPB, so they basically have to build the lists in userspace. Since
> > we'll end up looking up the frames in the kernel, it sounds reasonable
> > to also build the lists in the kernel.  
> 
> Userspace must already process the modification list or it wont have correct DPB for next frame.

Can you point us to the code or the section in the spec that
mentions/proves this dependency? I might have missed something, but my
understanding was that the slice ref lists (or the list of
modifications to apply to the list of long/short refs attached to a
frame) had no impact on the list of long/short refs attached to the
following frame.