Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support

Paul Kocialkowski <paul.kocialkowski@xxxxxxxxxxx> · Tue, 04 Jun 2019 11:28:28 +0200

Hi,

On Tue, 2019-06-04 at 09:15 +0000, Jonas Karlman wrote:
> On 2019-06-04 11:06, Thierry Reding wrote:
> > On Tue, Jun 04, 2019 at 10:49:21AM +0200, Boris Brezillon wrote:
> > > On Tue, 4 Jun 2019 10:31:57 +0200
> > > Thierry Reding <thierry.reding@xxxxxxxxx> wrote:
> > > 
> > > > > > > > - Using flags
> > > > > > > > 
> > > > > > > > The current MPEG-2 controls have lots of u8 values that can be
> > > > > > > > represented as flags. Using flags also helps with padding.
> > > > > > > > It's unlikely that we'll get more than 64 flags, so using a u64 by
> > > > > > > > default for that sounds fine (we definitely do want to keep some room
> > > > > > > > available and I don't think using 32 bits as a default is good enough).
> > > > > > > > 
> > > > > > > > I think H.264/HEVC per-control flags should also be moved to u64.    
> > > > > > > There was also some concensus on this, that u64 should be good enough
> > > > > > > for anything out there, though we obviously don't know what the future
> > > > > > > will hold, so perhaps adding some way for possible extending this in the
> > > > > > > future might be good. I guess we'll get new controls for new codecs
> > > > > > > anyway, so we can punt on this until then.
> > > > > > >     
> > > > > > > > - Clear split of controls and terminology
> > > > > > > > 
> > > > > > > > Some codecs have explicit NAL units that are good fits to match as
> > > > > > > > controls: e.g. slice header, pps, sps. I think we should stick to the
> > > > > > > > bitstream element names for those.
> > > > > > > > 
> > > > > > > > For H.264, that would suggest the following changes:
> > > > > > > > - renaming v4l2_ctrl_h264_decode_param to v4l2_ctrl_h264_slice_header;
> > > > > > > > - killing v4l2_ctrl_h264_decode_param and having the reference lists
> > > > > > > > where they belong, which seems to be slice_header;    
> > > > > > But now here it's being described per slice. When I look at the slice
> > > > > > header, I only see list of modifications and when I look at userspace,
> > > > > > That list is simply built from DPB, the modifications list found in the
> > > > > > slice header seems to be only used to craft the l0/l1 list.  
> > > > > Yes, I think there was a misunderstanding which was then clarified
> > > > > (unfortunately it happened on IRC, so we don't have a trace of this
> > > > > discussion). The reference list should definitely be per-frame, and the
> > > > > L0/L1 slice reflists are referring to the per-frame reference list (it's
> > > > > just a sub-set of the per-frame reflist re-ordered differently).
> > > > >   
> > > > > > There is one thing that come up though, if we enable per-frame decoding
> > > > > > on top of per-slice decoder (like Cedrus), won't we force userspace to
> > > > > > always compute l0/l1 even though the HW might be handling that ?  
> > > > > That's true, the question is, what's the cost of this extra re-ordering?  
> > > > I think ultimately userspace is already forced to compute these lists
> > > > even if some hardware may be able to do it in hardware. There's going to
> > > > be other hardware that userspace wants to support that can't do it by
> > > > itself, so userspace has to at least have the code anyway. What it could
> > > > do on top of that decide not to run that code if it somehow detects that
> > > > hardware can do it already. On the other hand this means that we have to
> > > > expose a whole lot of capabilities to userspace and userspace has to go
> > > > and detect all of them in order to parameterize all of the code.
> > > > 
> > > > Ultimately I suspect many applications will just choose to pass the data
> > > > all the time out of simplicity. I mean drivers that don't need it will
> > > > already ignore it (i.e. they must not break if they get the extra data)
> > > > so other than the potential runtime savings on some hardware, there are
> > > > no advantages.
> > > > 
> > > > Given that other APIs don't bother exposing this level of control to
> > > > applications makes me think that it's just not worth it from a
> > > > performance point of view.
> > > That's not exactly what Nicolas proposed. He was suggesting that we
> > > build those reflists kernel-side: V4L would provide an helper and
> > > drivers that need those lists would use it, others won't. This way we
> > > have no useless computation done, and userspace doesn't even have to
> > > bother checking the device caps to avoid this extra step.
> > Oh yeah, that sounds much better. I suppose one notable differences to
> > other APIs is that they have to pass in buffers for all the frames in
> > the DPB, so they basically have to build the lists in userspace. Since
> > we'll end up looking up the frames in the kernel, it sounds reasonable
> > to also build the lists in the kernel.
> 
> Userspace must already process the modification list or it wont have
> correct DPB for next frame.
> If you move this processing into kernel side you also introduce state
> into the stateless driver.

There is state in the form of the m2m context anyway, so I don't think
that's a concern in particular. We've been using the "stateless"
terminology all around, but it's really more about programming the
decoding registers versus passing raw bitstream through a mailbox
interface rather than the fine stateless/stateful distinction.

Cheers,

Paul

> Regards,
> Jonas
> > On that note, it would probably be useful to have some sort of helper
> > to get at all the buffers that make up the DPB in the kernel. That's got
> > to be something that everybody wants.
> > 
> > Thierry
-- 
Paul Kocialkowski, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com