Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi,

Le mercredi 22 mai 2019 à 15:48 +0900, Tomasz Figa a écrit :
> On Sat, May 18, 2019 at 11:09 PM Nicolas Dufresne <nicolas@xxxxxxxxxxxx> wrote:
> > Le samedi 18 mai 2019 à 12:29 +0200, Paul Kocialkowski a écrit :
> > > Hi,
> > > 
> > > Le samedi 18 mai 2019 à 12:04 +0200, Jernej Škrabec a écrit :
> > > > Dne sobota, 18. maj 2019 ob 11:50:37 CEST je Paul Kocialkowski napisal(a):
> > > > > Hi,
> > > > > 
> > > > > On Fri, 2019-05-17 at 16:43 -0400, Nicolas Dufresne wrote:
> > > > > > Le jeudi 16 mai 2019 à 20:45 +0200, Paul Kocialkowski a écrit :
> > > > > > > Hi,
> > > > > > > 
> > > > > > > Le jeudi 16 mai 2019 à 14:24 -0400, Nicolas Dufresne a écrit :
> > > > > > > > Le mercredi 15 mai 2019 à 22:59 +0200, Paul Kocialkowski a écrit :
> > > > > > > > > Hi,
> > > > > > > > > 
> > > > > > > > > Le mercredi 15 mai 2019 à 14:54 -0400, Nicolas Dufresne a écrit :
> > > > > > > > > > Le mercredi 15 mai 2019 à 19:42 +0200, Paul Kocialkowski a écrit :
> > > > > > > > > > > Hi,
> > > > > > > > > > > 
> > > > > > > > > > > Le mercredi 15 mai 2019 à 10:42 -0400, Nicolas Dufresne a écrit
> > > > :
> > > > > > > > > > > > Le mercredi 15 mai 2019 à 12:09 +0200, Paul Kocialkowski a
> > > > écrit :
> > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > 
> > > > > > > > > > > > > With the Rockchip stateless VPU driver in the works, we now
> > > > > > > > > > > > > have a
> > > > > > > > > > > > > better idea of what the situation is like on platforms other
> > > > > > > > > > > > > than
> > > > > > > > > > > > > Allwinner. This email shares my conclusions about the
> > > > > > > > > > > > > situation and how
> > > > > > > > > > > > > we should update the MPEG-2, H.264 and H.265 controls
> > > > > > > > > > > > > accordingly.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > - Per-slice decoding
> > > > > > > > > > > > > 
> > > > > > > > > > > > > We've discussed this one already[0] and Hans has submitted a
> > > > > > > > > > > > > patch[1]
> > > > > > > > > > > > > to implement the required core bits. When we agree it looks
> > > > > > > > > > > > > good, we
> > > > > > > > > > > > > should lift the restriction that all slices must be
> > > > > > > > > > > > > concatenated and
> > > > > > > > > > > > > have them submitted as individual requests.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > One question is what to do about other controls. I feel like
> > > > > > > > > > > > > it would
> > > > > > > > > > > > > make sense to always pass all the required controls for
> > > > > > > > > > > > > decoding the
> > > > > > > > > > > > > slice, including the ones that don't change across slices.
> > > > > > > > > > > > > But there
> > > > > > > > > > > > > may be no particular advantage to this and only downsides.
> > > > > > > > > > > > > Not doing it
> > > > > > > > > > > > > and relying on the "control cache" can work, but we need to
> > > > > > > > > > > > > specify
> > > > > > > > > > > > > that only a single stream can be decoded per opened instance
> > > > > > > > > > > > > of the
> > > > > > > > > > > > > v4l2 device. This is the assumption we're going with for
> > > > > > > > > > > > > handling
> > > > > > > > > > > > > multi-slice anyway, so it shouldn't be an issue.
> > > > > > > > > > > > 
> > > > > > > > > > > > My opinion on this is that the m2m instance is a state, and
> > > > > > > > > > > > the driver
> > > > > > > > > > > > should be responsible of doing time-division multiplexing
> > > > > > > > > > > > across
> > > > > > > > > > > > multiple m2m instance jobs. Doing the time-division
> > > > > > > > > > > > multiplexing in
> > > > > > > > > > > > userspace would require some sort of daemon to work properly
> > > > > > > > > > > > across
> > > > > > > > > > > > processes. I also think the kernel is better place for doing
> > > > > > > > > > > > resource
> > > > > > > > > > > > access scheduling in general.
> > > > > > > > > > > 
> > > > > > > > > > > I agree with that yes. We always have a single m2m context and
> > > > > > > > > > > specific
> > > > > > > > > > > controls per opened device so keeping cached values works out
> > > > > > > > > > > well.
> > > > > > > > > > > 
> > > > > > > > > > > So maybe we shall explicitly require that the request with the
> > > > > > > > > > > first
> > > > > > > > > > > slice for a frame also contains the per-frame controls.
> > > > > > > > > > > 
> > > > > > > > > > > > > - Annex-B formats
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I don't think we have really reached a conclusion on the
> > > > > > > > > > > > > pixel formats
> > > > > > > > > > > > > we want to expose. The main issue is how to deal with codecs
> > > > > > > > > > > > > that need
> > > > > > > > > > > > > the full slice NALU with start code, where the slice_header
> > > > > > > > > > > > > is
> > > > > > > > > > > > > duplicated in raw bitstream, when others are fine with just
> > > > > > > > > > > > > the encoded
> > > > > > > > > > > > > slice data and the parsed slice header control.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > My initial thinking was that we'd need 3 formats:
> > > > > > > > > > > > > - One that only takes only the slice compressed data
> > > > > > > > > > > > > (without raw slice
> > > > > > > > > > > > > header and start code);
> > > > > > > > > > > > > - One that takes both the NALU data (including start code,
> > > > > > > > > > > > > raw header
> > > > > > > > > > > > > and compressed data) and slice header controls;
> > > > > > > > > > > > > - One that takes the NALU data but no slice header.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > But I no longer think the latter really makes sense in the
> > > > > > > > > > > > > context of
> > > > > > > > > > > > > stateless video decoding.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > A side-note: I think we should definitely have data offsets
> > > > > > > > > > > > > in every
> > > > > > > > > > > > > case, so that implementations can just push the whole NALU
> > > > > > > > > > > > > regardless
> > > > > > > > > > > > > of the format if they're lazy.
> > > > > > > > > > > > 
> > > > > > > > > > > > I realize that I didn't share our latest research on the
> > > > > > > > > > > > subject. So a
> > > > > > > > > > > > slice in the original bitstream is formed of the following
> > > > > > > > > > > > blocks
> > > > > > > > > > > > 
> > > > > > > > > > > > (simplified):
> > > > > > > > > > > >   [nal_header][nal_type][slice_header][slice]
> > > > > > > > > > > 
> > > > > > > > > > > Thanks for the details!
> > > > > > > > > > > 
> > > > > > > > > > > > nal_header:
> > > > > > > > > > > > This one is a header used to locate the start and the end of
> > > > > > > > > > > > the of a
> > > > > > > > > > > > NAL. There is two standard forms, the ANNEX B / start code, a
> > > > > > > > > > > > sequence
> > > > > > > > > > > > of 3 bytes 0x00 0x00 0x01, you'll often see 4 bytes, the first
> > > > > > > > > > > > byte
> > > > > > > > > > > > would be a leading 0 from the previous NAL padding, but this
> > > > > > > > > > > > is also
> > > > > > > > > > > > totally valid start code. The second form is the AVC form,
> > > > > > > > > > > > notably used
> > > > > > > > > > > > in ISOMP4 container. It simply is the size of the NAL. You
> > > > > > > > > > > > must keep
> > > > > > > > > > > > your buffer aligned to NALs in this case as you cannot scan
> > > > > > > > > > > > from random
> > > > > > > > > > > > location.
> > > > > > > > > > > > 
> > > > > > > > > > > > nal_type:
> > > > > > > > > > > > It's a bit more then just the type, but it contains at least
> > > > > > > > > > > > the
> > > > > > > > > > > > information of the nal type. This has different size on H.264
> > > > > > > > > > > > and HEVC
> > > > > > > > > > > > but I know it's size is in bytes.
> > > > > > > > > > > > 
> > > > > > > > > > > > slice_header:
> > > > > > > > > > > > This contains per slice parameters, like the modification
> > > > > > > > > > > > lists to
> > > > > > > > > > > > apply on the references. This one has a size in bits, not in
> > > > > > > > > > > > bytes.
> > > > > > > > > > > > 
> > > > > > > > > > > > slice:
> > > > > > > > > > > > I don't really know what is in it exactly, but this is the
> > > > > > > > > > > > data used to
> > > > > > > > > > > > decode. This bit has a special coding called the
> > > > > > > > > > > > anti-emulation, which
> > > > > > > > > > > > prevents a start-code from appearing in it. This coding is
> > > > > > > > > > > > present in
> > > > > > > > > > > > both forms, ANNEX-B or AVC (in GStreamer and some reference
> > > > > > > > > > > > manual they
> > > > > > > > > > > > call ANNEX-B the bytestream format).
> > > > > > > > > > > > 
> > > > > > > > > > > > So, what we notice is that what is currently passed through
> > > > > > > > > > > > Cedrus
> > > > > > > > > > > > 
> > > > > > > > > > > > driver:
> > > > > > > > > > > >   [nal_type][slice_header][slice]
> > > > > > > > > > > > 
> > > > > > > > > > > > This matches what is being passed through VA-API. We can
> > > > > > > > > > > > understand
> > > > > > > > > > > > that stripping off the slice_header would be hard, since it's
> > > > > > > > > > > > size is
> > > > > > > > > > > > in bits. Instead we pass size and header_bit_size in
> > > > > > > > > > > > slice_params.
> > > > > > > > > > > 
> > > > > > > > > > > True, there is that.
> > > > > > > > > > > 
> > > > > > > > > > > > About Rockchip. RK3288 is a Hantro G1 and has a bit called
> > > > > > > > > > > > start_code_e, when you turn this off, you don't need start
> > > > > > > > > > > > code. As a
> > > > > > > > > > > > side effect, the bitstream becomes identical. We do now know
> > > > > > > > > > > > that it
> > > > > > > > > > > > works with the ffmpeg branch implement for cedrus.
> > > > > > > > > > > 
> > > > > > > > > > > Oh great, that makes life easier in the short term, but I guess
> > > > > > > > > > > the
> > > > > > > > > > > issue could arise on another decoder sooner or later.
> > > > > > > > > > > 
> > > > > > > > > > > > Now what's special about Hantro G1 (also found on IMX8M) is
> > > > > > > > > > > > that it
> > > > > > > > > > > > take care for us of reading and executing the modification
> > > > > > > > > > > > lists found
> > > > > > > > > > > > in the slice header. Mostly because I very disliked having to
> > > > > > > > > > > > pass the
> > > > > > > > > > > > p/b0/b1 parameters, is that Boris implemented in the driver
> > > > > > > > > > > > the
> > > > > > > > > > > > transformation from the DPB entries into this p/b0/b1 list.
> > > > > > > > > > > > These list
> > > > > > > > > > > > a standard, it's basically implementing 8.2.4.1 and 8.2.4.2.
> > > > > > > > > > > > the
> > > > > > > > > > > > following section is the execution of the modification list.
> > > > > > > > > > > > As this
> > > > > > > > > > > > list is not modified, it only need to be calculated per frame.
> > > > > > > > > > > > As a
> > > > > > > > > > > > result, we don't need these new lists, and we can work with
> > > > > > > > > > > > the same
> > > > > > > > > > > > H264_SLICE format as Cedrus is using.
> > > > > > > > > > > 
> > > > > > > > > > > Yes but I definitely think it makes more sense to pass the list
> > > > > > > > > > > modifications rather than reconstructing those in the driver
> > > > > > > > > > > from a
> > > > > > > > > > > full list. IMO controls should stick to the bitstream as close
> > > > > > > > > > > as
> > > > > > > > > > > possible.
> > > > > > > > > > 
> > > > > > > > > > For Hantro and RKVDEC, the list of modification is parsed by the
> > > > > > > > > > IP
> > > > > > > > > > from the slice header bits. Just to make sure, because I myself
> > > > > > > > > > was
> > > > > > > > > > confused on this before, the slice header does not contain a list
> > > > > > > > > > of
> > > > > > > > > > references, instead it contains a list modification to be applied
> > > > > > > > > > to
> > > > > > > > > > the reference list. I need to check again, but to execute these
> > > > > > > > > > modification, you need to filter and sort the references in a
> > > > > > > > > > specific
> > > > > > > > > > order. This should be what is defined in the spec as 8.2.4.1 and
> > > > > > > > > > 8.2.4.2. Then 8.2.4.3 is the process that creates the l0/l1.
> > > > > > > > > > 
> > > > > > > > > > The list of references is deduced from the DPB. The DPB, which I
> > > > > > > > > > thinks
> > > > > > > > > > should be rename as "references", seems more useful then p/b0/b1,
> > > > > > > > > > since
> > > > > > > > > > this is the data that gives use the ability to implementing glue
> > > > > > > > > > in the
> > > > > > > > > > driver to compensate some HW differences.
> > > > > > > > > > 
> > > > > > > > > > In the case of Hantro / RKVDEC, we think it's natural to build the
> > > > > > > > > > HW
> > > > > > > > > > specific lists (p/b0/b1) from the references rather then adding HW
> > > > > > > > > > specific list in the decode_params structure. The fact these lists
> > > > > > > > > > are
> > > > > > > > > > standard intermediate step of the standard is not that important.
> > > > > > > > > 
> > > > > > > > > Sorry I got confused (once more) about it. Boris just explained the
> > > > > > > > > same thing to me over IRC :) Anyway my point is that we want to pass
> > > > > > > > > what's in ffmpeg's short and long term ref lists, and name them that
> > > > > > > > > instead of dpb.
> > > > > > > > > 
> > > > > > > > > > > > Now, this is just a start. For RK3399, we have a different
> > > > > > > > > > > > CODEC
> > > > > > > > > > > > design. This one does not have the start_code_e bit. What the
> > > > > > > > > > > > IP does,
> > > > > > > > > > > > is that you give it one or more slice per buffer, setup the
> > > > > > > > > > > > params,
> > > > > > > > > > > > start decoding, but the decoder then return the location of
> > > > > > > > > > > > the
> > > > > > > > > > > > following NAL. So basically you could offload the scanning of
> > > > > > > > > > > > start
> > > > > > > > > > > > code to the HW. That being said, with the driver layer in
> > > > > > > > > > > > between, that
> > > > > > > > > > > > would be amazingly inconvenient to use, and with Boyer-more
> > > > > > > > > > > > algorithm,
> > > > > > > > > > > > it is pretty cheap to scan this type of start-code on CPU. But
> > > > > > > > > > > > the
> > > > > > > > > > > > feature that this allows is to operate in frame mode. In this
> > > > > > > > > > > > mode, you
> > > > > > > > > > > > have 1 interrupt per frame.
> > > > > > > > > > > 
> > > > > > > > > > > I'm not sure there is any interest in exposing that from
> > > > > > > > > > > userspace and
> > > > > > > > > > > my current feeling is that we should just ditch support for
> > > > > > > > > > > per-frame
> > > > > > > > > > > decoding altogether. I think it mixes decoding with notions that
> > > > > > > > > > > are
> > > > > > > > > > > higher-level than decoding, but I agree it's a blurry line.
> > > > > > > > > > 
> > > > > > > > > > I'm not worried about this either. We can already support that by
> > > > > > > > > > copying the bitstream internally to the driver, though zero-copy
> > > > > > > > > > with
> > > > > > > > > > this would require a new format, the one we talked about,
> > > > > > > > > > SLICE_ANNEX_B.
> > > > > > > > > 
> > > > > > > > > Right, but what I'm thinking about is making that the one and only
> > > > > > > > > format. The rationale is that it's always easier to just append a
> > > > > > > > > start
> > > > > > > > > code from userspace if needed. And we need a bit offset to the slice
> > > > > > > > > data part anyway, so it doesn't hurt to require a few extra bits to
> > > > > > > > > have the whole thing that will work in every situation.
> > > > > > > > 
> > > > > > > > What I'd like is to eventually allow zero-copy (aka userptr) into the
> > > > > > > > driver. If you make the start code mandatory, any decoding from ISOMP4
> > > > > > > > (.mp4, .mov) will require a full bitstream copy in userspace to add
> > > > > > > > the
> > > > > > > > start code (unless you hack your allocation in your demuxer, but it's
> > > > > > > > a
> > > > > > > > bit complicated since this code might come from two libraries). In
> > > > > > > > ISOMP4, you have an AVC header, which is just the size of the NAL that
> > > > > > > > follows.
> > > > > > > 
> > > > > > > Well, I think we have to do a copy from system memory to the buffer
> > > > > > > allocated by v4l2 anyway. Our hardware pipelines can reasonably be
> > > > > > > expected not to have any MMU unit and not allow sg import anyway.
> > > > > > 
> > > > > > The Rockchip has an mmu. You need one copy at least indeed,
> > > > > 
> > > > > Is the MMU in use currently? That can make things troublesome if we run
> > > > > into a case where the VPU has MMU and deals with scatter-gather while
> > > > > the display part doesn't. As far as I know, there's no way for
> > > > > userspace to know whether a dma-buf-exported buffer is backed by CMA or
> > > > > by scatter-gather memory. This feels like a major issue for using dma-
> > > > > buf, since userspace can't predict whether a buffer exported on one
> > > > > device can be imported on another when building its pipeline.
> > > > 
> > > > FYI, Allwinner H6 also has IOMMU, it's just that there is no mainline driver
> > > > for it yet. It is supported for display, both VPUs and some other devices. I
> > > > think no sane SoC designer would left out one or another unit without IOMMU
> > > > support, that just calls for troubles, as you pointed out.
> > > 
> > > Right right, I've been following that from a distance :)
> > > 
> > > Indeed I think it's realistic to expect that for now, but it may not
> > > play out so well in the long term. For instance, maybe connecting a USB
> > > display would require CMA when the rest of the system can do with sg.
> > > 
> > > I think it would really be useful for userspace to have a way to test
> > > whether a buffer can be imported from one device to another. It feels
> > > better than indicating where the memory lives, since there are
> > > countless cases where additional restrictions apply too.
> > 
> > I don't know for the integration on the Rockchip, but I did notice the
> > register documentation for it.
> 
> All the important components in the SoC have their IOMMUs as well -
> display controller, GPU.
> 
> There is a blitter called RGA that is not behind an IOMMU, but has
> some scatter-gather capability (with a need for the hardware sg table
> to be physically contiguous). 

That's definitely good to know and justfies the need to introduce a way
for userspace to check if a buffer can be imported from one device to
another.

> That said, significance of such blitters
> nowadays is rather low, as most of the time you need a compositor on
> the GPU anyway, which can do any transformation in the same pass as
> the composition.

I think that is a crucial mistake and the way I see things, this will
have to change eventually. We cannot keep under-using the fast and
efficient hardware components and going with the war machine that is
the GPU in all situations. This has caused enough trouble in the
GNU/Linux userspace display stack already and I strongly believe it has
to stop.

> > In general, the most significant gain
> > with having iommu for CODECs is that it makes start up (and re-init)
> > time much shorter, but also in a much more predictable duration. I do
> > believe that the Venus driver (qualcomm) is one with solid support for
> > this, and it's quite noticably more snappy then the others.
> 
> Obviously you also get support for USERPTR if you have an IOMMU, but
> that also has some costs - you need to pin the user pages and map to
> the IOMMU before each frame and unmap and unpin after each frame,
> which sometimes is more costly than actually having the userspace copy
> to a preallocated and premapped buffer, especially for relatively
> small contents, such as compressed bitstream.

Heh, interesting point!

Cheers,

Paul

> Best regards,
> Tomasz
> 
> > We also faced an interesting issue recently on IMX.6 (there is just no
> > mmu there). We where playing a stream from the camera, and the
> > framerate would drastically drop as soon as you plug a USB camera (and
> > it would drop for quite a while). We found out that Etnaviv is doing
> > cma allocation per frame, hopefully this won't happen under V4L2
> > queues. But on this platform, starting a new stream while pluggin a USB
> > key could take several seconds to start.
> > 
> > About the RK3399, work will continue in the next couple of weeks, and
> > when this is done, we should have a much wider view of this subject.
> > Hopefully what we learned about H.264 will be useful for HEVC and
> > eventually AV1, which in term of bitstream uses similar stream formats
> > method. AV1 is by far the most complicated CODEC I have read about.
> > 
> > > Cheers,
> > > 
> > > Paul
> > > 
> > > > Best regards,
> > > > Jernej
> > > > 
> > > > > > e.g. file
> > > > > > to mem, or udpsocket to mem. But right now, let's say with ffmpeg/mpeg-
> > > > > > ts, first you need to copy the MPEG TS to mem, then to demux you copy
> > > > > > that H264 stream to another buffer, you then copy in the parser,
> > > > > > removing the start-code and finally copy in the accelerator, adding the
> > > > > > start code. If the driver would allow userptr, it would be unusable.
> > > > > > 
> > > > > > GStreamer on the other side implement lazy conversion, so it would copy
> > > > > > the mpegts to mem, copy to demux, aggregate (with lazy merging) in the
> > > > > > parser (but stream format is negotiation, so it keeps the start-code).
> > > > > > If you request alignment=au, you have full frame of buffers, so if your
> > > > > > driver could do userptr, you can same that extra copy.
> > > > > > 
> > > > > > Now, if we demux an MP4 it's the same, the parser will need do a full
> > > > > > copy instead of lazy aggregation in order to prepend the start code
> > > > > > (since it had an AVC header). But userptr could save a copy.
> > > > > > 
> > > > > > If the driver requires no nal prefix, then we could just pass a
> > > > > > slightly forward point to userptr and avoid ACV to ANNEX-B conversion,
> > > > > > which is a bit slower (even know it's nothing compare to the full
> > > > > > copies we already do.
> > > > > > 
> > > > > > That was my argument in favour for no NAL prefix in term of efficiency,
> > > > > > and it does not prevent adding a control to enable start-code for cases
> > > > > > it make sense.
> > > > > 
> > > > > I see, so the internal arcitecture of userspace software may not be a
> > > > > good fit for adding these bits and it could hurt performance a bit.
> > > > > That feels like a significant downside.
> > > > > 
> > > > > > > So with that in mind, asking userspace to add a startcode it already
> > > > > > > knows doesn't seem to be asking too much.
> > > > > > > 
> > > > > > > > On the other end, the data_offset thing is likely just a thing for the
> > > > > > > > RK3399 to handle, it does not affect RK3288, Cedrus or IMX8M.
> > > > > > > 
> > > > > > > Well, I think it's best to be fool-proof here and just require that
> > > > > > > start code. We should also have per-slice bit offsets to the different
> > > > > > > parts anyway, so drivers that don't need it can just ignore it.
> > > > > > > 
> > > > > > > In extreme cases where there is some interest in doing direct buffer
> > > > > > > import without doing a copy in userspace, userspace could trick the
> > > > > > > format and avoid a copy by not providing the start-code (assuming it
> > > > > > > knows it doesn't need it) and specifying the bit offsets accordingly.
> > > > > > > That'd be a hack for better performance, and it feels better to do
> > > > > > > things in this order rather than having to hack around in the drivers
> > > > > > > that need the start code in every other case.
> > > > > > 
> > > > > > So basically, you and Tomas are both strongly in favour of adding
> > > > > > ANNEX-B start-code to the current uAPI. I have digged into Cedrus
> > > > > > registers, and it seems that it does have start-code scanning support.
> > > > > > I'm not sure it can do "full-frame" decoding, 1 interrupt per frame
> > > > > > like the RK do. That requires the IP to deal with the modifications
> > > > > > lists, which are per slices.
> > > > > 
> > > > > Actually the bitstream parser won't reconfigure the pipeline
> > > > > configuration registers, it's only around for userspace to avoid
> > > > > implementing bitstream parsing, but it's a standalone thing.
> > > > > 
> > > > > So if we want to do full-frame decoding we always need to reconfigure
> > > > > our pipeline (or do it like we do currently and just use one of the
> > > > > per-slice configuration and hope for the best).
> > > > > 
> > > > > Do we have more information on the RK3399 and what it requires exactly?
> > > > > (Just to make sure it's not another issue altogether.)
> > > > > 
> > > > > > My question is, are you willing to adapt the Cedrus driver to support
> > > > > > receiving start-code ? And will this have a performance impact or not ?
> > > > > > On RK side, it's really just about flipping 1 bit.
> > > > > > 
> > > > > > On the Rockchip side, Tomas had concern about CPU wakeup and the fact
> > > > > > that we didn't aim at supporting passing multiple slices at once to the
> > > > > > IP (something RK supports). It's important to understand that multi-
> > > > > > slice streams are relatively rare and mostly used for low-latency /
> > > > > > video conferencing. So aggregating in these case defeats the purpose of
> > > > > > using slices. So I think RK feature is not very important.
> > > > > 
> > > > > Agreed, let's aim for low-latency as a standard.
> > > > > 
> > > > > > Of course, I do believe that long term we will want to expose bot
> > > > > > stream formats on RK (because the HW can do that), so then userspace
> > > > > > can just pick the best when available. So that boils down to our first
> > > > > > idea, shall we expose _SLICE_A and _SLICE_B or something like this ?
> > > > > > Now that we have progressed on the matter, I'm quite in favour of
> > > > > > having _SLICE in the first place, with the preferred format that
> > > > > > everyone should support, and allow for variants later. Now, if we make
> > > > > > one mandatory, we could also just have a menu control to allow other
> > > > > > formats.
> > > > > 
> > > > > That seems fairly reasonable to me, and indeed, having one preferred
> > > > > format at first seems to be a good move.
> > > > > 
> > > > > > > > > To me the breaking point was about having the slice header both in
> > > > > > > > > raw
> > > > > > > > > bitstream and parsed forms. Since we agree that's fine, we might as
> > > > > > > > > well push it to its logical conclusion and include all the bits that
> > > > > > > > > can be useful.
> > > > > > > > 
> > > > > > > > To take your words, the bits that contain useful information starts
> > > > > > > > from the NAL type byte, exactly were the data was cut by VA-API and
> > > > > > > > the
> > > > > > > > current uAPI.
> > > > > > > 
> > > > > > > Agreed, but I think that the advantages of always requiring the start
> > > > > > > code outweigh the potential (yet quite unlikely) downsides.
> > > > > > > 
> > > > > > > > > > > > But it also support slice mode, with an
> > > > > > > > > > > > interrupt per slice, which is what we decided to use.
> > > > > > > > > > > 
> > > > > > > > > > > Easier for everyone and probably better for latency as well :)
> > > > > > > > > > > 
> > > > > > > > > > > > So in this case, indeed we strictly require on start-code.
> > > > > > > > > > > > Though, to
> > > > > > > > > > > > me this is not a great reason to make a new fourcc, so we will
> > > > > > > > > > > > try and
> > > > > > > > > > > > use (data_offset = 3) in order to make some space for that
> > > > > > > > > > > > start code,
> > > > > > > > > > > > and write it down in the driver. This is to be continued, we
> > > > > > > > > > > > will
> > > > > > > > > > > > report back on this later. This could have some side effect in
> > > > > > > > > > > > the
> > > > > > > > > > > > ability to import buffers. But most userspace don't try to do
> > > > > > > > > > > > zero-copy
> > > > > > > > > > > > on the encoded size and just copy anyway.
> > > > > > > > > > > > 
> > > > > > > > > > > > To my opinion, having a single format is a big deal, since
> > > > > > > > > > > > userspace
> > > > > > > > > > > > will generally be developed for one specific HW and we would
> > > > > > > > > > > > endup with
> > > > > > > > > > > > fragmented support. What we really want to achieve is having a
> > > > > > > > > > > > driver
> > > > > > > > > > > > interface which works across multiple HW, and I think this is
> > > > > > > > > > > > quite
> > > > > > > > > > > > possible.
> > > > > > > > > > > 
> > > > > > > > > > > I agree with that. The more I think about it, the more I believe
> > > > > > > > > > > we
> > > > > > > > > > > should just pass the whole
> > > > > > > > > > > [nal_header][nal_type][slice_header][slice]
> > > > > > > > > > > and the parsed list in every scenario.
> > > > > > > > > > 
> > > > > > > > > > What I like of the cut at nal_type, is that there is only format.
> > > > > > > > > > If we
> > > > > > > > > > cut at nal_header, then we need to expose 2 formats. And it makes
> > > > > > > > > > our
> > > > > > > > > > API similar to other accelerator API, so it's easy to "convert"
> > > > > > > > > > existing userspace.
> > > > > > > > > 
> > > > > > > > > Unless we make that cut the single one and only true cut that shall
> > > > > > > > > supersed all other cuts :)
> > > > > > > > 
> > > > > > > > That's basically what I've been trying to do, kill this _RAW/ANNEX_B
> > > > > > > > thing and go back to our first idea.
> > > > > > > 
> > > > > > > Right, in the end I think we should go with:
> > > > > > > V4L2_PIX_FMT_MPEG2_SLICE
> > > > > > > V4L2_PIX_FMT_H264_SLICE
> > > > > > > V4L2_PIX_FMT_HEVC_SLICE
> > > > > > > 
> > > > > > > And just require raw bitstream for the slice with emulation-prevention
> > > > > > > bits included.
> > > > > > 
> > > > > > That's should be the set of format we start with indeed. The single
> > > > > > format for which software gets written and tested, making sure software
> > > > > > support is not fragmented, and other variants should be something to
> > > > > > opt-in.
> > > > > 
> > > > > Cheers for that!
> > > > > 
> > > > > Paul
> > > > > 
> > > > > > > Cheers,
> > > > > > > 
> > > > > > > Paul
> > > > > > > 
> > > > > > > > > > > For H.265, our decoder needs some information from the NAL type
> > > > > > > > > > > too.
> > > > > > > > > > > We currently extract that in userspace and stick it to the
> > > > > > > > > > > slice_header, but maybe it would make more sense to have drivers
> > > > > > > > > > > parse
> > > > > > > > > > > that info from the buffer if they need it. On the other hand, it
> > > > > > > > > > > seems
> > > > > > > > > > > quite common to pass information from the NAL type, so maybe we
> > > > > > > > > > > should
> > > > > > > > > > > either make a new control for it or have all the fields in the
> > > > > > > > > > > slice_header (which would still be wrong in terms of matching
> > > > > > > > > > > bitstream
> > > > > > > > > > > description).
> > > > > > > > > > 
> > > > > > > > > > Even in userspace, it's common to just parse this in place, it's a
> > > > > > > > > > simple mask. But yes, if we don't have it yet, we should expose
> > > > > > > > > > the NAL
> > > > > > > > > > type, it would be cleaner.
> > > > > > > > > 
> > > > > > > > > Right, works for me.
> > > > > > > > 
> > > > > > > > Ack.
> > > > > > > > 
> > > > > > > > > Cheers,
> > > > > > > > > 
> > > > > > > > > Paul
> > > > > > > > > 
> > > > > > > > > > > > > - Dropping the DPB concept in H.264/H.265
> > > > > > > > > > > > > 
> > > > > > > > > > > > > As far as I could understand, the decoded picture buffer
> > > > > > > > > > > > > (DPB) is a
> > > > > > > > > > > > > concept that only makes sense relative to a decoder
> > > > > > > > > > > > > implementation. The
> > > > > > > > > > > > > spec mentions how to manage it with the Hypothetical
> > > > > > > > > > > > > reference decoder
> > > > > > > > > > > > > (Annex C), but that's about it.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > What's really in the bitstream is the list of modified
> > > > > > > > > > > > > short-term and
> > > > > > > > > > > > > long-term references, which is enough for every decoder.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > For this reason, I strongly believe we should stop talking
> > > > > > > > > > > > > about DPB in
> > > > > > > > > > > > > the controls and just pass these lists agremented with
> > > > > > > > > > > > > relevant
> > > > > > > > > > > > > information for userspace.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I think it should be up to the driver to maintain a DPB and
> > > > > > > > > > > > > we could
> > > > > > > > > > > > > have helpers for common cases. For instance, the rockchip
> > > > > > > > > > > > > decoder needs
> > > > > > > > > > > > > to keep unused entries around[2] and cedrus has the same
> > > > > > > > > > > > > requirement
> > > > > > > > > > > > > for H.264. However for cedrus/H.265, we don't need to do any
> > > > > > > > > > > > > book-
> > > > > > > > > > > > > keeping in particular and can manage with the lists from the
> > > > > > > > > > > > > bitstream
> > > > > > > > > > > > > directly.
> > > > > > > > > > > > 
> > > > > > > > > > > > As discusses today, we still need to pass that list. It's
> > > > > > > > > > > > being index
> > > > > > > > > > > > by the HW to retrieve the extra information we have collected
> > > > > > > > > > > > about the
> > > > > > > > > > > > status of the reference frames. In the case of Hantro, which
> > > > > > > > > > > > process
> > > > > > > > > > > > the modification list from the slice header for us, we also
> > > > > > > > > > > > need that
> > > > > > > > > > > > list to construct the unmodified list.
> > > > > > > > > > > > 
> > > > > > > > > > > > So the problem here is just a naming problem. That list is not
> > > > > > > > > > > > really a
> > > > > > > > > > > > DPB. It is just the list of long-term/short-term references
> > > > > > > > > > > > with the
> > > > > > > > > > > > status of these references. So maybe we could just rename as
> > > > > > > > > > > > references/reference_entry ?
> > > > > > > > > > > 
> > > > > > > > > > > What I'd like to pass is the diff to the references list, as
> > > > > > > > > > > ffmpeg
> > > > > > > > > > > currently provides for v4l2 request and vaapi (probably vdpau
> > > > > > > > > > > too). No
> > > > > > > > > > > functional change here, only that we should stop calling it a
> > > > > > > > > > > DPB,
> > > > > > > > > > > which confuses everyone.
> > > > > > > > > > 
> > > > > > > > > > Yes.
> > > > > > > > > > 
> > > > > > > > > > > > > - Using flags
> > > > > > > > > > > > > 
> > > > > > > > > > > > > The current MPEG-2 controls have lots of u8 values that can
> > > > > > > > > > > > > be
> > > > > > > > > > > > > represented as flags. Using flags also helps with padding.
> > > > > > > > > > > > > It's unlikely that we'll get more than 64 flags, so using a
> > > > > > > > > > > > > u64 by
> > > > > > > > > > > > > default for that sounds fine (we definitely do want to keep
> > > > > > > > > > > > > some room
> > > > > > > > > > > > > available and I don't think using 32 bits as a default is
> > > > > > > > > > > > > good enough).
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I think H.264/HEVC per-control flags should also be moved to
> > > > > > > > > > > > > u64.
> > > > > > > > > > > > 
> > > > > > > > > > > > Make sense, I guess bits (member : 1) are not allowed in uAPI
> > > > > > > > > > > > right ?
> > > > > > > > > > > 
> > > > > > > > > > > Mhh, even if they are, it makes it much harder to verify 32/64
> > > > > > > > > > > bit
> > > > > > > > > > > alignment constraints (we're dealing with 64-bit platforms that
> > > > > > > > > > > need to
> > > > > > > > > > > have 32-bit userspace and compat_ioctl).
> > > > > > > > > > 
> > > > > > > > > > I see, thanks.
> > > > > > > > > > 
> > > > > > > > > > > > > - Clear split of controls and terminology
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Some codecs have explicit NAL units that are good fits to
> > > > > > > > > > > > > match as
> > > > > > > > > > > > > controls: e.g. slice header, pps, sps. I think we should
> > > > > > > > > > > > > stick to the
> > > > > > > > > > > > > bitstream element names for those.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > For H.264, that would suggest the following changes:
> > > > > > > > > > > > > - renaming v4l2_ctrl_h264_decode_param to
> > > > > > > > > > > > > v4l2_ctrl_h264_slice_header;
> > > > > > > > > > > > 
> > > > > > > > > > > > Oops, I think you meant slice_prams ? decode_params matches
> > > > > > > > > > > > the
> > > > > > > > > > > > information found in SPS/PPS (combined?), while slice_params
> > > > > > > > > > > > matches
> > > > > > > > > > > > the information extracted (and executed in case of l0/l1) from
> > > > > > > > > > > > the
> > > > > > > > > > > > slice headers.
> > > > > > > > > > > 
> > > > > > > > > > > Yes you're right, I mixed them up.
> > > > > > > > > > > 
> > > > > > > > > > > >  That being said, to me this name wasn't confusing, since
> > > > > > > > > > > > 
> > > > > > > > > > > > it's not just the slice header, and it's per slice.
> > > > > > > > > > > 
> > > > > > > > > > > Mhh, what exactly remains in there and where does it originate
> > > > > > > > > > > in the
> > > > > > > > > > > bitstream? Maybe it wouldn't be too bad to have one control per
> > > > > > > > > > > actual
> > > > > > > > > > > group of bitstream elements.
> > > > > > > > > > > 
> > > > > > > > > > > > > - killing v4l2_ctrl_h264_decode_param and having the
> > > > > > > > > > > > > reference lists
> > > > > > > > > > > > > where they belong, which seems to be slice_header;
> > > > > > > > > > > > 
> > > > > > > > > > > > There reference list is only updated by userspace (through
> > > > > > > > > > > > it's DPB)
> > > > > > > > > > > > base on the result of the last decoding step. I was very
> > > > > > > > > > > > confused for a
> > > > > > > > > > > > moment until I realize that the lists in the slice_header are
> > > > > > > > > > > > just a
> > > > > > > > > > > > list of modification to apply to the reference list in order
> > > > > > > > > > > > to produce
> > > > > > > > > > > > l0 and l1.
> > > > > > > > > > > 
> > > > > > > > > > > Indeed, and I'm suggesting that we pass the modifications only,
> > > > > > > > > > > which
> > > > > > > > > > > would fit a slice_header control.
> > > > > > > > > > 
> > > > > > > > > > I think I made my point why we want the dpb -> references. I'm
> > > > > > > > > > going to
> > > > > > > > > > validate with the VA driver now, to see if the references list
> > > > > > > > > > there is
> > > > > > > > > > usable with our code.
> > > > > > > > > > 
> > > > > > > > > > > Cheers,
> > > > > > > > > > > 
> > > > > > > > > > > Paul
> > > > > > > > > > > 
> > > > > > > > > > > > > I'm up for preparing and submitting these control changes
> > > > > > > > > > > > > and updating
> > > > > > > > > > > > > cedrus if they seem agreeable.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > What do you think?
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Paul
> > > > > > > > > > > > > 
> > > > > > > > > > > > > [0]: https://lkml.org/lkml/2019/3/6/82
> > > > > > > > > > > > > [1]: https://patchwork.linuxtv.org/patch/55947/
> > > > > > > > > > > > > [2]:
> > > > > > > > > > > > > https://chromium.googlesource.com/chromiumos/third_party/ke
> > > > > > > > > > > > > rnel/+/4d7cb46539a93bb6acc802f5a46acddb5aaab378




[Index of Archives]     [Linux Input]     [Video for Linux]     [Gstreamer Embedded]     [Mplayer Users]     [Linux USB Devel]     [Linux Audio Users]     [Linux Kernel]     [Linux SCSI]     [Yosemite Backpacking]

  Powered by Linux