On Tue, 21 May 2019 at 17:09 +0200, Thierry Reding wrote:
> On Tue, May 21, 2019 at 01:44:50PM +0200, Paul Kocialkowski wrote:
> > Hi,
> >
> > On Tue, 2019-05-21 at 19:27 +0900, Tomasz Figa wrote:
> > > On Thu, May 16, 2019 at 2:43 AM Paul Kocialkowski <paul.kocialkowski@xxxxxxxxxxx> wrote:
> > > > Hi,
> > > >
> > > > On Wed, 15 May 2019 at 10:42 -0400, Nicolas Dufresne wrote:
> > > > > On Wed, 15 May 2019 at 12:09 +0200, Paul Kocialkowski wrote:
> > > > > > Hi,
> > > > > >
> > > > > > With the Rockchip stateless VPU driver in the works, we now have a better idea of what the situation is like on platforms other than Allwinner. This email shares my conclusions about the situation and how we should update the MPEG-2, H.264 and H.265 controls accordingly.
> > > > > >
> > > > > > - Per-slice decoding
> > > > > >
> > > > > > We've discussed this one already[0] and Hans has submitted a patch[1] to implement the required core bits. When we agree it looks good, we should lift the restriction that all slices must be concatenated and have them submitted as individual requests.
> > > > > >
> > > > > > One question is what to do about other controls. I feel like it would make sense to always pass all the required controls for decoding the slice, including the ones that don't change across slices. But there may be no particular advantage to this and only downsides. Not doing it and relying on the "control cache" can work, but we need to specify that only a single stream can be decoded per opened instance of the v4l2 device. This is the assumption we're going with for handling multi-slice anyway, so it shouldn't be an issue.
> > > > >
> > > > > My opinion on this is that the m2m instance is a state, and the driver should be responsible for doing time-division multiplexing across multiple m2m instance jobs. Doing the time-division multiplexing in userspace would require some sort of daemon to work properly across processes. I also think the kernel is a better place for doing resource access scheduling in general.
> > > >
> > > > I agree with that, yes. We always have a single m2m context and specific controls per opened device, so keeping cached values works out well.
> > > >
> > > > So maybe we shall explicitly require that the request with the first slice for a frame also contains the per-frame controls.
> > >
> > > Agreed.
> > >
> > > One more argument not to allow such multiplexing is that despite the API being called "stateless", there is actually some state saved between frames, e.g. the Rockchip decoder writes some intermediate data to some local buffers which need to be given to the decoder to decode the next frame. Actually, on Rockchip there is even a requirement to keep the reference list entries in the same order between frames.
> >
> > Well, what I'm suggesting is to have one stream per m2m context, but it should certainly be possible to have multiple m2m contexts (multiple userspace open calls) that decode different streams concurrently.
> >
> > Is that really going to be a problem for Rockchip? If so, then the driver should probably enforce allowing a single userspace open and m2m context at a time.
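As an illustration of the "per-frame controls in the request that carries the first slice" idea above, a minimal userspace sketch could look roughly like the code below. This is only a sketch: the MPEG-2 control IDs and payload structs are the staging ones from the kernel's mpeg2-ctrls.h (which userspace currently has to carry a copy of) and may not match whatever uAPI gets finalized; it also assumes a multi-planar OUTPUT queue, and error handling, format negotiation and buffer allocation are left out.

/*
 * Sketch only: V4L2_CID_MPEG_VIDEO_MPEG2_SLICE_PARAMS and
 * V4L2_CID_MPEG_VIDEO_MPEG2_QUANTIZATION and their payload structs come
 * from the staging mpeg2-ctrls.h header, copied into the application.
 */
#include <string.h>
#include <sys/ioctl.h>
#include <linux/media.h>
#include <linux/videodev2.h>
#include "mpeg2-ctrls.h"

static int queue_first_slice(int media_fd, int video_fd,
			     unsigned int buf_index, unsigned int slice_size,
			     struct v4l2_ctrl_mpeg2_slice_params *slice_params,
			     struct v4l2_ctrl_mpeg2_quantization *quantization)
{
	struct v4l2_ext_control ctrls[2];
	struct v4l2_ext_controls ext;
	struct v4l2_plane plane;
	struct v4l2_buffer buf;
	int request_fd;

	memset(ctrls, 0, sizeof(ctrls));
	memset(&ext, 0, sizeof(ext));
	memset(&plane, 0, sizeof(plane));
	memset(&buf, 0, sizeof(buf));

	/* Allocate a request from the media device. */
	if (ioctl(media_fd, MEDIA_IOC_REQUEST_ALLOC, &request_fd) < 0)
		return -1;

	/* Attach the per-frame (and first-slice) controls to the request. */
	ctrls[0].id = V4L2_CID_MPEG_VIDEO_MPEG2_SLICE_PARAMS;
	ctrls[0].size = sizeof(*slice_params);
	ctrls[0].ptr = slice_params;
	ctrls[1].id = V4L2_CID_MPEG_VIDEO_MPEG2_QUANTIZATION;
	ctrls[1].size = sizeof(*quantization);
	ctrls[1].ptr = quantization;

	ext.which = V4L2_CTRL_WHICH_REQUEST_VAL;
	ext.request_fd = request_fd;
	ext.count = 2;
	ext.controls = ctrls;
	if (ioctl(video_fd, VIDIOC_S_EXT_CTRLS, &ext) < 0)
		return -1;

	/* Queue the OUTPUT buffer holding the first slice against the same request. */
	plane.bytesused = slice_size;
	buf.type = V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE;
	buf.memory = V4L2_MEMORY_MMAP;
	buf.index = buf_index;
	buf.length = 1;
	buf.m.planes = &plane;
	buf.flags = V4L2_BUF_FLAG_REQUEST_FD;
	buf.request_fd = request_fd;
	if (ioctl(video_fd, VIDIOC_QBUF, &buf) < 0)
		return -1;

	/* Submit the request: controls and buffer are picked up together. */
	if (ioctl(request_fd, MEDIA_REQUEST_IOC_QUEUE) < 0)
		return -1;

	return request_fd;
}

Subsequent slices of the same frame would then go in their own requests, carrying only the controls that actually change per slice (or none at all if we rely on the cached values).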
>
> If you have hardware storing data necessary to the decoding process in buffers local to the decoder, you'd have to have some sort of context switch operation that backs up the data in those buffers before you switch to a different context and restores those buffers when you switch back. We have similar hardware on Tegra, though I'm not exactly familiar with the details of what is saved and how essential it is. My understanding is that those internal buffers can be copied to external RAM or vice versa, but I suspect that this isn't going to be very efficient. It may very well be that restricting to a single userspace open is the most sensible option.

That would be by far the worst option for a browser use case, where an ad might have stolen the single instance you have available in HW. It's normal that context switching will have some impact on performance, but in general, most of the time, the other instances will be left idle by userspace. If there are no context switches, there should be no (or very little) overhead. Of course, this should not be a hard requirement to get a driver in the kernel, I'm not saying that.

p.s. For the IMX8M/Hantro G1, they specifically say that the single-core decoder can handle up to 8 1080p60 streams at the same time. But there are some buffers being written back by the IP for every slice (at the end of the decoded reference frames).

>
> Thierry
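To make the scheduling point concrete, here is a hypothetical driver-side sketch, with all names and sizes invented (it is not taken from any real driver): the idea is that the kernel only pays for saving and restoring the decoder-local buffers when the scheduled m2m context actually changes, so back-to-back jobs from the same instance, or other instances sitting idle, cost nothing extra.

#include <string.h>

#define AUX_SIZE 4096	/* invented size of the decoder-local buffers */

struct decoder_ctx {
	unsigned char aux_backup[AUX_SIZE];	/* per-context copy kept in system RAM */
};

struct decoder_dev {
	struct decoder_ctx *active;		/* context whose state currently sits in the IP */
	unsigned char aux_sram[AUX_SIZE];	/* stand-in for the IP-local buffers */
};

/* Called before running a job for ctx; a no-op when the same instance runs back to back. */
static void decoder_switch_context(struct decoder_dev *dev, struct decoder_ctx *ctx)
{
	if (dev->active == ctx)
		return;

	/* Back up the previously active context, then restore the new one. */
	if (dev->active)
		memcpy(dev->active->aux_backup, dev->aux_sram, AUX_SIZE);

	memcpy(dev->aux_sram, ctx->aux_backup, AUX_SIZE);
	dev->active = ctx;
}

Whether copying those buffers out to system RAM is cheap enough in practice is of course hardware-dependent, which is exactly the open question above.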