Re: [RFC] Stateful codecs and requirements for compressed formats

Nicolas Dufresne <nicolas@xxxxxxxxxxxx> · Fri, 28 Jun 2019 12:18:33 -0400

On Friday, 28 June 2019 at 16:34 +0200, Hans Verkuil wrote:
> Hi all,
> 
> I hope I Cc-ed everyone with a stake in this issue.
> 
> One recurring question is how a stateful encoder fills buffers and how a stateful
> decoder consumes buffers.
> 
> The most generic case is that an encoder produces a bitstream and just fills each
> CAPTURE buffer to the brim before continuing with the next buffer.
> 
> I don't think there are drivers that do this, I believe that all drivers just
> output a single compressed frame. For interlaced formats I understand it is either
> one compressed field per buffer, or two compressed fields per buffer (this is
> what I heard, I don't know if this is true).
> 
> In any case, I don't think this is specified anywhere. Please correct me if I am
> wrong.
> 
> The latest stateful codec spec is here:
> 
> https://hverkuil.home.xs4all.nl/codec-api/uapi/v4l/dev-mem2mem.html
> 
> Assuming what I described above is indeed the case, then I think this should
> be documented. I don't know enough if a flag is needed somewhere to describe
> the behavior for interlaced formats, or can we leave this open and have userspace
> detect this?
> 
> For decoders it is more complicated. The stateful decoder spec is written with
> the assumption that userspace can just fill each OUTPUT buffer to the brim with
> the compressed bitstream. I.e., no need to split at frame or other boundaries.
> 
> See section 4.5.1.7 in the spec.
> 
> But I understand that various HW decoders *do* have limitations. I would really
> like to know about those, since that needs to be exposed to userspace somehow.

So in "4.5.1.7. Decoding", there is a bit of confusion. The text speaks
about the order of frames in capture and output, but the bullet points
state that output buffers aren't frames. The following note about
timestamps creates more confusion: it says, not very affirmatively,
that there is potentially timestamp matching that lets you detect
reordering done by the driver, but it does not clarify how the
timestamps are to be handled if the packing is random.
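
To illustrate what timestamp matching gives you when it does work, here
is a rough sketch. It is a toy model, not the V4L2 ioctl flow: the
function names and the plain integer timestamps are made up for the
example, and it assumes each OUTPUT buffer holds exactly one frame.

```python
# Toy model of timestamp matching; not real V4L2 code.
# All names are hypothetical and timestamps are plain integers.

def match_capture_to_output(output_ts, capture_ts):
    """Map each dequeued CAPTURE buffer to its source OUTPUT buffer.

    output_ts  -- timestamps in the order OUTPUT buffers were queued
    capture_ts -- timestamps in the order CAPTURE buffers were dequeued
    Returns a list of OUTPUT indices, with None where nothing matches.
    """
    index_by_ts = {ts: i for i, ts in enumerate(output_ts)}
    return [index_by_ts.get(ts) for ts in capture_ts]


def was_reordered(mapping):
    """True if the matched buffers came out in a different order."""
    matched = [i for i in mapping if i is not None]
    return matched != sorted(matched)


# Decode order I P B B, display order I B B P.
queued = [1000, 4000, 2000, 3000]
dequeued = [1000, 2000, 3000, 4000]
mapping = match_capture_to_output(queued, dequeued)
reordered = was_reordered(mapping)
```

With random packing, one OUTPUT buffer may carry several frames or only
part of one, so a lookup like this has nothing meaningful to key on.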

What seems entirely missing from what we discussed is a per-format
clarification of the behaviour of each codec. I was assuming the NAL
alignment would be documented for the H264 and HEVC formats. It makes
sense to allow some more flexibility since these formats are
bytestreams with start codes, but to me, full-frame behaviour is what
existing userspace expects and we should make this the de facto
default. And if the buffer size ends up too small (badly predicted), I
believe we should use the source change event to allow handling that.
That being said, we have been able to survive this for a long time.
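
For reference, this is roughly what NAL alignment means on the
userspace side for a bytestream with start codes. It is only a sketch:
the function name is made up, it only looks for the 00 00 01 start code
(with optional leading zero bytes), and it relies on a NAL unit never
ending in a zero byte.

```python
# Sketch of splitting an Annex B style bytestream at start codes.
# The function name is hypothetical; this is not taken from any driver.

START_CODE = b"\x00\x00\x01"


def split_annexb_nals(data: bytes):
    """Return the NAL units found in data, with start codes stripped."""
    nals = []
    pos = data.find(START_CODE)
    while pos != -1:
        start = pos + len(START_CODE)
        nxt = data.find(START_CODE, start)
        end = len(data) if nxt == -1 else nxt
        # Zero bytes before the next start code belong to a 4-byte
        # start code (or padding), not to this NAL unit.
        nal = data[start:end].rstrip(b"\x00")
        if nal:
            nals.append(nal)
        pos = nxt
    return nals


stream = (b"\x00\x00\x00\x01\x67\xaa"
          b"\x00\x00\x01\x68\xbb"
          b"\x00\x00\x00\x01\x65\xcc\xdd")
nals = split_annexb_nals(stream)
# For H264 the NAL unit type is the low 5 bits of the first byte.
nal_types = [n[0] & 0x1F for n in nals]
```

Going from NAL-aligned to full-frame buffers additionally needs the
access unit boundaries, which this sketch does not try to detect.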

For VP8 and VP9, which don't really have a bytestream format, I do
assume it's logical to enforce full frames always. But if not, special
care is needed to ensure the driver can reconstruct the full frames,
since a firmware won't be able to parse the frame boundaries. Now, when
I saw you taking over, I thought it was clear that this was only the
common bits of the spec and that a per format specification would be
developed later.

> Specifically, the venus decoder needs to know the resolution of the coded video
> beforehand and it expects a single frame per buffer (how does that work for
> interlaced formats?).

If the firmware works in a 1:1 fashion, with H264 you may have two AUs
to compose one frame in an interlaced stream (and that may change for
each frame). In HEVC you'd always have two AUs.

> 
> Such requirements mean that some userspace parsing is still required, so these
> decoders are not completely stateful.

There was a discussion about the meaning of stateful/stateless.
This is not strictly related to parsing; the amount of parsing being
affected is a side effect. A stateful decoder HW (or firmware) offers
an interface with streams. It hides the state of the decoded stream. As
a side effect, the HW can only be multiplexed if the firmware handles
that. On the other hand, stateless decoders offer an API where you
configure the decoding of a frame (and sometimes a slice). Two
consecutive frames do not have to be part of the same stream, which has
the side effect of allowing applications to handle their own
multiplexing.

> 
> Can every codec author give information about their decoder/encoder?
> 
> I'll start off with my virtual codec driver:
> 
> vicodec: the decoder fully parses the bitstream. The encoder produces a single
> compressed frame per buffer. This driver doesn't yet support interlaced formats,
> but when that is added it will encode one field per buffer.

I just wanted to highlight that there is a lot of behaviour specific to
the formats here. Especially this last one, since it implies that the
capture format will be field = ALTERNATE for interlaced decoding (this
is a relatively rare format). So the behaviour here can already be
inferred from the capture format (apart from the fact that the
interlace mode cannot be enumerated, so for encoding, it's a bit of a
pain to guess). And the spec already contains the information needed to
match the pairs (or detect a lost field).
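
As a rough sketch of that pairing, assuming (as I read the buffer
documentation for ALTERNATE) that the top and bottom field of one frame
carry the same sequence number. The names are made up, and the strings
stand in for V4L2_FIELD_TOP and V4L2_FIELD_BOTTOM.

```python
# Toy model of pairing alternate fields; not real V4L2 code.
# Assumption: both fields of a frame share one sequence number.

TOP, BOTTOM = "top", "bottom"  # stand-ins for V4L2_FIELD_TOP / _BOTTOM


def pair_fields(buffers):
    """Group dequeued field buffers into frames.

    buffers -- iterable of (sequence, field) in dequeue order
    Returns (complete, incomplete): sequence numbers that have both
    fields, and sequence numbers where one field is missing.
    """
    seen = {}
    for seq, field in buffers:
        seen.setdefault(seq, set()).add(field)
    complete = [s for s, f in seen.items() if f == {TOP, BOTTOM}]
    incomplete = [s for s, f in seen.items() if f != {TOP, BOTTOM}]
    return complete, incomplete


dequeued = [(0, TOP), (0, BOTTOM), (1, TOP), (2, TOP), (2, BOTTOM)]
complete, incomplete = pair_fields(dequeued)
```

Here sequence 1 is reported as incomplete because its bottom field
never showed up.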

> 
> Let's see what the results are.
> 
> Regards,
> 
> 	Hans