Le mercredi 15 mai 2019 à 19:42 +0200, Paul Kocialkowski a écrit : > Hi, > > Le mercredi 15 mai 2019 à 10:42 -0400, Nicolas Dufresne a écrit : > > Le mercredi 15 mai 2019 à 12:09 +0200, Paul Kocialkowski a écrit : > > > Hi, > > > > > > With the Rockchip stateless VPU driver in the works, we now have a > > > better idea of what the situation is like on platforms other than > > > Allwinner. This email shares my conclusions about the situation and how > > > we should update the MPEG-2, H.264 and H.265 controls accordingly. > > > > > > - Per-slice decoding > > > > > > We've discussed this one already[0] and Hans has submitted a patch[1] > > > to implement the required core bits. When we agree it looks good, we > > > should lift the restriction that all slices must be concatenated and > > > have them submitted as individual requests. > > > > > > One question is what to do about other controls. I feel like it would > > > make sense to always pass all the required controls for decoding the > > > slice, including the ones that don't change across slices. But there > > > may be no particular advantage to this and only downsides. Not doing it > > > and relying on the "control cache" can work, but we need to specify > > > that only a single stream can be decoded per opened instance of the > > > v4l2 device. This is the assumption we're going with for handling > > > multi-slice anyway, so it shouldn't be an issue. > > > > My opinion on this is that the m2m instance is a state, and the driver > > should be responsible of doing time-division multiplexing across > > multiple m2m instance jobs. Doing the time-division multiplexing in > > userspace would require some sort of daemon to work properly across > > processes. I also think the kernel is better place for doing resource > > access scheduling in general. > > I agree with that yes. We always have a single m2m context and specific > controls per opened device so keeping cached values works out well. > > So maybe we shall explicitly require that the request with the first > slice for a frame also contains the per-frame controls. > > > > - Annex-B formats > > > > > > I don't think we have really reached a conclusion on the pixel formats > > > we want to expose. The main issue is how to deal with codecs that need > > > the full slice NALU with start code, where the slice_header is > > > duplicated in raw bitstream, when others are fine with just the encoded > > > slice data and the parsed slice header control. > > > > > > My initial thinking was that we'd need 3 formats: > > > - One that only takes only the slice compressed data (without raw slice > > > header and start code); > > > - One that takes both the NALU data (including start code, raw header > > > and compressed data) and slice header controls; > > > - One that takes the NALU data but no slice header. > > > > > > But I no longer think the latter really makes sense in the context of > > > stateless video decoding. > > > > > > A side-note: I think we should definitely have data offsets in every > > > case, so that implementations can just push the whole NALU regardless > > > of the format if they're lazy. > > > > I realize that I didn't share our latest research on the subject. So a > > slice in the original bitstream is formed of the following blocks > > (simplified): > > > > [nal_header][nal_type][slice_header][slice] > > Thanks for the details! > > > nal_header: > > This one is a header used to locate the start and the end of the of a > > NAL. There is two standard forms, the ANNEX B / start code, a sequence > > of 3 bytes 0x00 0x00 0x01, you'll often see 4 bytes, the first byte > > would be a leading 0 from the previous NAL padding, but this is also > > totally valid start code. The second form is the AVC form, notably used > > in ISOMP4 container. It simply is the size of the NAL. You must keep > > your buffer aligned to NALs in this case as you cannot scan from random > > location. > > > > nal_type: > > It's a bit more then just the type, but it contains at least the > > information of the nal type. This has different size on H.264 and HEVC > > but I know it's size is in bytes. > > > > slice_header: > > This contains per slice parameters, like the modification lists to > > apply on the references. This one has a size in bits, not in bytes. > > > > slice: > > I don't really know what is in it exactly, but this is the data used to > > decode. This bit has a special coding called the anti-emulation, which > > prevents a start-code from appearing in it. This coding is present in > > both forms, ANNEX-B or AVC (in GStreamer and some reference manual they > > call ANNEX-B the bytestream format). > > > > So, what we notice is that what is currently passed through Cedrus > > driver: > > [nal_type][slice_header][slice] > > > > This matches what is being passed through VA-API. We can understand > > that stripping off the slice_header would be hard, since it's size is > > in bits. Instead we pass size and header_bit_size in slice_params. > > True, there is that. > > > About Rockchip. RK3288 is a Hantro G1 and has a bit called > > start_code_e, when you turn this off, you don't need start code. As a > > side effect, the bitstream becomes identical. We do now know that it > > works with the ffmpeg branch implement for cedrus. > > Oh great, that makes life easier in the short term, but I guess the > issue could arise on another decoder sooner or later. > > > Now what's special about Hantro G1 (also found on IMX8M) is that it > > take care for us of reading and executing the modification lists found > > in the slice header. Mostly because I very disliked having to pass the > > p/b0/b1 parameters, is that Boris implemented in the driver the > > transformation from the DPB entries into this p/b0/b1 list. These list > > a standard, it's basically implementing 8.2.4.1 and 8.2.4.2. the > > following section is the execution of the modification list. As this > > list is not modified, it only need to be calculated per frame. As a > > result, we don't need these new lists, and we can work with the same > > H264_SLICE format as Cedrus is using. > > Yes but I definitely think it makes more sense to pass the list > modifications rather than reconstructing those in the driver from a > full list. IMO controls should stick to the bitstream as close as > possible. For Hantro and RKVDEC, the list of modification is parsed by the IP from the slice header bits. Just to make sure, because I myself was confused on this before, the slice header does not contain a list of references, instead it contains a list modification to be applied to the reference list. I need to check again, but to execute these modification, you need to filter and sort the references in a specific order. This should be what is defined in the spec as 8.2.4.1 and 8.2.4.2. Then 8.2.4.3 is the process that creates the l0/l1. The list of references is deduced from the DPB. The DPB, which I thinks should be rename as "references", seems more useful then p/b0/b1, since this is the data that gives use the ability to implementing glue in the driver to compensate some HW differences. In the case of Hantro / RKVDEC, we think it's natural to build the HW specific lists (p/b0/b1) from the references rather then adding HW specific list in the decode_params structure. The fact these lists are standard intermediate step of the standard is not that important. > > > Now, this is just a start. For RK3399, we have a different CODEC > > design. This one does not have the start_code_e bit. What the IP does, > > is that you give it one or more slice per buffer, setup the params, > > start decoding, but the decoder then return the location of the > > following NAL. So basically you could offload the scanning of start > > code to the HW. That being said, with the driver layer in between, that > > would be amazingly inconvenient to use, and with Boyer-more algorithm, > > it is pretty cheap to scan this type of start-code on CPU. But the > > feature that this allows is to operate in frame mode. In this mode, you > > have 1 interrupt per frame. > > I'm not sure there is any interest in exposing that from userspace and > my current feeling is that we should just ditch support for per-frame > decoding altogether. I think it mixes decoding with notions that are > higher-level than decoding, but I agree it's a blurry line. I'm not worried about this either. We can already support that by copying the bitstream internally to the driver, though zero-copy with this would require a new format, the one we talked about, SLICE_ANNEX_B. > > > But it also support slice mode, with an > > interrupt per slice, which is what we decided to use. > > Easier for everyone and probably better for latency as well :) > > > So in this case, indeed we strictly require on start-code. Though, to > > me this is not a great reason to make a new fourcc, so we will try and > > use (data_offset = 3) in order to make some space for that start code, > > and write it down in the driver. This is to be continued, we will > > report back on this later. This could have some side effect in the > > ability to import buffers. But most userspace don't try to do zero-copy > > on the encoded size and just copy anyway. > > > > To my opinion, having a single format is a big deal, since userspace > > will generally be developed for one specific HW and we would endup with > > fragmented support. What we really want to achieve is having a driver > > interface which works across multiple HW, and I think this is quite > > possible. > > I agree with that. The more I think about it, the more I believe we > should just pass the whole [nal_header][nal_type][slice_header][slice] > and the parsed list in every scenario. What I like of the cut at nal_type, is that there is only format. If we cut at nal_header, then we need to expose 2 formats. And it makes our API similar to other accelerator API, so it's easy to "convert" existing userspace. > > For H.265, our decoder needs some information from the NAL type too. > We currently extract that in userspace and stick it to the > slice_header, but maybe it would make more sense to have drivers parse > that info from the buffer if they need it. On the other hand, it seems > quite common to pass information from the NAL type, so maybe we should > either make a new control for it or have all the fields in the > slice_header (which would still be wrong in terms of matching bitstream > description). Even in userspace, it's common to just parse this in place, it's a simple mask. But yes, if we don't have it yet, we should expose the NAL type, it would be cleaner. > > > > - Dropping the DPB concept in H.264/H.265 > > > > > > As far as I could understand, the decoded picture buffer (DPB) is a > > > concept that only makes sense relative to a decoder implementation. The > > > spec mentions how to manage it with the Hypothetical reference decoder > > > (Annex C), but that's about it. > > > > > > What's really in the bitstream is the list of modified short-term and > > > long-term references, which is enough for every decoder. > > > > > > For this reason, I strongly believe we should stop talking about DPB in > > > the controls and just pass these lists agremented with relevant > > > information for userspace. > > > > > > I think it should be up to the driver to maintain a DPB and we could > > > have helpers for common cases. For instance, the rockchip decoder needs > > > to keep unused entries around[2] and cedrus has the same requirement > > > for H.264. However for cedrus/H.265, we don't need to do any book- > > > keeping in particular and can manage with the lists from the bitstream > > > directly. > > > > As discusses today, we still need to pass that list. It's being index > > by the HW to retrieve the extra information we have collected about the > > status of the reference frames. In the case of Hantro, which process > > the modification list from the slice header for us, we also need that > > list to construct the unmodified list. > > > > So the problem here is just a naming problem. That list is not really a > > DPB. It is just the list of long-term/short-term references with the > > status of these references. So maybe we could just rename as > > references/reference_entry ? > > What I'd like to pass is the diff to the references list, as ffmpeg > currently provides for v4l2 request and vaapi (probably vdpau too). No > functional change here, only that we should stop calling it a DPB, > which confuses everyone. Yes. > > > > - Using flags > > > > > > The current MPEG-2 controls have lots of u8 values that can be > > > represented as flags. Using flags also helps with padding. > > > It's unlikely that we'll get more than 64 flags, so using a u64 by > > > default for that sounds fine (we definitely do want to keep some room > > > available and I don't think using 32 bits as a default is good enough). > > > > > > I think H.264/HEVC per-control flags should also be moved to u64. > > > > Make sense, I guess bits (member : 1) are not allowed in uAPI right ? > > Mhh, even if they are, it makes it much harder to verify 32/64 bit > alignment constraints (we're dealing with 64-bit platforms that need to > have 32-bit userspace and compat_ioctl). I see, thanks. > > > > - Clear split of controls and terminology > > > > > > Some codecs have explicit NAL units that are good fits to match as > > > controls: e.g. slice header, pps, sps. I think we should stick to the > > > bitstream element names for those. > > > > > > For H.264, that would suggest the following changes: > > > - renaming v4l2_ctrl_h264_decode_param to v4l2_ctrl_h264_slice_header; > > > > Oops, I think you meant slice_prams ? decode_params matches the > > information found in SPS/PPS (combined?), while slice_params matches > > the information extracted (and executed in case of l0/l1) from the > > slice headers. > > Yes you're right, I mixed them up. > > > That being said, to me this name wasn't confusing, since > > it's not just the slice header, and it's per slice. > > Mhh, what exactly remains in there and where does it originate in the > bitstream? Maybe it wouldn't be too bad to have one control per actual > group of bitstream elements. > > > > - killing v4l2_ctrl_h264_decode_param and having the reference lists > > > where they belong, which seems to be slice_header; > > > > There reference list is only updated by userspace (through it's DPB) > > base on the result of the last decoding step. I was very confused for a > > moment until I realize that the lists in the slice_header are just a > > list of modification to apply to the reference list in order to produce > > l0 and l1. > > Indeed, and I'm suggesting that we pass the modifications only, which > would fit a slice_header control. I think I made my point why we want the dpb -> references. I'm going to validate with the VA driver now, to see if the references list there is usable with our code. > > Cheers, > > Paul > > > > I'm up for preparing and submitting these control changes and updating > > > cedrus if they seem agreeable. > > > > > > What do you think? > > > > > > Cheers, > > > > > > Paul > > > > > > [0]: https://lkml.org/lkml/2019/3/6/82 > > > [1]: https://patchwork.linuxtv.org/patch/55947/ > > > [2]: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/4d7cb46539a93bb6acc802f5a46acddb5aaab378 > > >
Attachment:
signature.asc
Description: This is a digitally signed message part