Gen-art 2nd LC review of draft-ietf-payload-vp8-16 (with summary)

Elwyn Davies <elwynd@xxxxxxxxxxxxxx> · Mon, 13 Jul 2015 13:01:54 +0100

I am the assigned Gen-ART reviewer for this draft. For background on
Gen-ART, please see the FAQ at

<http://wiki.tools.ietf.org/area/gen/trac/wiki/GenArtfaq>.

Please resolve these comments along with any other Last Call comments
you may receive.

Document: draft-ietf-payload-vp8-16.txt
Reviewer: Elwyn Davies
Review Date: 2015/07/10
IETF LC End Date: 2015/07/13
IESG Telechat date: (if known) -

Summary: Gosh, it has been a long time since the last review! Almost 
ready.  Still needs various minor nits fixed and some clarifications.

Major issues:

Minor issues:
s4.2: Various of the frame indices either SHOULD or, in the case of
KEYIDX, MAY start at random values.  The rationale(s) for these
requirements are not explained - nor is the reason why it it is only a
MAY for KEYIDX whereas it is SHOULD for PictureID and what might happen
were the MAY/SHOULD to be ignored.

Is the rationale something that should be mentioned in the security
considerations?

s10.1:  The downref to RFC 6386 identified by idnits was duly noted in 
the last call
announcement.  I don't have a problem with the downref.

Nits/editorial comments:

General: Consistent usage of "prediction and mode" vs "prediction/mode" 
would be desirable.

 s1, para 2: Needs to introduce the 'macroblock' jargon defined in RFC 
6386. Suggest:
OLD:
 VP8 is based on decomposition of frames into square sub-blocks of
 pixels, prediction of such sub-blocks using previously constructed
 blocks, and...
NEW:
 VP8 is based on decomposition of frames into square sub-blocks of
 pixels known as "macroblocks" (see Section 2 of [RFC6386]).  Prediction
 of such sub-blocks using previously constructed blocks, and...
END

s2: There are a number of technical terms brought over from RFC 6386.
These should be listed in s2.  At least the following would be in the list:
key frame, interframe, golden frame, altref frame, macroblock.
Also the terms picture selection index (RPSI) and slice loss indication 
(SLI) from RFC 4585.

s3, para 2: Providing a reference to explain what temporal scalability
and the hierarchy selection is all about would be useful.  One possibility
would be:

Lim, J. Yang, and B. Jeon,”Fast Coding Mode Decision for Scalable Video 
Coding”,
10th Int’l Conf. on Advanced Communication Technology, vol. 3, pp. 
1897-1900, 2008.

It would also help to add a pointer to the TL0PICIDX description in 
s4.2, thus:
ADD:
   Support for temporal scalability is provided by the optional 
TL0PICIDX and TID/Y/KEYIDX
fields described in Section 4.2.
END

ss3 and 4: The discussion of how the frame payload might or might not be 
distributed across
multiple packets is not very clear.   The draft has in mind a 'preferred 
use case' (para 1
of s4.4 says:

In the preferred use case, the S bit and PID fields
    described in Section 4.2 should be used to indicate what the packet
    contains.

I think it would be helpful to be a bit more positive about this in s3, 
para 3:
OLD:
   This memo
   allows the partitions to be sent separately or in the same RTP
   packet.  It may be beneficial for decoder error-concealment to send
   the partitions in different packets, even though it is not mandatory
   according to this specification.
NEW:
Accordingly, this document RECOMMENDS that the frame is packetized by 
the sender
with each data partition in a separate packet or packets.  This may be 
beneficial
for decoder error concealment and the payload format described in 
Section 4 provides
fields that allow the partitions to be identified even if the first 
partition is
not available.  The sender can, alternatively, aggregate the data 
partitions into
a single data stream and, optionally, split it into several packets without
consideration of the partition boundaries.  The receiver can use the length
information in the first partition to identify the partitions during 
decoding.
END

s4/Figure 1: s/bytes/octets/ (2 places)

Figure 1, caption: s/sequel/Sections 4.2 and 4.3/

Figure 1: The format shown is correct only for the first packet in a 
frame.  Subsequent
packets will not contain the payload header and will have later octets 
of the frame payload.
This should be mentioned in drawing and/or in the caption.

s4.1, last two paras:
    Sequence number:  The sequence numbers are monotonically increasing
       and set as packets are sent.

       The remaining RTP header fields are used as specified in
       [RFC3550].
Redefining the Sequence Number field seems gratuitous and potentially 
dangerous,
given that it is *very carefully* defined in RFC 3550 and the overall 
intent does
not seem any different from that in RFC 3550.  OTOH a note about the 
(non?-)use of the X bit
and the Payload Type field (PT) would be useful.  I suggest replacing 
the paras above with:
NEW:
   X bit: MUST be 0.  The VP8 RTP profile does not use the RTP Header 
Extension capability.

   PT (Payload Type):  In line with the policy in Section 3 of 
[RFC3551], applications
      using the VP8 RTP payload profile MUST assign a dynamic payload 
type number to
      be used in each RTP session and provide a mechanism to indicate 
the mapping.
      See Section 6.2 for the mechanism to be used with the Session 
Description Protocol (SDP)
      [RFC4566].

      The remaining RTP Fixed Header Fields (V, P, CC, sequence number, 
SSRC and CSRC
      identfiers) are used as specified in Section 5.1 of [RFC5330].
END
Note that this may be considered to make the reference to RFC 3551 
normative.

s4.2, X bit:  There is potential for confusion between the X bit in the 
fixed header
and the X bit in the VP8 Payload Descriptor.  Perhaps changing this to C 
for control bits
would avoid the problem.

s4.2, M bit: Similarly, it might be better to use B (for BIG) in place 
of M to avoid confusion.

s4.2, para 7 after Fig 2: For consistency:
OLD:
   When the X bit is set to 1 in the first octet, the extension bit
   field octet MUST be provided as the second octet.  If the X bit is 0,
   the extension bit field octet MUST NOT be present, and no extensions
   (I, L, T, or K) are permitted.
NEW:
   When the X [or C] bit is set to 1 in the first octet, the Extended 
Control Bits
   field octet MUST be provided as the second octet.  If the X [or C] 
bit is 0,
   the extension bit field octet MUST NOT be present, and no extensions
   (I, L, T, or K) are permitted.
END

s4.2: The issue of using either 'wrap' or 'restart' but not both for 
restarting number sequences has been raised in various comments by IESG 
members.

s4.2, TL0PICIDX: I think, given the statements in s3 about not defining 
the hierarchy, that more explanation is needed of what the 'lowest or 
base layer' actually means.

s4.2, TL0PICIDX:
  If TID is larger than 0, TL0PICIDX
          indicates which base layer frame the current image depends on.
          TL0PICIDX MUST be incremented when TID is 0.
What happens if L=1 but both T=0 and K=0 so that there is no TID value 
present? Or indeed if T=0 but K=1 so that the TID field is there but 
'MUST be ignored by the receiver'  (definition of TID field)?

s4.3: It would be useful to include the section number (9.1) where the 
uncompressed data chunk specification is found.  I was somewhat 
surprised that the ordering is reversed in the protocol spec.

s4.3, VER: The receiver should validate this field - currently only 
values 0-3 are valid.

s4.3, SizeN:  Make it clear these are integer fields: s/in Size0,/in 
integer fields Size0,/

s4.3 and s4.4:  It would be helpful to implementers to explain what the 
offsets in the  first partition payload are for the additional partition 
count and additional partition sizes.  Having rummaged around in RFC 
6386, I have to say that I am unclear whether the values are placed 
after the full chunk of data indicated by 1stPartitionSize or are part 
of this data. And if the latter is the case, how to work out where the 
partition sizes are.  I guess that they ought to be at offset 
(1stPartitionSize+10 - key frame - or 1stPartitionSize+3 - interframe) 
in the VP8 Payload field as it is difficult to work out where they are 
without decoding the partition otherwise.  Also exactly how many 
partition sizes to expect - I think I read in RFC 6386 that the last 
partition size is implicit. Help!

In the course of clearing this up, it might be appropriate to move the 
last sentence of the first para of s4.4 and the second para of s4.4 into 
s4.3 as part of the explanation.  This is the relevant text:
    Note that the length of the first partition can
    always be obtained from the first partition size parameter in the VP8
    payload header.

    The VP8 bitstream format [RFC6386] specifies that if multiple DCT/WHT
    partitions are produced, the location of each partition start is
    found at the end of the first (prediction/mode) partition.  In this
    RTP payload specification, the location offsets are considered to be
    part of the first partition.

s4.4:  I found this section very difficult to parse.  It is a bit 
'stream of consciousness' and would benefit from a reordering and 
rewrite.  Suggestion (assuming the second para gets moved to s4.3):
NEW SECTION 4.4:
   An encoded VP8 frame can be divided into two or more partitions, as
   described in Section 1.  It is OPTIONAL for a packetizer implementing
   this RTP specification
   to pay attention to the partition boundaries within an encoded frame.
   If packetization of a frame is done without considering the partition
   boundaries, the PID field MAY be set to zero for all packets, and the
   S bit MUST NOT be set to one for any other packet than the first.

   If the preferred usage suggested in Section 3 is followed, with each 
packet
   carrying data from exactly one partition, the S bit and PID fields
   described in Section 4.2 SHOULD be used to indicate what the packet
   contains.  The PID field should indicate which partition the first
   octet of the payload belongs to, and the S bit indicates that the
   packet starts on a new partition.

   If the packetizer does not pay attention to the partition 
boundaries, one
   packet can contain a fragment of a  partition, a complete partition, 
or an
   aggregate of fragments and partitions.  There is no explicit 
signaling of
   partition boundaries in the payload and the partition lengths at the 
end of
   the first partition have to be used to identify the boundaries. 
Partitions
   MUST be aggregated in decoding order.  Two fragments from different
   partitions MAY be aggregated into the same packet along with one or more
   complete partitions.

  In all cases, the payload of a packet MUST contain data from only one 
video
  frame.  Consequently the set of packets carrying the data from a 
particular
  frame will contain exactly one VP8 Payload Header (see Section 4.3) 
carried
  in the first packet of the frame.  The last, or only, packet carrying 
data for the
  frame MUST have the M bit set in the RTP header.

s4.5.1:  Th algorithm only applies for the RECOMMENDED use case with 
partitions in separate packets.
This should be noted.

s5.2, last para: s/only parts of the frame is corrupted./only part of 
the frame is corrupted./

s6: s/This payload format has two required parameters./This payload 
format has two optional parameters./
[I think this is probably a hangover from the previous version.]

s7: s/Cryptographic system may also allow/The cryptographic system may 
also allow/

s7:
the usage of SRTP [RFC3711] is recommended.
RECOMMENDED?

s10.2: I suspect that RFC 3551 is normative.