I am the assigned Gen-ART reviewer for this draft. For background on Gen-ART, please see the FAQ at <http://wiki.tools.ietf.org/area/gen/trac/wiki/GenArtfaq>.

Please resolve these comments along with any other Last Call comments you may receive.

Document: draft-ietf-codec-opus-12.txt
Reviewer: Elwyn Davies
Review Date: 14 May 2012 (completed)
IETF LC End Date: 10 May 2012
IESG Telechat date: (if known)

Summary:

Before offering some views on the document, let me say that this piece of work seems to be a tour de force on the part of its developers. It is certainly one of the (if not the) most technically complex pieces of work that has been presented to the IETF, and is far more mathematically complex and rigorous than our usual run of standards. It is also perhaps a vindication of the rough consensus and running code philosophy. So congratulations to the authors and the companies that have supported this work. But back to the review....

I came to this review with a very outdated view of what goes on inside codecs, so it was rather a shock to the system. Taking that into account, I think that some additional high-level, up-front description of the two parts (SILK and CELT) and the range encoding system would make it easier for future (naive) readers and potential implementers to understand what is going on. For example, after struggling with the description of CELT in the decoder (s4.3), I found that going off and reading the slides on CELT presented at IETF 79 helped considerably. Such additions might also help future dissemination of the technique by giving potential users a quick overview. It would also make it easier to justify the decision to postpone the high-level (and highly mathematical/technical) view to Section 5, after the detailed description of the decoder.

I also found that the descriptions of the SILK and CELT decoders appeared to be written from different perspectives (see detailed comments), doubtless by different authors. This is not ideal and, despite the expected further increase in page count, some additional text in s4.3 seems desirable. By far the most difficult section to parse is 4.3.3, on bit allocation in CELT. Despite considerable study, I didn't feel I had really got to grips with how this worked. An example might help.

Finally, the most contentious aspect of this document is the requirement to treat the code as normative. There are just over 30,000 physical lines of code and I have hardly checked them at all. SLOCCount reckons this represents 7.14 man-years of effort. As with the document, there is a discrepancy between CELT and SILK, with the latter code being more heavily commented, especially as regards routine parameters. A proper validation of the claim that the code implements the description would take several weeks of time: I guess that we have to take the document shepherd's assurances on this. One issue is that both code and document are pretty much saturated with what we used to call 'magic numbers'. Although these are explained in the document, it does not appear that this is always the case in the code. I would also be happier if the code contained Doxygen (or similar) documentation statements for all routines (rather than just the API).

So overall, the work is something that ought to be standardized, but the document needs further work - not quite ready for the IESG.

Major issues:

Can we accept the code as normative? If not, how do we proceed?
Minor issues:

Contrast between decoder descriptions of the LP part and the CELT part: The SILK descriptions go into gory detail on the values used in lots of tables, etc., whereas the CELT part has a very limited treatment of the numeric values used (assuming reliance on finding the values in the reference implementation, either explicitly or implicitly). There are things to be said for both approaches. I was wondering (while reading the SILK description) whether the authors have any means of automatically generating the tables from the code in the SILK part (or vice versa) to avoid double maintenance. On the other hand, there are pieces of the CELT decoder description (especially in s4.3.3, where knowing the numbers of bands, etc., matters) where some actual numbers would help comprehension.

s2 (and more generally): The splitting of the signal in the frequency domain into signal (components?) below and above 8 kHz in the hybrid case presumably requires that the output of the CELT encoder is windowed so that only one of the encoders generates output below 8 kHz. I think something is needed in s2 to explain how this is managed (presumably to do with energy bands?). I didn't find anything in s5 about what has to be done in the encoder when running in hybrid mode, which surprised me somewhat.

s4.1: Despite having found a copy of the range coding paper and tried to read it, I found the description of range coding opaque (there are some more words later in s5, but this is a bit late for understanding the decoder). Given that this is (I think) fairly novel, some additional, less opaque description could be very helpful to people trying to understand what is going on. In particular, knowledge of how entropy coding works is pretty much a prerequisite.

s4.2.5, para 3:
> When switching from 20 ms to 10 ms, the 10 ms
> 20 ms Opus frame, potentially leaving a hole that needs to be
> concealed from even a single packet loss.
How?

s4.2.5, para 4:
> In order to properly produce LBRR frames under all conditions, an
> encoder might need to buffer up to 60 ms of audio and re-encode it
> during these transitions.  However, the reference implementation opts
> to disable LBRR frames at the transition point for simplicity.
Should this be phrased in RFC 2119 requirements language? The first part sounds like a SHOULD with the second part being the get-out, but it's not entirely clear what the consequences are other than simplicity.

======================================================================
MINOR ISSUES ABOVE submitted as part 1 of review
======================================================================

general/s11.2: Several references [Hadamard], [Viterbi], etc., are to Wikipedia pages. Whilst these are convenient (and only illustrative), they are not guaranteed to be very stable. Better (i.e., more stable) references are desirable.

s1, para 2:
> The decoder contains a great deal of integer and fixed-point
> arithmetic which must be performed exactly, including all rounding
> considerations,...
Is this a MUST? There are instances in the text which might contradict this (e.g., para 1 of s4.2.7.5, which has (capitalized) SHOULDs).

s4.3: As a complete newcomer to CELT, I would have appreciated a more high-level understanding of what CELT is doing at this point. I tried reading s4.3 without any additional input and found it very hard going. Eventually I gave up and went looking for some additional input.
This presentation seems to have a useful view: http://www.ietf.org/proceedings/79/slides/codec-2.pdf. I think that it would be extremely helpful to have a description similar to this at this point in the document, even though there is some material in Section 5.3 which could also be forward referenced. Still, the material in s5.3 does not start from the basic principles that CELT is using, and since these are essentially novel, it would be very good to give prospective implementers/users an understanding of what is going on. Incidentally, I found the above IETF presentation more useful than http://www.celt-codec.org/presentations/misc/lca-celt.pdf. Note that the SILK part seems rather less opaque. It would also be useful to indicate numerically how many bands are involved and what the number of MDCT bins is in the various bands.

s4.3, last para: Can the 'out of range error' occur in the LP decoder? (If not, why not?)

======================================================================

Nits/editorial comments:

global: bytes -> octets

global: The form/terminology Q<n> (e.g., Q13, Q15, Q16) ought to be explained.

s1: Expand CELP.

s1.1.6: Need to define floor().

s2: ? Expand SILK.

s2: Reference for Vorbis.

s2.1.8: Expand AAC (and MP3).

s3: References for Ogg, Matroska, maybe RTP.

s3: Fig 1, Table 2 and intervening text: Presumably SILK only (Table 2), etc., correspond to MODES 1-3 in Figure 1. This needs to be consistent.

s3.2.1: Make it clear which of the octets is len[0]/len[1]. To be precise, it might be better to say len0/len1 are the values of the two length octets (in whichever order you intend). The form len[0] could be misinterpreted as a function 'length of 0'.

s3.2.5: better s/figure below/Figure 5/

s3.2.5:
> In the CBR case, the compressed length of each frame in bytes is
> equal to the number of remaining bytes in the packet after
> subtracting the (optional) padding, (N-2-P), divided by M.  This
> number MUST be a non-negative integer multiple of M.
'This number' is not the compressed length of each frame that is the subject of the first sentence, but the number of remaining octets - this needs rewording.

s3.2.5:
> The number of header bytes (TOC byte, frame
> count byte, padding length bytes, and frame length bytes), plus the
> length of the first M-1 frames themselves, plus the length of the
> padding MUST be no larger than N, the total size of the packet.
Surely this is a non sequitur? This might be better phrased as 'The total size of a well-formed packet MUST be at least...'

s3.3: The example diagrams ought to have figure numbers.

s3.4: I am not keen on duplicating normative requirements in this way (double maintenance issue). It would be better to put explicit numbered requirements in the sections above and reference the resulting numbers here.

s4.1:
> The decoder initializes rng to 128 and initializes val to
> 127 minus the top 7 bits of the first input octet.
How are the 'top seven bits' to be interpreted here? E.g., as the bottom seven bits of an 8-bit integer field? As an 8-bit integer with the lowest bit zeroed out? (See my sketch following the s4.1.1 comment below for my best guess.)

s4.1.1: This is really a global point. This section refers to entdec.c. Presumably (since we haven't reached the code yet, and it is still compressed) there is some file structure. I don't think this has been said above. It would be good to provide a list of the file components (i.e., the sectional structure of the code) at the start, maybe even giving line number positions within the decompressed code.
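To illustrate the reading I am guessing at for the s4.1 initialization (and the related octet handling in s4.1.1.1), here is a minimal sketch in C. This is my own interpretation, not text or code taken from the draft, and the function and variable names are hypothetical:

   #include <stdint.h>

   /* Sketch (mine, not the draft's) of how I read the range decoder
    * initialization in s4.1: "top 7 bits" taken as the first octet
    * shifted right by one, with the low bit carried over for later use. */
   static void range_dec_init_sketch(const uint8_t *buf,
                                     uint32_t *rng, uint32_t *val,
                                     int *leftover_bit)
   {
       *rng = 128;
       *val = 127 - (buf[0] >> 1);   /* "127 minus the top 7 bits" */
       *leftover_bit = buf[0] & 1;   /* low bit saved; s4.1.1.1 later uses it
                                        as bit 7 of the next 8-bit value sym */
   }

If that is the intended reading, saying so explicitly (e.g., 'the first octet shifted right by one') would remove the ambiguity.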
s4.1.1.1:
> Then it reads the next octet of the
> payload and combines it with the left-over bit buffered from the
> previous octet to form the 8-bit value sym.  It takes the left-over
> bit as the high bit (bit 7) of sym, and the top 7 bits of the octet
> it just read as the other 7 bits of sym.
This is not well phrased. Better: "Then it reads the next octet of the payload [packet? 'payload' hasn't really been used before] and combines the left-over bit from the previous octet (see Section 4.1 for starting this process) as the high bit (bit 7) of 'sym' and the top 7 bits of the octet as the other 7 bits of sym, leaving the remaining bit for the next iteration."

s4.1.5.2: Should r_Q15 = rng >> (l-16) be r_Q15 = rng >> (lg-16)? There doesn't seem to be an 'l' defined.

s4.2.1: Expand LTP earlier. It would also be useful to expand LPC again.

s4.2.2: The acronym VAD is not expanded until the beginning of s4.2.3.

s4.2.7: The acronym LSF needs to be expanded on first use.

s4.2.7.1: Explain briefly why Table 7 has values for indices 0 to 15 when wi0/1 are in the range 0 to 14.

s4.2.7.4, para below Table 12:
> These 6 bits are combined to form a gain index between 0 and 63.
s/gain index/gain_index/ as this variable is used subsequently.

s4.2.7.4: The use of log_gain seems slightly confusing when combined with gain_index. At least one of them is presumably log-scaled. Maybe a bit more explanation is needed.

======================================================================
COMMENTS ABOVE submitted as part 1 of review
======================================================================

General: Define L1 and L2 (as in L1-norm, etc.).

s4.2.7.2, last para:
> In that case, if this
> flag is zero (indicating that there should be a side channel), then
> Packet Loss Concealment (PLC, see Section 4.4) SHOULD be invoked to
> recover a side channel signal.
What are the consequences (or what actions need to be taken) if it is not invoked?

s4.2.7.5, para 1:
> These represent the interleaved zeros on the
> unit circle between 0 and pi (hence "normalized") in the standard
> decomposition of the LPC filter into a symmetric part and an anti-
> symmetric part (P and Q in Section 4.2.7.5.6).
'on the unit circle between 0 and pi' might be clearer as 'on the upper half of the unit circle' or 'on the half of the unit circle in the positive imaginary area of the complex plane'. 'Standard decomposition'? Needs a reference.

s4.2.7.5, para 1: A reference for the use of LSFs in LPC would be useful.

s4.2.7.5.x: There is inconsistent use of stage 1/stage 2 vs stage-1/stage-2. Please be consistent.

s4.2.7.5, para 1:
> Because of non-linear
> effects in the decoding process, an implementation SHOULD match the
> fixed-point arithmetic described in this section exactly.  An encoder
> SHOULD also use the same process.
- Does this contradict the 'must' in s1, para 2?
- What are the consequences of ignoring the SHOULD? How bad would they get? Might it become unstable, and how would one know?

s4.2.7.5.1, para 1: s/This indexes an element in a coarse codebook, selects the PDFs for the second stage of the VQ/This indexes an element in a coarse codebook that selects the PDFs for the second stage of the VQ/

s4.2.7.5.3, last para:
> However, nothing in
> either the reconstruction process or the quantization process in the
> encoder thus far guarantees that the coefficients are monotonically
> increasing and separated well enough to ensure a stable filter.
A reference that indicates why this requirement is needed would be desirable (and also for s4.2.7.5.8).
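To make concrete the property I am asking about (both here and in the next comment on Table 25), here is an illustrative check in C. This is my own paraphrase of what I understand the stabilization step to be enforcing, not code from the draft; NLSF_Q15 and NDeltaMin_Q15 stand in for the decoded coefficients and the Table 25 spacing values, and NDeltaMin_Q15 appears to have d+1 entries (one extra for the gap to the upper end of the range):

   /* Illustrative only: my paraphrase of the ordering/separation property
    * that s4.2.7.5.3/s4.2.7.5.4 seem to require for a stable filter.
    * Returns 1 if the d coefficients are monotonically increasing and each
    * gap (including the gaps to the fixed endpoints 0 and 32768 in Q15)
    * meets the corresponding minimum spacing. */
   static int nlsf_ordering_ok_sketch(const int *NLSF_Q15,
                                      const int *NDeltaMin_Q15, int d)
   {
       int k;
       for (k = 0; k <= d; k++) {
           int prev = (k == 0) ? 0     : NLSF_Q15[k - 1];
           int next = (k == d) ? 32768 : NLSF_Q15[k];
           if (next - prev < NDeltaMin_Q15[k])
               return 0;   /* out of order or too closely spaced */
       }
       return 1;
   }

A sentence in the draft stating this property, plus a pointer to a reference on why LSF spacing matters for stability, would answer the question.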
s4.2.7.5.4 and Table 25: Are the values in Table 25 NDeltaMin or NDeltaMin_Q15? The equations after Table 25 use both NDeltaMin and NDeltaMin_Q15. Is this correct? In particular, the first two equations deliver _Q15 values but use raw NDeltaMin.

s4.1.1/s4.2.7.1 and other places: The term 'exact integer division' is used in various places. My understanding was that this phrase implied that it was known that the dividend was an exact multiple of the divisor by some out-of-band means. This doesn't seem to be the case generally in Opus (e.g., where both n/5 and n%5 are needed - clearly this doesn't anticipate n%5 == 0 every time!). So what does 'exact integer division' imply? A definition may be needed.

s4.3, last para: s/described in the figure above./described in Table 55 above./

s4.3.1:
> The "transient" flag encoded in the bitstream has a probability of 1/8.
This statement appears out of the blue. Some more explanation of what the transient flag actually implies and why we should be so sure about its PDF would help.

s4.3.2.1: Arguably a reference is needed for the z-transform.

s4.3.2.1: Avoid the equation picture splitting across page boundaries. In the current version it is unclear what the denominator is. (Use the needLines processing directive in xml2rfc.) The same applies to the equation below Table 57 in s4.3.4.3.

s4.3.2.1, after the equations:
> The
> prediction is clamped internally so that fixed point implementations
> with limited dynamic range do not suffer desynchronization.
As a person with limited skills in the art, I have no idea what desynchronization implies here.

s4.3.2.1, ibid:
> We
> approximate the ideal probability distribution of the prediction
> error using a Laplace distribution with separate parameters for each
> frame size in intra- and inter-frame modes.
I suspect this sentence belongs before the equation describing the z-transform. Where are the values of the parameters for the inter-frame mode defined (the intra-frame ones are in the text)?

s4.3.2.3: Paragraph on decoding band boosts: Might be improved by using equations rather than the wordy descriptions used at present.

(global) s4.3.2.3, para above Table 56: s/iff/if and only if/

s4.3.2.3: LOG2_FRAC_TABLE is missing.

s4.3.3: It would be helpful to explain, either here or at the outset of s4.3 overall, how the concept of energy bands and MDCT bins applies to the CELT part of the codec, and just how many bands and bins are used. Some of this is contained in s5.3.2, but the magic number 17 appears later in 4.3.3, which is presumably something to do with the point in the frequency domain at which CELT takes over from LP in the hybrid mode. It would make the very complex Section 4.3.3 rather easier to understand with this extra information - I have to say I struggled! On reflection, I think an example of what bits are allocated to a band and how they are subsequently used would be quite helpful - without delving into the code, I am really not clear that I understand just what bits are allocated and what they then encode, and I have read the text quite a few times now.

s4.3.3: Be consistent between 'tone to noise' and 'tone-to-noise'.

s4.3.3:
> The band-energy normalized structure of Opus MDCT mode ensures that a
> constant bit allocation for the shape content of a band will result
> in a roughly constant tone to noise ratio, which provides for fairly
> consistent perceptual performance.
> The effectiveness of this
> approach is the result of two factors: that the band energy, which is
> understood to be perceptually important on its own, is always
> preserved regardless of the shape precision, and because the constant
> tone-to-noise ratio implies a constant intra-band noise to masking
> ratio.  Intra-band masking is the strongest of the perceptual masking
> effects.  This structure means that the ideal allocation is more
> consistent from frame to frame than it is for other codecs without an
> equivalent structure.
This paragraph contains a number of interesting assertions: Is there a reference where one could see them justified? (It may be that this is the result of original research by the Opus team.)

s4.3.3, paragraphs after the bullet points: The concepts of 'shape' and 'shape encoding' are introduced here without explicit definition. Are we talking about the shape windowing used in FFT/MDCT here? This should be made clear.

s4.3.3, 5th para after bullet points: s/In the reference the maximums/In the reference implementation the maximums/

s4.3.3, 6th para after the bullet points: A table of bands per mode and the number of MDCT bins covered would be helpful here in order to get a feeling for the scale of the problem. Also, the cache_caps50 table in the code contains the magic number 168. Where does this come from?

s4.3.3, 6th para after the bullet points:
> set LM to the shift value for the frame size (e.g. 0 for 120, 1 for
> 240, 3 for 480),
Where do these frame sizes get specified? And what is the total set of frame sizes? The text says 'e.g.' (which incidentally should be 'e.g.,'), implying that this is not the complete set. (See my sketch following the s4.3.5 comment below for what I believe the relationship is.)

s4.3.3, 6th para after the bullet points: Need to define 'truncating integer division' to go with 'exact integer division'.

s4.3.3, 7th para after the bullet points:
> The band boosts are represented by a series of binary symbols which
> are coded with very low probability.
How many, at least, and what values? Are these range encoded? I don't see them in the table above or with a PDF specified.

s4.3.3, 7th para after the bullet points:
> and every time
> a band is boosted the initial cost is reduced (down to a minimum of
> two).
Would that be a value of two or two bits?

s4.3.3: Paragraph on decoding band boosts: Might be improved by using equations rather than the wordy descriptions used at present.

(global) s4.3.3, para above Table 56: s/iff/if and only if/

s4.3.3, 2nd para after Table 56:
> For stereo frames, bits are reserved for intensity stereo and for
> dual stereo.  Intensity stereo requires ilog2(end-start) bits.
The terms 'intensity stereo' and 'dual stereo' don't appear to have been defined.

s4.3.3: LOG2_FRAC_TABLE is missing.

s4.3.4.3, last para:
> If the decoded vector represents more than one time block, then the
> following process is applied separately on each time block.
Should this sentence come before the previous paragraph? There isn't really a 'following process' in this section, and I don't think it means the process in s4.3.4.4?

s4.3.4.3, last sentence:
> This extra rotation is applied in an interleaved manner with a stride
> equal to round(sqrt(N/nb_blocks))
I think this needs some more explanation for the uninitiated.

s4.3.4.4:
> Multiple levels of splitting may be
> applied up to a frame size dependent limit.
What this limit is does not appear to be defined.

s4.3.5: The 'collapse' phenomenon is not fully defined, and it would be useful to mention why it happens. Also s/min/minimum/.
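For what it is worth, my understanding of the LM question above (gleaned mostly from the code, so please treat it as an assumption to be confirmed rather than something the draft states) is that the CELT frame sizes are 120, 240, 480, and 960 samples at 48 kHz (2.5, 5, 10, and 20 ms), and that LM is simply the left shift relating them. A hypothetical helper, just to pin down what I mean:

   /* My assumption about the LM <-> frame size relationship:
    * frame_size = 120 << LM, so LM is 0, 1, 2, 3 for
    * 120, 240, 480, 960 samples (2.5/5/10/20 ms at 48 kHz). */
   static int lm_from_frame_size_sketch(int frame_size)
   {
       int lm = 0;
       while ((120 << lm) < frame_size)
           lm++;
       return lm;
   }

If that is right, then stating the complete set explicitly in the text would settle the question (and the '3 for 480' in the quoted example would appear to be a typo).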
s4.3.6: s/Just like/Just as/

s4.3.7, last para: I think 'power complementarity' requires further explanation or a reference.

s4.5, para 3: s/To avoid or reduces glitches during these/To avoid or reduce glitches during these/

s4.5.1.1, para 1: s/For for SILK-only/For SILK-only/

s4.5.1.4, para 2: s/redundant frame is as-is,/redundant frame as-is,/

s5, Figure 16: The Optional High-pass Filter box has two spurious '+' symbols on the vertical sides.

s5, last para: A reference for the Auto Regressive Moving Average (ARMA) filter would be useful.

s5.2.3.4.2.1, title: s/Burgs method/Burg's Method/

s5.2.3.5, para 2: Expand 'R/D performance' (or probably specify it as an abbreviation for rate-distortion in para 1).

s5.3.5, para 1 below the equation: Is E an abbreviation for 'extra'?

s5.3.6: The abbreviation RD for rate-distortion is defined here (see the comment on s5.2.3.5).

s6.1: This section is perhaps a little 'triumphalist' for the reference implementation (this may of course be justified!). The quality metric is a 'relative quality metric', and presumably if someone does a *really* good job of coding, it is entirely possible that the new algorithms might get better than 100 on the quality score (i.e., provide a better answer than the reference implementation).

s6.2: Just wondering, but is a non-standard frame size the only option offered by Opus Custom? If not, probably more text is needed here. Are there any major changes to the algorithms implied by the use of Opus Custom?

s7: Referencing SECGUIDE (RFC 3552) seems inappropriate, since the reference occurs in just such a security considerations section (SECGUIDE being guidance on writing them). Just omit it.

s11.2: A number of references are to Wikipedia pages. While these were useful to me in refreshing or initializing my knowledge, they are not usually considered adequately stable for use in RFCs. I fear you may have to provide more stable references.

Appendix A: I checked that the code extracted as indicated and was able to be compiled under Ubuntu 10.04 LTS.