Re: Gen-art last call review of draft-ietf-codec-opus-12.txt (completed)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Elwyn,

Thanks for the very thorough review. We've addressed your issues and
submitted draft version -13. See our response to each of the issues you
raised (aggregated from all the authors) in the attached document.

Cheers,

	Jean-Marc

On 12-05-14 09:13 AM, Elwyn Davies wrote:
> I am the assigned Gen-ART reviewer for this draft. For background on 
> Gen-ART, please see the FAQ at 
> <http://wiki.tools.ietf.org/area/gen/trac/wiki/GenArtfaq>. 
> 
> Please resolve these comments along with any other Last Call comments 
> you may receive. 
> 
> Document: draft-ietf-codec-opus-12.txt
> Reviewer:  Elwyn Davies
> Review Date:  14 May 2012 (completed)
> IETF LC End Date: 10 May 2012
> IESG Telechat date: (if known) -
> 
Elwyn Davies wrote:
> Major issues: 
> Can we accept the code as normative?  If not how do we proceed?

The issue with code being normative was specifically addressed in 
the guidelines document for this WG (RFC 6569).

> Minor issues:
> Contrast between decoder descriptions of LP part and CELT part:  The
> SILK descriptions go into gory detail on the values used in lots of
> tables, etc., whereas the CELT part has a very limited treatment of the
> numeric values used (assuming reliance on finding the values in the
> reference implementation, either explictly or implicitly).  There are
> things to be said for both techniques.  I was wondering (while reading
> the SILK description) if the authors have any means of automatically
> generating the tables from the code in the SILK part (or vice versa) to
> avoid double maintenance. On the other hand, there are pieces of the
> CELT decoder description (especially in s4.3.3 where knowing numbers of
> bands, etc.) where some actual numbers would help comprehension.
> 

We have made many changes to section 4.3 (and 4.3.3 specifically) to address
the specific issues below. As for the tables, they are not generated
automatically.


> s2 (and more generally):  The splitting of the signal in the frequency
> domain into signal (components?) below and above 8kHz presumably
> requires that the signal is subjected to a Discrete Fourier Transform to
> allow the signal to be split.  I think sometging is needed in s2 to
> explain how this is managed (or if I don't understand, to explain why it
> isn't necessary).

No DFT is used. The lower band is obtained through resampling (which is already described) and the higher band is obtained by not coding the lower band with CELT (the text says that CELT starts at band 17 in hybrid mode). The explanation was reworded to make this as clear as possible at this point in the text.

> s4.1: Despite having found a copy of the range coding paper and tried to
> read it, I found the whole business of range coding opaque.  Given that
> this is (I think) fairly novel, some additional less opaque description
> could be very helpful to people trying to understand what is going on.
> In particular knowledge of how entropy coding works is pretty much a
> prerequisite.

Added a link to the Wikipedia article on range coding.

> s4.2.5, para 3:
>>     When switching from 20 ms to 10 ms, the 10 ms
>>     20 ms Opus frame, potentially leaving a hole that needs to be
>>     concealed from even a single packet loss.
> How?

As explained in the LBRR text, a 10 ms frame will only contain 10 ms LBRR data even if the previous frame was 20 ms, so there's 10 ms "missing".

> s4.2.5, para 4:
>> In order to properly produce LBRR frames under all conditions, an
>>     encoder might need to buffer up to 60 ms of audio and re-encode it
>>     during these transitions.  However, the reference implementation opts
>>     to disable LBRR frames at the transition point for simplicity.
>>
> Should this be phrased in RFC 2119 requiremenmts language?  The first
> part sounds like a SHOULD with the second part being the get out , but
> its not entirely clear what the consequences are other then simplicity.

There really is no SHOULD there. We're just describing possible strategies for encoding. In general, we have avoided 2119 language as much as possible for the encoder. However, the sentence, "Since transitions are relatively infrequent in normal usage, this does not have a significant impact on packet loss robustness," was added to alleviate any concerns about the consequences of this decision.

> ======================================================================
>              MINOR ISSUES ABOVE submitted as part 1 of review
> ======================================================================
> general/s11.2:  Several references [Hadamard], [Viterbi}, etc., are to
> Wikipaedia pages.  Whilst these are convenient (and only illustrative)
> they are not guaranteed to be very stable.  Better (i.e., more stable)
> references are desirable.

We believe that the links themselves should be quite stable. We could 
also link specific revisions in Wikipedia, but the quality of the content
is more likely to get better than worse. Considering
that these references are informational, we believe that Wikipedia is a
good source.

> s1, para 2:
>>     The decoder contains a great deal of integer and fixed-point
>>     arithmetic which must be performed exactly, including all rounding
>>     considerations,...
> Is this a MUST?  There are instances in the text which might contradict
> this (e.g., para 1 of s4.2.7.5 which has (capitalized) SHOULDs).

This sentence did not mean to imply that _all_ such arithmetic must be performed exactly, only that there exists a good deal of arithmetic for which this is true (for example, to achieve the final range check MUST from Section 6). Since it isn't clear at this point exactly which arithmetic is being referred to, we can't use RFC 2119 strength here. Replaced "must be" with "needs to be" to avoid the impression of an RFC 2119 requirement.

> s4.3:
> As a complete newcomer to CELT, I would have appreciated a more high
> level understanding of what CELT is doing at this point.  I  tried
> reading s4.3 without any additional input and found it very hard going.
> Eventually I gave up and went looking for some additional input.  This
> presentation seems to have a useful view 
> http://www.ietf.org/proceedings/79/slides/codec-2.pdf
> I think that it would be extremely helpful to have a description similar
> to this at this point in the document, even though there is some
> material in section 5.3 which could also be forward referenced.  Still
> the material in s5.3 does not start from the basic principles that CELT
> is using, and since these are essentially novel, it would be very good
> to give prospective implementers/users an understanding of what is going
> on.  Incidentally, I found the above IETF presentation more useful than
> http://www.celt-codec.org/presentations/misc/lca-celt.pdf
> Note that the SILK part seems rather less opaque.  It would also be
> useful to indicate numerically how many bands are involved and what the
> number of MDCT bins are in the various bands. 

The intro of section 4.3 has been expanded with general information about CELT
similar to what the codec-2.pdf slides from 79 included. 

> s4.3, last para: Can the 'out of range error' occur in the LP decoder? (if not why not?)

Added the following text:
"Such out of range errors cannot occur in the SILK layer."


> ======================================================================
> 
> Nits/editorial comments:
> 
> global: bytes ->  octets

We believe that in the field of audio codecs, the mention of "byte" without
further context is well understood to mean 8 bits.

> global: The form/terminology Q<n>  (e.g., Q13, Q15, Q16) ought to be
> explained.

This was already defined in the section on notation:
   The notation "Q<n>", where n
   is an integer, denotes the number of binary digits to the right of
   the decimal point in a fixed-point number. 

> s1: Expand CELP

Done.
 
> s1.1.6: Need to define floor().

Done. Also defined abs(), ceil(), and round().

> s2: ? Expand SILK

SILK is not an acronym.

> s2: Reference for Vorbis.

Linked to http://xiph.org/vorbis/

> s2.1.8: Expand AAC (and MP3).

Expanded
MP3: MPEG 1, Layer 3
AAC: Advanced Audio Coding

> s3: References for Ogg, Matroska, maybe RTP.

Done.

> s3: Fig 1, Table 2 and intervening text:  Presumably SILK only (Table 2)
> etc correspond to MODES 1-3 in Figure 1. This needs to be consistent.

Changed LP to SILK and MDCT to CELT.

> s3.2.1: Make it clear which of the octets is len[0]/len[1].  To be
> precise it might be better to say len0/len1 are the values of the two
> length octets (in whichever order you intend). The form len[0] could be
> misinterpreted as a function 'length of 0'.

Replaced len[0] and len[1] with first_byte and second_byte, respectively.

> s3.2.5: better s/figure below/Figure 5/

Fixed for all cases of "figure below".

> s3.2.5:
>> In the CBR case, the compressed length of each frame in bytes is
>>     equal to the number of remaining bytes in the packet after
>>     subtracting the (optional) padding, (N-2-P), divided by M. This
>>     number MUST be a non-negative integer multiple of M.
> 'This number' is not the  compressed length of each frame that is the
> subject of the first sentence, but the number of remaining octets - this
> needs rewording.

Fixed by defining R=N-2-P, e.g.,
   In the CBR case, let R=N-2-P be the number of bytes remaining in the
   packet after subtracting the (optional) padding.  Then the compressed
   length of each frame in bytes is equal to R/M. The value R MUST be a
   non-negative integer multiple of M. The compressed data for all M
   frames follows, each of size R/M bytes, as illustrated in Figure 6.
 
> s3.2.5:
>> The number of header bytes (TOC byte, frame
>>     count byte, padding length bytes, and frame length bytes), plus the
>>     length of the first M-1 frames themselves, plus the length of the
>>     padding MUST be no larger than N, the total size of the packet.
> Surely this is a non sequitur? This might be better phrased as 'The
> total size of a well formed packet MUST be at least...'

Note that the text mentions the "length of the first M-1 frames", so the calculation does not include the size of the last frame. That being said, we also clarified by using "signaled length" instead of "length".

> s3.3: The example diagrams ought to have figure numbers.

Fixed.

> s3.4: I am not keen on duplicating normative requirements in this way
> (double maintenance issue).  It would be better to put explicit numbered
> requirements in the sections above an reference the resulting numbers
> here.

A checklist style summary is quite useful for implementors, and worth the maintenance burden for a document that is (hopefully) going to be published once and read many times. The list intentionally does not include any RFC 2119 keywords, to avoid any conflict should there (accidentally) be room to interpret the re-statement any differently from the original statement. Numbering the requirements and referencing the numbers is still a good idea, but it should be possible to read the list without flipping back and forth to the previous sections.

> s4.1:
>> The decoder initializes rng to 128 and initializes val to
>>     127 minus the top 7 bits of the first input octet.
> How are the 'top seven bits' to be interpreted here? e.g. as the bottom
> seven bits of a 8 bit integer field? an 8 bit integer with the lowest
> bit zeroed out?

Tried for something more explicit:
   Let b0 be the first input octet (or zero if there are no octets in
   this Opus frame).  The decoder initializes rng to 128 and initializes
   val to (127 - (b0>>1)), where (b0>>1) is the top 7 bits of the first
   input octet.  It saves the remaining bit, (b0&1), for use in the
   renormalization procedure described in ...

> s4.1.1:  This is really a global point.  This section refers to
> entdec.c.  Presumably (since we haven't reached the code yet) and it is
> still compressed, there is some file structure.  I don't think this has
> been said above.  It would be good to provide a list of the file
> components (i.e., sectional structure of the code) at the start, maybe
> even  giving line number positions within the decompressed code.

In cases where it was deemed necessary (e.g. large functions), there are
indeed line numbers in references to the code. As for a list of files
we did not think it was useful because one already needs to decompress
the code to see the references.

> s4.1.1.1:
>> Then it reads the next octet of the
>>     payload and combines it with the left-over bit buffered from the
>>     previous octet to form the 8-bit value sym.  It takes the left-over
>>     bit as the high bit (bit 7) of sym, and the top 7 bits of the octet
>>     it just read as the other 7 bits of sym.
> This is not well phrased.  Better
>       Then it reads the next octet of the payload [packet? payload hasn't
> really been used before] and combines the left  over bit from the

"Packet" is inaccurate, as the next byte in the packet may come from a different Opus frame. Replaced with "Opus frame" instead.

> previous octet (see Section 4.1 for starting this process) as the high
> bit (bit 7)| of 'sym' and the top 7 bits of the octet as the other 7
> bits of sym, leaving the remaining bit for the next iteration.

This may be trying to do too much in a single sentence. Perhaps:
  Then it reads the next octet of the Opus frame and forms an 8-bit
  value sym, using the left-over bit buffered from the previous octet
  as the high bit and the top 7 bits of the octet just read as the
  other 7 bits of sym.  The remaining bit in the octet just read is
  buffered for use in the next iteration.  If no more input octets
  remain, it uses zero bits instead.  See
  <xref target="range-decoder-init"/> for the initialization used to
  process the first octet.

Also broke out range-decoder-init into its own section to make it easier to reference.

> s4.1.5.2:
> Should r_Q15 = rng>>  (l-16) be r_Q15 = rng>>  (lg-16)?  There doesn't
> seem to be an 'l' defined.

Fixed. Sorry; this is a leftover from changing "l" to "lg" in response to one of the AD's review comments.

> s4.2.1: Expand LTP earlier. It would also be useful to expand LPC again.

Done (both).

> s4.2.2: acronym VAD is not expanded until the beginning of s4.2.3.

Fixed.

> s4.2.7: acronym LSF needs to be expanded on first use.

Fixed.

> s4.2.7.1: Explain briefly why Table 7 has values for indices 0 to 15
> when wi0/1 are in range 0 to 14.

Added:
   Although wi0 and wi1 only have 15 possible values, Table 7 contains
   16 entries to allow interpolation between entry wi0 and (wi0 + 1)
   (and likewise for wi1).

> s4.2.7.4, para below Table 12:
>> These 6 bits are combined to form a gain index between 0 and 63.
> s/gain index/gain_index/ as this variable is used subsequently.

Used:
  These 6 bits are combined to form a value, gain_index, between 0 and
  63.
Also updated the "delta gain index" for subframes without an independent gain, below.

> s4.2.7.4: The use of log_gain seems slightly confusing when combined
> with gain_index.  One at least is presumably log scaled.  Maybe a bit
> more explanation is needed.

The gain_index happens to map directly to log_gain for subframes with an independent gain (with the exception of the clamping), but this is not true for subframes where the gain is coded relative to the previous subframe. We are open to suggestions that would make the need for this distinction clearer.

Also replaced "previous gain index" with "previous_log_gain", as it is the latter that actually matters.

> ======================================================================
>              COMMENTS ABOVE submitted as part 1 of review
> ======================================================================
> s4.2.7.2, last para:
>> In that case, if this
>>     flag is zero (indicating that there should be a side channel), then
>>     Packet Loss Concealment (PLC, see Section 4.4) SHOULD be invoked to
>>     recover a side channel signal.
> What are the consequences (or what actions need to be taken) if it is
> not invoked?

Added:
  Otherwise, the stereo image will collapse.

> s4.2.7.5, para 1:
>> These represent the interleaved zeros on the
>>     unit circle between 0 and pi (hence "normalized") in the standard
>>     decomposition of the LPC filter into a symmetric part and an anti-
>>     symmetric part (P and Q in Section 4.2.7.5.6).
> 'on the unit circle between 0 and pi' might be clearer as 'on the upper
> half of the unit circle' or 'on the half of the unit circle in the
> positive imaginary area of the complex plane'.
> 'standard decomposition'?  Needs a reference.

Reworded as:
   These represent the interleaved zeros on the upper half of the unit
   circle (between 0 and pi, hence "normalized") in the standard
   decomposition [line-spectral-pairs] of the LPC filter into a
   symmetric part and an anti-symmetric part (P and Q in Section
   4.2.7.5.6).

Added the reference [line-spectral-pairs] pointing to http://en.wikipedia.org/wiki/Line_spectral_pairs

> s4.2.7.5, para 1: A reference for the use of LSF in LPC would be useful.

The above reference serves this function as well.

> s4.2.7.5.x: There is inconsistent use of stage 1/stage 2 vs
> stage-1/stage-2.  Please be consusistent

"Stage 1" is correct when referring to the noun by itself, or as an object of a verb ("stage 1 decoding"). "Stage-1" is correct when modifying another noun ("stage-1 index"). However, we changed the titles of some of the tables around to allow them to use the modifier form more consistently.

> s4.2.7.5, para 1:
>>     Because of non-linear
>>     effects in the decoding process, an implementation SHOULD match the
>>     fixed-point arithmetic described in this section exactly.  An encoder
>>     SHOULD also use the same process.
> - Does this contradict the 'must' in s1, para 2?

No, for the reasons described above.

> - What are the consequences of ignoring the SHOULD?  How bad would they
> get?  Might it become unstable and how would one know?

They would not become unstable. However, changing the exact arithmetic may
lead to a decoder output that is not close enough to that of the 
reference decoder to pass the conformance test.

> s4.2.7.5.1, para 1: s/This indexes an element in a coarse codebook,
>     selects the PDFs for the second stage of the VQ/This indexes an
>     element in a coarse codebook that selects the PDFs for the second stage
>     of the VQ/

The text as written is correct. The index I1 is what selects the PDFs for the second stage, not the vector from the coarse codebook in Tables 23 and 24. I.e., it's saying, "This does A, B, and C."

> s4.2.7.5.3, last para: 
>> However, nothing in
>>    either the reconstruction process or the quantization process in the
>>    encoder thus far guarantees that the coefficients are monotonically
>>    increasing and separated well enough to ensure a stable filter.
> A reference that indicates why this requirement is needed would be desirable.
> (and also for s4.2.5.7.8).

Added a reference to:
P. Kabal and R. P. Ramachandran, "The Computation of Line Spectral Frequencies 
Using Chebyshev Polynomials", IEEE Trans. Acoustics, Speech, Signal Processing,
vol. 34, no. 6, pp. 1419-1426, Dec. 1986.

> s4.2.7.5.4 and Table 25: Are the values in Table 25 NDeltaMin or NDelatMin_Q15?
> The equations after Table 25 use both NDeltaMin and NDeltaMin_Q15.  Is this correct?
> In particular the first two equations deliver _Q15 values but use raw NDeltaMin.

Good catch. All uses of NDeltaMin should have been NDeltaMin_Q15.

> s4.1.1/s4.2.7.1 and other places:  The term 'exact integer division' is
> used in various places.  My understanding was that this phrase implied
> that it was known that the dividend was an exact multiple of the divisor
> by some out-of-band means.  This doesn't seem to be the case generally
> in Opus (e.g,, where both n/5 and n%5  are needed - clearly this doesn't
> anticipate n%5 == 0 every time!)  So what does 'exact integer division'
> imply?  A definition may be needed. 

The "exact" here was merely 
meant in contrast to inexact floating-point division. Removed "exact", leaving only 
"integer division" (which, like all other operators, uses C conventions, as stated in 
Notation and Conventions). Hopefully this should avoid people trying to read too much into 
the phrase.

> s4.3, last para: s/described in the figure above./described in Table 55 above./

Fixed.

> s4.3.1: 
>> The "transient" flag encoded in the bitstream has a probability of 1/8. 
> This statement appears out of the blue apparently.  Some more
> explanation of what the transient flag actually implies and why we
> should be so sure about its PDF would help.

Added a bit of context and moved the probability 1/8 to later in the text.

> s4.3.2.1: Arguably a reference is needed for the z-transform.

Reference added.

> s4.3.2.1: Avoid the equation picture splitting across page boundaries.
> in the current version it is unclear what the denominator is. (use
> needLines processing direcrive in xml2rfc).  Same applies to the
> equation below Table 57 in s4.3.4.3.

It's not quite clear how to use needLines without undesirable side-effects.
Hopefully this is something the RFC editor should be able to handle.

> 4.3.2.1, after the equations:
>> The
>>    prediction is clamped internally so that fixed point implementations
>>    with limited dynamic range do not suffer desynchronization.
> As a person with limited skills in the srt, I have no idea what
> desynchronization implies here. 

Clarified to "always remain in the same state as floating point implementations".

> 4.3.2.1, ibid:
>> We
>>    approximate the ideal probability distribution of the prediction
>>    error using a Laplace distribution with separate parameters for each
>>    frame size in intra- and inter-frame modes.
> I suspect this sentence belongs before the equation described the z-transform.
> Where are the values of the parameters for the inter-frame mode defined
> (the intra-frame ones are in the text)?

The probability model is for the difference between the energy and its
prediction. Added a reference to the table where the pdf parameters are
located in the code.

> s4.3.2.3: Paragraph on decoding band boosts:  Might be improved by using
> equations rather than the wordy descriptions used at present.
> 
> (global)s4.3.2.3, para above table 56: s/iff/if and only if/

Fixed.

> s4.3.2.3: LOG2_FRAC_TABLE is missing.

Text now says that the table is in rate.c

> s4.3.3: It would be helpful to explain either here, or at the outset of
> s4.3 overall, how the concept of energy bands and MDCT bins applies to
> the CELT part of the codec, and just how many bands and bins are used.
> Some of this is contained in s5.3.2, but the magic number 17 appears
> later in 4.3.3 which is presumably something to do with the point in the
> frequency domain that CELT takes over from LP in the hybrid mode.  It
> would make the very complex section 4.3.3 rather easier to understand
> with this extra information - I have to say I struggled!  On reflection,
> I think an example of what bits are allocated to a band and how thay rae
> subsequently used would be quite helpful - Without going to delve into
> the code I am really not clear that I understand just what bits are
> allocated and what they then encode and I have read the text quite a few
> times now.

Added an explanation of band 17 at the beginning of 4.3. There is also
some text about the number of bins per band.

> s4.3.3: Be consistent between 'tone to noise' and 'tone-to-noise'.

The text was rephrased in terms of signal-to-noise.

> s4.3.3:
>>    The band-energy normalized structure of Opus MDCT mode ensures that a
>>    constant bit allocation for the shape content of a band will result
>>    in a roughly constant tone to noise ratio, which provides for fairly
>>    consistent perceptual performance.  The effectiveness of this
>>    approach is the result of two factors: that the band energy, which is
>>    understood to be perceptually important on its own, is always
>>    preserved regardless of the shape precision, and because the constant
>>    tone-to-noise ratio implies a constant intra-band noise to masking
>>    ratio.  Intra-band masking is the strongest of the perceptual masking
>>    effects.  This structure means that the ideal allocation is more
>>    consistent from frame to frame than it is for other codecs without an
>>    equivalent structure.
> This paragraph contains a number of interesting assertions:  Is there a
> reference where one could see them justified (it may be that this is the
> result of original research in the Opus team).

The particular importance of energy preservation was substantively a discovery
of the Vorbis team during the later tuning of that codec. Vorbis encoders 
include a separate step called 'noise normalization' which attempts to restore
lost energy. In CELT this energy preservation was made implicit in the design of the
codec. 

The text here is intended to explain the motivation behind the particular
band-energy normalized structure of this part of the codec in the hopes
that a higher-level understanding would help make the component steps make
a little more sense. This is "original research", but there is now a link
to an older paper describing this research:
   [Valin2010]
              Valin, JM., Terriberry, T., Montgomery, C., and G.
              Maxwell, "A High-Quality Speech and Audio Codec With Less
              Than 10 ms delay", IEEE Trans. on Audio, Speech and
              Language Processing, Vol. 18, No. 1, pp. 58-67 2010.


> s4.3.3, paragraphs after the bullet points:  The concepts of 'shape' and
> 'shape encoding' is introcuced here without explicit definition.  Are we
> talking about the shape windowing used in FFT/MDCT here? This should be
> made clear.

The concept of shape is now discussed at the beginning of s4.3.

> s4.3.3, 5th para after bullet points: s/In the reference the maximums/In
> the reference implementation the maximums/

Fixed.

> s4.3.3, 6th para after the bullet points:  A table of bands per mode and
> number of MDCT bins  covered would be helpful here in order to  get a
> feeling for the scale of the problem.  Also the cache_caps50 table in
> the code contains the magic number 168.  Where does this come from?

The layout of the band is now included at the beginning of 4.3. Also, section
4.3.3 now includes an explanation for the size of cache_caps50 
(21 bands, 4 values of LM, mono+stereo). 

> s4.3.3, 6th para after the bullet points:
>> set LM to the shift value for the frame size (e.g. 0 for 120, 1 for
>>    240, 3 for 480),  
> Where do these frame sizes get specified? And what is the total set of
> frame sizes? The text says 'e.g.' (which incidentally should be 'e.g.,')
> implying that this is not the complete set.

The calculation of LM is now given earlier in section 4.3.3


> s4.3.3, 6th para after the bullet points: Need to define 'truncating
> integer division' to go with 'exact integer division'.

Replaced with "integer division"

> s4.3.3, 7th para after the bullet points: 
>> The band boosts are represented by a series of binary symbols which
>>    are coded with very low probability.
> How many, at least, and what values?  Are these range encoded? I don't
> see them in the table above or with a PDF specified.

The text has been clarified to say that this is entropy coded. Also, the
probabilities that correspond to 6 bit and 2 bits (1/64 and 1/4, respectively)
are given to emphasize the fact that the number of bits is directly related to
the probabilities.

> s4.3.3, 7th para after the bullet points:
>>    and every time
>>    a band is boosted the initial cost is reduced (down to a minimum of
>>    two).
> Would that be a value of two or two bits? 

Two bits (fixed in the draft).

> s4.3.3: Paragraph on decoding band boosts:  Might be improved by using
> equations rather than the wordy descriptions used at present.
> 
> (global)s4.3.3, para above table 56: s/iff/if and only if/

Fixed.

> s4.3.3, 2nd para after Table 56: 
>> For stereo frames, bits are reserved for intensity stereo and for
>>    dual stereo.  Intensity stereo requires ilog2(end-start) bits.
> The terms 'intenmsity stereo' and 'dual stereo' don't appear to have
> been defined.

Added a definition of intensity, dual, and mid-side stereo to the beginning
of section 4.3. 

> s4.3.3: LOG2_FRAC_TABLE is missing.

Text now says that the table is in rate.c

> 4.3.4.3, last para:
>>    If the decoded vector represents more than one time block, then the
>>    following process is applied separately on each time block. 
> Should this sentence come before the previous paragraph?  There  isn't
> really a 'following process' in this section and I don't think it menas
> the process in s4.3.4.4?


Indeed. Replaced "the following process" with "this spreading process".

> s4.3.4.3, last sentence: 
>> This extra rotation is applied in an interleaved manner with a stride
>>    equal to round(sqrt(N/nb_blocks))
> I think this needs some more explanation for the uninitiated.

Although "stride" is a relatively common term, we added this to the sentence:
"i.e. it is applied independently for each set of sample S_k = {stride*n + k},
n=0..N/stride-1."

> s4.3.4.4:
>> Multiple levels of splitting may be
>>    applied up to a frame size dependent limit. 
> What this limit is does not appear to be defined.

Added the limit of LM+1 splits.

> s4.3.5: The 'collapse' phenomenon is not fully defined, and it would be
> useful to mention why it happens.  Also s/min/minimum/.

Added a short description of the anti-collapse feature and fixed the
s/min/minimum/

> s4.3.6: s/Just like/Just as/

Fixed.

> s4.3.7, last para: I think 'power complementarity' requires further
> explanation or a reference.

Added reference to:
John P. Princen and Alan B. Bradley, "Analysis/synthesis filter bank design based on time domain aliasing cancellation," IEEE Trans. Acoust. Speech Sig. Proc. ASSP-34 (5), 1153-1161 (1986)

> s4.5, para 3: s/To avoid or reduces glitches during these/To avoid or
> reduce glitches during these/

Fixed.

> s4.5.1.1, para 1: s/For for SILK-only/For SILK-only/

Fixed.

> s4.5.1.4, para 2: s/redundant frame is as-is,/redundant frame as-is,/

Fixed.

> s5, figure 16: The Optional High-pass Filter box has two spurious '+'
> symbols on the vertical sides. 

Fixed.

> s5, last para:  A reference for the Auto Regressive Moving Average
> (ARMA) filter would be useful.

This is a simple issue and a reference to the entire theory would have made 
the issue more complicated. The parenthesis now states "i.e. with poles and 
zeros" to clarify this.

> s5.2.3.4.2.1, title: s/Burgs method/Burg's Method/

Fixed.

> s5.2.3.5, para 2: Expand 'R/D performance' (or probably specify it as
> abbreviation for rate-distortion in para 1).

Done.

> s5.3.5, para 1 below equation: Is E an abbreviation for 'extra'?

Oops. s/extra/E/ in the text below.

> s5.3.6: The abbreviation RD for rate-distortion is defined here (see
> comment on s5.2.3.5).

Fixed.

> s6.1: This section is perhaps a little 'triumphalist' for the reference
> implementation (this may of course be justified!.  The quality metric is
> a 'relative quality metric' and presumably if someone does a *really*
> good job of coding, it is entirely possible that the new algorithms
> might get better than 100 on the quality score (i.e., provide a better
> answer than the reference implementation).

Conformance with the specification is defined by a faithful reproduction of
the specified decoder behavior (see RFC 6569 s4.3). By specifying the
decoder, future encoders have the freedom to have improved quality with the
confidence that they know what output a conforming decoder will produce.

The tests in s6.1 are measuring the quality of the decoded signal against the 
reference decoded signal, not against the original audio, so greater than 100%
wouldn't be possible or meaningful. The test signals have been specially
prepared to thoroughly test a decoder implementation, and they sacrifice encoded
quality in order to rapidly exercise the corner cases.

> s6.2: Just wondering, but is non-standard frame size the only option
> offered by Opus Custom?  If not, probably more text is needed here.  Are
> there any major changes to the algorithms implied by the use of Opus
> Custom?

Opus-custom permits more flexibility for frame-sizes at a cost of limiting
the mode switching, requiring more dynamic memory, etc. The algorithm itself
remains exactly the same, though some things like the band count/layout change
to accommodate the different frame sizes while still approximating the same Bark
structure.

> s7.  Referencing SECGUIDE (RFC 3552) seems inappropriate since it occurs
> in such a security considerations section. Just omit it.

Done.

> s11.2: A number of references are to Wikipaedia pages.  While these were
> useful to me in refreshing or initializing my knowledge, they are not
> usually considered adequately stable for use in RFCs.  I fear you may
> have to provide more stable references.

In the cases where we use Wikipedia citations, they are intended as informative 
general references for introductory subject matter, and we are not relying on any
specific details or wording that would be affected by further revisions. While 
the alternatives are often dense academic papers that may not be well suited
to  a general codec implementing audience.


[Index of Archives]     [IETF Annoucements]     [IETF]     [IP Storage]     [Yosemite News]     [Linux SCTP]     [Linux Newbies]     [Fedora Users]