Hi, Jean-Marc. ... and thanks for the super-quick response! You have been quite busy.

I have had a look through the new draft and I think the additions help considerably with comprehension for the naive (and give new implementers a way in). I'll leave you to negotiate with the RFC Editor over the Wikipedia references. To quote the RFC Style Guide, http://www.rfc-editor.org/rfc-style-guide/rfc-style, Section 4.8, item (x) References, last section:

> URLs and DNS names in RFCs
>
> The use of URLs in RFCs is discouraged, because many URLs are not stable references. Exceptions may be made for normative references in those cases where the URL is demonstrably the most stable reference available. References to long-lived files on ietf.org and rfc-editor.org are generally acceptable.

They are certainly convenient *as long as they remain in place and aren't corrupted*.

I found a couple of trivial editorial nits in the changes:

s4.3.3 (in the added text):
> The CELT layer, however, can adapt over a very wide range of rates, and thus has a large number of codebooks sizes
s/codebooks/codebook/

s4.3.3, para after Table 57:
s?the maximums in bit/sample are precomputed?the maximums in bits/sample are precomputed?

Also suggest:

s4.3: Add reference for Bark scale:
Zwicker, E. (1961), "Subdivision of the audible frequency range into critical bands," The Journal of the Acoustical Society of America, 33, Feb., 1961.

A few responses in line below (agreed pieces elided).

Regards,
Elwyn Davies

====================================================================================

On Tue, 2012-05-15 at 20:33 -0400, Jean-Marc Valin wrote:
> Hi Elwyn,
>
> Thanks for the very thorough review. We've addressed your issues and submitted draft version -13. See our response to each of the issues you raised (aggregated from all the authors) in the attached document.

Thanks very much to all the authors.

> Cheers,
>
> Jean-Marc
>
> Elwyn Davies wrote:
> > Major issues:
> > Can we accept the code as normative? If not how do we proceed?
>
> The issue with code being normative was specifically addressed in the guidelines document for this WG (RFC 6569).

Yes. I failed to read this, although Robert Sparks did point out to me that the code was normative - but I don't think he said this was agreed in advance (or maybe I didn't understand what he was saying). To be honest, I would like to have had the time to tie the text to the code, but realistically that is several weeks' or even months' work to do it properly - without that I feel that I have only done half a job. I may decide that it interests me enough to have a superficial go next week, but no promises!

> > Minor issues:
> > Contrast between decoder descriptions of LP part and CELT part: The SILK descriptions go into gory detail on the values used in lots of tables, etc., whereas the CELT part has a very limited treatment of the numeric values used (assuming reliance on finding the values in the reference implementation, either explicitly or implicitly). There are things to be said for both techniques. I was wondering (while reading the SILK description) if the authors have any means of automatically generating the tables from the code in the SILK part (or vice versa) to avoid double maintenance. On the other hand, there are pieces of the CELT decoder description (especially in s4.3.3, where knowing the numbers of bands, etc. matters) where some actual numbers would help comprehension.
>
> We have made many changes to section 4.3 (and 4.3.3 specifically) to address the specific issues below. As for the tables, they are not generated automatically.

I think this is addressed satisfactorily now. There is still some difference but it is much reduced and not so glaring now. The addition of the band tables really helps.
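For concreteness, this is the sort of "actual numbers" view I had in mind. The boundary values below are my recollection of the eband5ms table from the reference implementation (the band tables in the draft are authoritative, so please check them rather than my memory); the little program just prints the band-to-MDCT-bin mapping for a 20 ms frame:

    /* Non-normative sketch of the CELT band layout.  The boundary
     * values are my recollection of eband5ms in the reference
     * implementation's modes.c; verify against the draft's tables. */
    #include <stdio.h>

    /* Band boundaries in MDCT bins for a 2.5 ms frame (21 bands). */
    static const int eband5ms[] = {
        0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16,
        20, 24, 28, 34, 40, 48, 60, 78, 100
    };

    int main(void)
    {
        int LM = 3;   /* 20 ms frame: boundaries scale by 2**LM = 8 */
        int i;
        for (i = 0; i < 21; i++) {
            printf("band %2d: bins %4d..%4d (%d bins)\n", i,
                   eband5ms[i] << LM, (eband5ms[i + 1] << LM) - 1,
                   (eband5ms[i + 1] - eband5ms[i]) << LM);
        }
        return 0;
    }

Even as a comment this makes the 21-band structure and the bins-per-band scaling immediately visible; if my values are right, it also shows band 17 starting at bin 40 << LM, i.e. 8 kHz, which is presumably exactly the hybrid crossover discussed below.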
> > s2 (and more generally): The splitting of the signal in the frequency domain into signal (components?) below and above 8kHz presumably requires that the signal is subjected to a Discrete Fourier Transform to allow the signal to be split. I think something is needed in s2 to explain how this is managed (or, if I don't understand, to explain why it isn't necessary).
>
> No DFT is used. The lower band is obtained through resampling (which is already described) and the higher band is obtained by not coding the lower band with CELT (the text says that CELT starts at band 17 in hybrid mode). The explanation was reworded to make this as clear as possible at this point in the text.

[I thought I had reworded this comment in the 2nd version to talk about MDCT, but no matter.] Yes, para 5 of s2 does say that the bands are discarded. I think it would be useful to have a concrete statement in the new text added to s4.3 that bands 0 to 16 are discarded in hybrid mode (thereby making the 17 in the band boost section more obvious). [There is a comment below that you have added some text about band 17 in section 4.3 but I can't see it.]

> > s4.2.5, para 3:
> >> When switching from 20 ms to 10 ms, the 10 ms
> >> 20 ms Opus frame, potentially leaving a hole that needs to be concealed from even a single packet loss.
> > How?
>
> As explained in the LBRR text, a 10 ms frame will only contain 10 ms LBRR data even if the previous frame was 20 ms, so there's 10 ms "missing".

Indeed - that there would be a hole was clear. The 'How' referred to how it would be concealed. Having read further by now, this may be down to Packet Loss Concealment - so maybe all it needs is a forward ref to s4.4.

> > s4.3:
> > As a complete newcomer to CELT, I would have appreciated a more high-level understanding of what CELT is doing at this point. I tried reading s4.3 without any additional input and found it very hard going. Eventually I gave up and went looking for some additional input. This presentation seems to have a useful view: http://www.ietf.org/proceedings/79/slides/codec-2.pdf I think that it would be extremely helpful to have a description similar to this at this point in the document, even though there is some material in section 5.3 which could also be forward referenced. Still, the material in s5.3 does not start from the basic principles that CELT is using, and since these are essentially novel, it would be very good to give prospective implementers/users an understanding of what is going on. Incidentally, I found the above IETF presentation more useful than http://www.celt-codec.org/presentations/misc/lca-celt.pdf Note that the SILK part seems rather less opaque. It would also be useful to indicate numerically how many bands are involved and what the number of MDCT bins are in the various bands.
>
> The intro of section 4.3 has been expanded with general information about CELT similar to what the codec-2.pdf slides from 79 included.

I think this is an excellent improvement.
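For any other reviewers coming to CELT cold, the idea that finally made it click for me is the energy/shape split that the new intro now describes. A toy sketch (my own illustration, not lifted from the draft or the reference code):

    /* Toy illustration of the energy/shape split at the heart of
     * CELT: each MDCT band is coded as a scalar energy (gain) plus a
     * unit-norm "shape" vector, so the energy envelope is preserved
     * exactly no matter how coarsely the shape is quantized. */
    #include <math.h>

    /* Split one band of MDCT coefficients x[0..n-1] into a gain and
     * a normalized shape.  The real codec codes the gain with
     * coarse+fine energy quantization and the shape with an
     * algebraic codebook (PVQ); both are omitted here. */
    static float band_split(float *x, int n)
    {
        float e = 1e-15f;      /* tiny bias avoids divide-by-zero */
        int i;
        for (i = 0; i < n; i++)
            e += x[i] * x[i];
        e = sqrtf(e);          /* band gain */
        for (i = 0; i < n; i++)
            x[i] /= e;         /* x is now a unit-norm shape */
        return e;
    }

Decoding reverses this: decode the gain, decode the unit-norm shape (the PVQ part), and multiply. Because the gain is coded separately, the energy envelope survives however coarsely the shape is quantized - which, as I now understand it, is the central design point.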
> > Nits/editorial comments:
> >
> > global: bytes -> octets
>
> We believe that in the field of audio codecs, the mention of "byte" without further context is well understood to mean 8 bits.

True. But this is a matter of IETF style. The style is to use octets where we mean 8-bit bytes. I think you now have a mixture!

> > global: The form/terminology Q<n> (e.g., Q13, Q15, Q16) ought to be explained.
>
> This was already defined in the section on notation:
>
>    The notation "Q<n>", where n is an integer, denotes the number of binary digits to the right of the decimal point in a fixed-point number.

Sorry - I missed that.

> > s3.4: I am not keen on duplicating normative requirements in this way (double maintenance issue). It would be better to put explicit numbered requirements in the sections above and reference the resulting numbers here.
>
> A checklist style summary is quite useful for implementors, and worth the maintenance burden for a document that is (hopefully) going to be published once and read many times. The list intentionally does not include any RFC 2119 keywords, to avoid any conflict should there (accidentally) be room to interpret the re-statement any differently from the original statement. Numbering the requirements and referencing the numbers is still a good idea, but it should be possible to read the list without flipping back and forth to the previous sections.

Good solution!

> > s4.1.1: This is really a global point. This section refers to entdec.c. Presumably (since we haven't reached the code yet and it is still compressed) there is some file structure. I don't think this has been said above. It would be good to provide a list of the file components (i.e., sectional structure of the code) at the start, maybe even giving line number positions within the decompressed code.
>
> In cases where it was deemed necessary (e.g. large functions), there are indeed line numbers in references to the code. As for a list of files, we did not think it was useful because one already needs to decompress the code to see the references.

OK. We'll have to live with this situation. Having looked at the code, I think it is a considerable pity that it isn't Doxygen commented (or some such) throughout so that the whole system can be viewed as a Doxygen tree. I can smell roasted programmer from here... :-)

> > s4.2.7.5.1, para 1: s/This indexes an element in a coarse codebook, selects the PDFs for the second stage of the VQ/This indexes an element in a coarse codebook that selects the PDFs for the second stage of the VQ/
>
> The text as written is correct. The index I1 is what selects the PDFs for the second stage, not the vector from the coarse codebook in Tables 23 and 24. I.e., it's saying, "This does A, B, and C."

OK. I think it might be clearer if the three things were separated out as a list. Now you point it out I can read it correctly, but it triggered minor confusion - worth turning the three things into bullet points.

NEW: s4.3: Add reference for Bark scale:
Zwicker, E. (1961), "Subdivision of the audible frequency range into critical bands," The Journal of the Acoustical Society of America, 33, Feb., 1961.

Generally the new intro to s4.3 helps *a lot*.

> > s4.3.2.1: Avoid the equation picture splitting across page boundaries. In the current version it is unclear what the denominator is. (Use the needLines processing directive in xml2rfc.) Same applies to the equation below Table 57 in s4.3.4.3.
>
> It's not quite clear how to use needLines without undesirable side-effects. Hopefully this is something the RFC editor should be able to handle.

Indeed, but I don't know what undesirable side-effects there are. AFAIK (and in my own usage) it just ensures there are n lines available on the current page at some point in the text - e.g., putting <?rfc needLines="8" ?> just before the equation - and forces a page throw if not.

> > s4.3.3 (was specified as s4.3.2.3, which was wrong): Paragraph on decoding band boosts: Might be improved by using equations rather than the wordy descriptions used at present.

Any thoughts on this one? A rough sketch of what I had in mind follows.
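Roughly this (my paraphrase, written as C-flavoured pseudo-code; the identifiers and details are from my possibly faulty memory of the reference decoder, so the draft text and code remain authoritative; all bit counts are in 1/8 bit units):

    #include "entdec.h"  /* range decoder from the reference code */
    #define BITRES 3     /* bit counts are in 1/8 bit units */
    static int imin(int a, int b) { return a < b ? a : b; }
    static int imax(int a, int b) { return a > b ? a : b; }

    /* Sketch of band-boost decoding; total_bits is taken by value
     * here purely to keep the illustration self-contained. */
    static void decode_band_boosts(ec_dec *dec, const int *eBands,
                                   const int *cap, int *offsets,
                                   int start, int end, int C, int LM,
                                   int total_bits)
    {
        int i, dynalloc_logp = 6;   /* first boost flag costs 6 bits */
        int tell = ec_tell_frac(dec);
        for (i = start; i < end; i++) {
            int width = C * (eBands[i+1] - eBands[i]) << LM;
            /* one quantum of boost: 6 bits of allocation, clamped
               between 1/8 bit/sample and 1 bit/sample */
            int quanta = imin(width << BITRES,
                              imax(6 << BITRES, width));
            int loop_logp = dynalloc_logp, boost = 0;
            while (tell + (loop_logp << BITRES) < total_bits
                   && boost < cap[i]) {
                if (!ec_dec_bit_logp(dec, loop_logp))
                    break;          /* a zero flag ends this band */
                boost += quanta;
                total_bits -= quanta;
                loop_logp = 1;      /* later boosts cost only 1 bit */
                tell = ec_tell_frac(dec);
            }
            offsets[i] = boost;
            if (boost > 0)          /* boosting a band makes the next
                                       band's first flag cheaper,
                                       down to a floor of 2 bits */
                dynalloc_logp = imax(2, dynalloc_logp - 1);
        }
    }

Even if I have some detail wrong, expressing the quanta clamp and the 6-bit/1-bit flag costs this way seems less ambiguous than the current prose.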
> > (global) s4.3.2.3, para above Table 56: s/iff/if and only if/
>
> Fixed.

> > s4.3.3: <<snip>>.
>
> Added an explanation of band 17

I don't think this happened.

> > s6.1: This section is perhaps a little 'triumphalist' for the reference implementation (this may of course be justified!). The quality metric is a 'relative quality metric' and presumably, if someone does a *really* good job of coding, it is entirely possible that the new algorithms might get better than 100 on the quality score (i.e., provide a better answer than the reference implementation).
>
> Conformance with the specification is defined by a faithful reproduction of the specified decoder behavior (see RFC 6569 s4.3). By specifying the decoder, future encoders have the freedom to have improved quality with the confidence that they know what output a conforming decoder will produce.
>
> The tests in s6.1 are measuring the quality of the decoded signal against the reference decoded signal, not against the original audio, so greater than 100% wouldn't be possible or meaningful. The test signals have been specially prepared to thoroughly test a decoder implementation, and they sacrifice encoded quality in order to rapidly exercise the corner cases.

You might want to add this comment to the text. As regards the 100 limit, I was sort of assuming that the quality figure was derived from improving on the 48dB SNR figure. Probably a misreading. As a matter of interest, would one be able to tell from the tests that a putative new implementation really was 'better' in some sense? Or is this now almost a subjective matter that can only be determined by extensive listening tests? I got the impression we may be converging on the diminishing returns point.

/Elwyn
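P.S. For my own clarity on the "measured against the reference decoded signal" point, this is the sort of comparison I now understand the test to be making (my own toy illustration, not the actual test tool, which I gather is more sophisticated):

    /* Toy illustration: SNR of a candidate decoder's output measured
     * against the REFERENCE decoder's output (not the original
     * audio), which is why a score can approach but never exceed the
     * "identical output" ceiling. */
    #include <math.h>

    static double snr_db(const float *ref, const float *test, int n)
    {
        double sig = 0.0, err = 0.0;
        int i;
        for (i = 0; i < n; i++) {
            double d = (double)ref[i] - test[i];
            sig += (double)ref[i] * ref[i];
            err += d * d;
        }
        if (err <= 0.0)
            return INFINITY;  /* bit-exact match with the reference */
        return 10.0 * log10(sig / err);
    }

Since the error term is the distance from the reference decoder's output, a bit-exact decoder pins the score at the top of the scale and nothing can do "better" by this measure - which I suppose answers my own question: anything beyond conformance really is a matter for listening tests.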