Executive summary for those who don't like my long and detailed messages: In order to accommodate the extended mailboxes permitted by RFC 6530 etc., this spec allows non-ASCII addresses. However, the use of almost-arbitrary Unicode strings (in UTF-8) introduces issues of ambiguity in lower casing and normalization, issues that are not addressed by either the document or Edwin's proposal. The document itself appears to me to be unacceptable, even for publication as a recommended experiment, unless those issues are addressed. Details inline.

--On Sunday, February 14, 2016 3:04 PM +0000 E Taylor <hagfish@xxxxxxxxxxxx> wrote:

>...
> On the topic of lowercasing, it seems that there are still
> differing opinions, and this is potentially reflected by the
> implementations that now exist. For example, the author has
> pointed out some examples of clients which force lowercasing,
> and I've checked that Mail.de (which supports adding OpenPGP
> key information to the DNS[0]) replaces uppercase letters with
> their lowercase equivalent when choosing a username at sign-up
> (so presumably store only an entry for the lowercase version
> in the DNS). The online tester at openpgpkey.info[1] (run by
> Mail.de) also forces email addresses to lowercase before
> searching.
>...
> My suggestion for a consensus, therefore, is that the draft
> recommend that clients attempt the case sensitive lookup
> first, and then fall back to a lowercase lookup if that fails
> (ideally informing the user that it has done this). For the
> rare situation where a user specifies an email address with
> uppercase characters in, this will result in an extra query,
> but in the rarer situation that the lowercase version doesn't
> exist (or represents a different user) then this provides a
> worthwhile security benefit. Moreover, I think that if the
> draft doesn't mention the possibility of lowercasing, then
> client implementers will either force lowercasing out of
> habit, or make their software search for both just to be sure,
> as I have outlined above.

Temporarily and for purposes of discussion, assume I agree with the above as far as it goes (see below). Given that, what do you, and the systems you have tested, propose to do about addresses that contain non-ASCII characters in the local-part (explicitly allowed by the present spec)? Noting that lowercasing [1] and case folding are different operations that produce different results, and that both are language-sensitive in a number of cases (a short illustration is appended at the end of this message), what specifically do you think the spec should recommend? Also, do you think it is acceptable to publish this document with _any_ suggestions about lower-casing or "try this, then try something else" search without at least an "Internationalization Considerations" section that would discuss the issues [1] and/or some more specific recommendation than "try lowercase" (more on that, with a different problem case, below)?

Dropping that assumption of agreement for discussion, I personally believe that this document could be acceptable _as an Experimental spec_ with any of the following three models, but not without at least one of them:

(i) The present "MUST not try to guess" text.

(ii) A recommendation about lowercasing along the lines you have outlined but with a clear discussion of i18n issues and how to handle them [2].
(iii) A clear statement that the experiment is just an experiment and that, for the purposes of the experiment, addresses that contain non-ASCII characters in the local part are not acceptable (note that this would also require pulling the UTF-8 discussion out of Section 3 and dropping the references to RFC 6530 and friends).

To be sure I understand what you are suggesting, and to save a separate note, neither the EAI specs (RFC 6530 et al.) nor the text in Section 3 of the current document specify that local-part strings are required to be normalized. For such strings, even when they are entirely in lower case when presented by the user, there may be multiple different forms; e.g., U+0066 U+006F U+0308 U+006F and U+0066 U+00F6 U+006F are perfectly good (and SMTPUTF8-valid) representations of the string "föo" (this, too, is illustrated in a fragment appended below). Using the same theory as your lower case approach, would you recommend trying first one of those and then the other [3]?

The more I think about it, the more I'm convinced that the specification and allowance for UTF-8 [4] in the first bullet of Section 3 is unacceptable without either text there that much more carefully describes (and specifies what to do about) these cases or an "Internationalization Considerations" section that provides the same information. I suggest that anyone contemplating writing such text carefully study (not just reference) Section 10.1 of RFC 6530.

Of course, simply excluding non-ASCII local-parts from the experiment, as suggested in (iii) above, would be an alternative. I have mixed feelings about whether it would be an acceptable one for an experiment. I am quite sure it would not be acceptable for a standards-track document when the EAI work and/or the IETF commitment to diversity are considered.

   john

[1] For the benefit of those who are blissfully unaware of these problems, and sticking to Latin script as an example, consider the lower case forms of "A". Define "lower case form" as characters that can produce "A" under at least some circumstances. The examples toward the right are less likely than those to the left, but all are lower case forms for "A":

    a à á â ã ä

If that example doesn't cause either an insight or an adequately bad headache, consider the lower case possibilities for "I", including whether they are dotted or dotless and that either the dotted or dotless forms can appear in combination with some of the diacritical markings above.

[2] I note that, historically, the DNS community has been very reluctant to accept techniques that depend on or imply multiple lookups for a single perceived object and, separately, "guess at this, try it, and, if that does not work, guess at something else" approaches. Unless those concerns have disappeared, the potential for combinatorial explosion when lower-casing characters that may lie outside the ASCII repertoire is truly impressive.

[3] In case it isn't clear: while "föoēy" has only one NFC form and one NFD form as a string, if unnormalized forms are allowed it would imply the potential for up to four lookups, not two.

[4] Nit: "encoded in UTF-8" is not actually a sufficient statement. The correct statement would be similar to "Unicode encoded in UTF-8".
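For anyone who would like to see the case-mapping half of this concretely, here is a short, purely illustrative Python fragment (mine alone; nothing in it comes from the draft). It shows that lowercasing and case folding already disagree for individual characters, before any language-specific tailoring such as the Turkish dotted/dotless "i" is even considered:

    # Lowercasing (str.lower) and case folding (str.casefold) are
    # distinct Unicode operations and do not always agree.
    print("Straße".lower())      # -> "straße"
    print("Straße".casefold())   # -> "strasse"

    # U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE) lowercases to TWO
    # code points: "i" followed by U+0307 COMBINING DOT ABOVE.
    print([hex(ord(c)) for c in "\u0130".lower()])   # -> ['0x69', '0x307']

    # Plain ASCII "I" lowercases to "i" here, although Turkish
    # orthography expects dotless "ı" (U+0131), a language-sensitive
    # mapping these locale-independent functions cannot apply.
    print("I".lower())           # -> "i"

Any of those answers might be the "right" lowercase form depending on the language the user had in mind, which is exactly why "just try lowercase" is underspecified.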
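The normalization half can be made just as concrete. The fragment below (illustrative only; nothing in it is specified by the draft) shows that the two code point sequences given above for "föo" are different strings, and different UTF-8 octet sequences, until a client explicitly normalizes them, something the current text nowhere requires:

    import unicodedata

    # Two SMTPUTF8-valid spellings of the local-part "föo":
    composed   = "\u0066\u00f6\u006f"          # f, ö (precomposed), o
    decomposed = "\u0066\u006f\u0308\u006f"    # f, o, combining diaeresis, o

    print(composed == decomposed)              # False
    print(composed.encode("utf-8"))            # b'f\xc3\xb6o'
    print(decomposed.encode("utf-8"))          # b'fo\xcc\x88o'

    # They compare equal only after explicit normalization:
    print(unicodedata.normalize("NFC", decomposed) == composed)   # True

    # With two independently composable characters, as in "föoēy",
    # the unnormalized forms already number 2 x 2 = 4, hence the
    # "up to four lookups" in note [3] above.

A client that tried every case variant of every unnormalized form would multiply those lookups further, which is the combinatorial explosion note [2] worries about.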