On 02/15/2016 05:46 PM, John C Klensin wrote:
>
> --On Monday, February 15, 2016 4:33 PM +0100 Harald Alvestrand
> <harald@xxxxxxxxxxxxx> wrote:
>
>> Note that the user understandability of "only lowercase if
>> it's all ASCII" is zero.
>>
>> If ARNE matches arne, but BLÅBÆR doesn't match blåbær, any
>> user from an extended-ASCII country (which is *all* Latin
>> script using countries, even though the non-ASCII variants in
>> English are rarely used) will be mighty confused.
> Indeed.
>
> However, that is exactly the decision we made with IDNA (both
> the "2003" and "2008" versions) and, as there, there may be
> justification for really strong advice for treating email
> addresses (both local and domain parts) as lower-case only.
>
> Harald, I am confident you know all of this, but others may
> not...  The idea of requiring that mailbox names be treated as
> all lower case was discussed during the work leading up to RFC
> 1123 and again in DRUMS (pre-2821).  The community reached what
> appeared to me as fairly strong consensus that we just couldn't
> do it.  Part of the problem was that, at the time 821 was
> written (and maybe as late as the time of DRUMS) there were
> still systems around that operated upper-case-only and had only
> the vaguest idea what lower case was.  Another part was that
> Unix (and Multics) and some of their successors were very
> case-sensitive in general: "foo" and "Foo" and "foO" were
> unambiguously three different names.
>
> Because of that history and consensus, the strong suggestions in
> 5321 are about as far as one is going to get as far as
> restrictions/ recommendations on the mailbox names themselves
> and the "don't try to guess" rule probably isn't going anywhere.
>
> In retrospect, we dodged a bullet because, for mailbox local
> parts, ARNE does not, in terms of anything a sender is allowed
> to predict, match arne.  That BLÅBÆR doesn't match blåbær
> may still be a surprise to some, but it is not more of a
> surprise.
>
> From that perspective, the problem facing DANE is that either
> the basic "if they are not identical, they don't match" rule is
> applied or there is a need to invent rules different from the
> email rules and that de facto modify the email rules by
> restricting the syntax of a mailbox if there is any possibility
> a DANE DNS record will be used with it.  Nothing I'm aware of
> (other than probably the WG Charter) prohibits DANE from
> proposing an update to 5321 and 6530ff, but the history (and
> probable side-effects that no one has tried to analyze) predicts
> that the idea won't easily get community consensus.

Yep. I'm sympathetic to the quandary of DANE.

Our strong advice was basically "if you (the recipient's mailbox
manager) depend on case differences to tell mailboxes apart, you are a
fool; if you (the sender) depend on case not mattering, you are a bigger
fool."

DANE is an algorithm for the *sender* to look up information about the
*recipient*'s mailbox in the DNS, which means that the whole experiment
depends on the sender (who has no idea of what or where the recipient
is) being able to construct exactly the same hash that is generated by
the recipient - incompatible with the two pieces of advice I have
abstracted out above.

A possible way out (strawman!!!!) would be to say:

- All recipient participants in the experiment MUST agree to ignore case
  differences in mailbox names. This has no effect on non-participants,
  so we can possibly get consensus for that.
- All code in the experiment MUST use a particular algorithm to generate
  the LHS lookup key (I would suggest toLowerCase(NFC(string)) in the C
  locale, off the top of my head - but one could also argue for
  caseFold(NFC(string)) or NFC(caseFold(string)) - and the people
  choosing had better know the difference)

The case operations referenced are in Unicode 8.0.0 section 5.18 - I
*strongly* recommend actually reading that chapter, and not making the
(invalid) assumption that calling toLower() in some random library will
actually do something compatible with this.

I don't think anything less precise has a chance of being interoperable.

BTW, this text from the draft is obviously not saying what it intended
to say:

   o  The user name (the "left-hand side" of the email address, called
      the "local-part" in the mail message format definition [RFC5322]
      and the local-part in the specification for internationalized
      email [RFC6530]) should already be encoded in UTF-8 (or its
      subset ASCII).  If it is written in another encoding it should be
      converted to UTF-8 and then hashed using the SHA2-256 [RFC5754]
      algorithm, with the hash truncated to 28 octets and represented
      in its hexadecimal representation, to become the left-most label
      in the prepared domain name.  Truncation comes from the
      right-most octets.  This does not include the at symbol ("@")
      that separates the left and right sides of the email address.

As written, it states that hashing is only applied to strings that are
not originally in UTF-8 - but the "for example" text below makes it
clear that this is not intended. Replacing "and then" with ". The string
is then" would fix the problem.
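
To make the choice in the strawman concrete, here is a rough Python
sketch of the three candidate lookup-key algorithms and of the hashing
step as I read the draft text. The function names are made up for
illustration. It assumes that Python's str.lower() and str.casefold()
(which are locale-independent) are acceptable stand-ins for the Unicode
lowercase and case-folding operations, and that "truncation comes from
the right-most octets" means keeping the first 28 octets of the digest.
None of this is what the draft specifies.

```python
import hashlib
import unicodedata


def lookup_key_lower(local_part: str) -> str:
    """Candidate 1: toLowerCase(NFC(string)), locale-independent."""
    return unicodedata.normalize("NFC", local_part).lower()


def lookup_key_casefold(local_part: str) -> str:
    """Candidate 2: caseFold(NFC(string))."""
    return unicodedata.normalize("NFC", local_part).casefold()


def lookup_key_nfc_of_casefold(local_part: str) -> str:
    """Candidate 3: NFC(caseFold(string))."""
    return unicodedata.normalize("NFC", local_part.casefold())


def hashed_label(key: str) -> str:
    """SHA2-256 over the UTF-8 octets of the key, cut to 28 octets, as hex.

    Assumption: the right-most 4 octets of the 32-octet digest are the
    ones dropped, so the first 28 are kept.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return digest[:28].hex()


if __name__ == "__main__":
    for name in ("ARNE", "BLÅBÆR", "Straße", "STRASSE"):
        print(name,
              lookup_key_lower(name),
              lookup_key_casefold(name),
              hashed_label(lookup_key_lower(name)))
```

With lowercasing, "Straße" and "STRASSE" yield different keys (and so
different hashes); with case folding they yield the same one. That is
the kind of difference the people choosing need to have understood.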