One addition to Harald's comments...

--On Sunday, 23 March, 2008 20:43 +0100 Harald Tveit Alvestrand <harald@xxxxxxxxxxxxx> wrote:

>> Because internationalized local parts may cause email
>> addresses to be longer, processes which parse, store, or
>> handle email addresses or local parts must take extra care
>> not to overflow buffers, truncate addresses, exceed storage
>> allotments, or, when comparing, fail to use the entire
>> length.
>>
>> technical: this is great advice, but I don't understand how
>> UTF-8 changes the situation. If you aren't changing the
>> 998-octet requirement, software that breaks for UTF-8 would
>> also break for ASCII headers with the same octet
>> length.

> If someone uses another representation internally (for
> instance UTF-16), and has a 998-character buffer, that will
> sometimes fit into 998 octets of UTF-8, and sometimes not.
> The same goes in the other direction.... I'm sure others will
> think of other cases.

Spencer,

I'm a little confused by your even asking the question, so let me try for a slightly different answer in case you were asking a different question.

Two of the advantages we have with ASCII (and the closely-related ISO 8859 code character sets) are that every character is the same length as every other character and that every character is exactly one octet. As a consequence of that relationship, we have clutter in many places in the RFC space, and probably in implementations, in which "character" and "octet" are used interchangeably when referring to lengths. I note that you carefully, and correctly, said "same octet length" above and not the "same length in characters". But RFC 821 talks about lengths in characters and, to my astonishment and shame, so does section 4.5.3.1 of rfc2821bis (I've just flagged that to the relevant ADs and will try to get it fixed before the thing is published). But that is the definitional problem, and perhaps the new risk, in a nutshell.
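
As a rough illustration of the character/octet distinction above, here is a small Python sketch. It is not taken from any of the drafts under discussion: the sample strings and the names `lengths`, `fits_in_octets`, and `LIMIT` are assumptions made up for the example, and Python's `len()` on a string counts code points, which is only one possible meaning of "character".

```python
# Illustrative sketch only; names and sample data are hypothetical.
LIMIT = 998  # the 998-octet requirement mentioned in the quoted review


def lengths(text: str) -> dict:
    """Return the size of text in code points and in encoded octets."""
    return {
        "characters": len(text),  # code points, not octets
        "utf8_octets": len(text.encode("utf-8")),
        # "utf-16-le" avoids the 2-octet byte order mark that "utf-16" adds
        "utf16_octets": len(text.encode("utf-16-le")),
    }


def fits_in_octets(text: str, limit: int = LIMIT, encoding: str = "utf-8") -> bool:
    """Check the encoded size, rather than the character count, against limit."""
    return len(text.encode(encoding)) <= limit


ascii_line = "a" * 998          # one octet per character in UTF-8
accented_line = "\u00e9" * 998  # U+00E9 takes two octets in UTF-8

for label, line in (("ascii", ascii_line), ("accented", accented_line)):
    print(label, lengths(line), fits_in_octets(line))
```

Both sample lines are 998 characters long, but only the ASCII one fits in 998 octets of UTF-8, which is the case Harald describes. Code that sized its buffer from the character count would truncate or overflow on the second line.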

Now, if one goes to UTF-32, the characters are all the same length, but four octets instead of one. An implementation that counts characters, but allocates buffers in octets (assuming that they are the same thing) is obviously headed for trouble, but computing the length from the character count or vice versa is pretty straightforward. UTF-8 (and technically UTF-16) break both of those original assumptions. The characters may be more than one octet long and one cannot compute the number of octets from the number of characters (UTF-8 is aggressively variable-length; UTF-16 occupies either two or four octets per character depending on whether the character has a high enough code point that surrogate pairs are needed).

>...
>> 9.2. Informative References
>>
>>
>> [Hoffman-utf8-headers]
>> Hoffman, P., "SMTP Service Extensions or
>> Transmission of Headers in UTF-8 Encoding",
>> draft-hoffman-utf8headers-00.txt (work in
>> progress), December 2003.
>>
>> Technical: I know this is how we refer to Internet Drafts,
>> but "2003" isn't
>> "work in progress". You might s/work in progress/expired
>> Internet Draft/, or
>> (probably better) simply move the rest of the full citation
>> to the Acknowledgements section - it didn't seem like you
>> really expected anyone to
>> actually refer to this reference, anyway :-)

> It's a part of the history, and we can probably safely lose it.

It is referenced, and its historical role mentioned, in RFC 4952, so it can almost certainly be dropped from utf8headers.

On the more general subject, I've tried raising the issue of these documents that are referenced for historical reasons and hence, IMO, should not say "work in progress" and should include the exact file name so that people can find them if interested. I've gotten nowhere, so it is someone else's turn. What is really needed, I think, is a policy on these sorts of things, with corresponding modifications to tools like xml2rfc, etc.
I don't think hiding the references in inline text is the right answer, but that is just my opinion.

best,
john