Maybe some analysis can structure the debate. Lingual digital relations are supported through three layers: (1) computer interoperability, (2) human interintelligibility, (3) human interface.
- at layer 2, relations are brain-to-brain and support interintelligibility in the use of written languages. The scripts of these languages are supported through the Unicode system and are to be tagged for computer recognition.
- at layer 1, relations are end-to-end and support interoperability in the use of protocols with various digital, hexadecimal, 7-bit or 8-bit codings and parameter systems registered with the IANA.
One of these protocols is the DNS, which uses a "-.0Z" numbering plan within the 7-bit area, simplifying its human use by reference to the universally used Arabic digits 0-9 and the internationally used Roman characters A-Z. This also permits easy bridging with other plans restricted to 0-9, 0-B, or 0-F, and direct support of numeric telephone names. It has a direct, total or partial, mnemonic capacity for users of English, Latin, or Latin-scripted languages.
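As a minimal sketch (the function name and the regular expression are mine, not anything defined in the DNS specifications), this is roughly what that restricted "letter, digit, hyphen" label repertoire amounts to in code:

    import re

    # Classic LDH ("letter, digit, hyphen") rule for a single DNS label:
    # A-Z/a-z, 0-9 and "-", no leading or trailing hyphen, at most 63 octets.
    LDH_LABEL = re.compile(r"^[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?$")

    def is_ldh_label(label: str) -> bool:
        return bool(LDH_LABEL.match(label))

    print(is_ldh_label("example"))       # True
    print(is_ldh_label("0800-3546937"))  # True: all-numeric telephone names fit too
    print(is_ldh_label("-bad"))          # False: leading hyphen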
Internationalization, at the end-to-end layer (punycode in the DNS case; nothing equivalent is defined for the email left-hand side), supports multilingualization at the brain-to-brain layer and gives the same mnemonic capacity to people using other languages. Vernacularization is the process that permits human interfaces and application processes to take full advantage of multilingualization, in usage cases ranging from language menus or combo boxes to full IRI support.
A common problem is to overlook the multilingualization layer because it is transparent in English (an ASCII string is not affected by punycode). This layer violation creates the security problem under discussion: VeriSign's disregard of the ICANN requirement (at the multilingualization layer) that IDNs be registered using codes from a single language table.
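A small demonstration of that transparency, using Python's built-in IDNA codec: an all-ASCII label passes through punycode unchanged, which is why English-only users never see the multilingualization layer, while a non-ASCII label becomes an xn-- ACE label:

    print("example".encode("idna"))         # b'example'       -- unchanged
    print("bücher".encode("idna"))          # b'xn--bcher-kva' -- ASCII-compatible form
    print(b"xn--bcher-kva".decode("idna"))  # 'bücher'         -- and it round-trips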
This common oversight of the multilingualization layer is aggravated by the proposal of a single internationalization-layer langtag (independent of the IDN language tables) to do a job where it does not belong: describing all the vernacular views of a language.
IMHO, a correct generalized approach to multilingualism on the Internet consists in structurally acknowledging the three layers, so that users can be told clearly in which exact context they are. This should be based upon a five-constructor language tag (lang5tag), sketched in code after the list below:
- three internationalization-layer descriptors, the ones used to register the IDN tables: the language, the script, and the domain of use. RFC 3066 defines the use of ISO 639 codes for the language; RFC 3066bis proposes the codes of ISO 3166 for national domains and ISO 15924 for the scripts. This is a basically correct proposition; there are more general and more precise sources if needed.
- a multilingualization-layer descriptor: the authoritative reference for the considered view of the language.
- a vernacularization-layer descriptor: the style, that is, the environment of the considered application (protocol, administrative, familial, formal, commercial, SMS, adult, etc.).
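Since no syntax is defined here, the following is a purely illustrative sketch of such a five-constructor tag; the field names, the separator, and the example subtag values (including the "oqlf" authority) are all assumptions of mine:

    from dataclasses import dataclass

    @dataclass
    class Lang5Tag:
        language: str   # ISO 639 code (internationalization layer)
        script: str     # ISO 15924 code (internationalization layer)
        domain: str     # ISO 3166 or other domain-of-use code (internationalization layer)
        authority: str  # authoritative reference for the view (multilingualization layer)
        style: str      # application environment (vernacularization layer)

        def __str__(self) -> str:
            return "-".join([self.language, self.script, self.domain,
                             self.authority, self.style])

    # e.g. French, Latin script, Canadian domain of use, a hypothetical
    # authority, formal style:
    print(Lang5Tag("fr", "Latn", "CA", "oqlf", "formal"))  # fr-Latn-CA-oqlf-formal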
This lang5tag should be part of the IRI description, and supported by an icon shown in the browser bar. An example: if you send a mail that your boss's secretary will print and file in his daily folder, you may want him to know you sent it from a Chinese mobile rather than from your English text processor. An ISO 7000-conformant glyph system can probably be designed.
jfc
On 15:00 12/02/2005, John C Klensin said:
--On Friday, 11 February, 2005 21:02 -0500 Bruce Lilly <blilly@xxxxxxxxx> wrote:
> While I do not dispute that some mobile devices might use some subset of some version of Unicode for text in some languages, my point was, in response to John Klensin's "Until and unless every one of us has a keyboard that permits easy input of every Unicode character", that not only do I not expect to have a keyboard permitting *easy* entry (no, that doesn't mean "Graffiti" or "Decuma") of *every* Unicode character any time soon, I don't expect it *ever*, because the Unicode code space is expanding (in contradiction to the original Unicode Design Principles) faster than the available memory space on low-power, compact, mobile devices.
Bruce (and others),
You can argue and pick at this interminably, but I think you are missing the key point.
There is, IMO, an extremely strong argument for saying
"Look, DNS names, and DNs as used in X.509 certs, are ultimately protocol identifiers. Safe and stable operation of the Internet requires that protocol identifiers be written in a small, restricted, generally recognized, and easily distinguishable, set of characters. And everyone who has studied which characters to use when the principles of "protocol identifiers" and " statements are applied, including our very internationalization-conscious friends at the ITU, have concluded that the right characters are a subset of those in the Roman-based script family. The subset seem to always be "without the embellishments of diacritical marks or other embellishments". It is almost always defined in terms of case-independent matching rules or in terms of only a single case being permitted -- more often upper historically, although there are some substantive arguments for lower. "
The choice of Roman characters is ultimately based on the observation that, while there are several _languages_ that are more widespread than English, nothing in the above says anything about English. Those Roman-based characters are, for one reason or another, used, either as a primary or a secondary script, by more languages and people than everything else in the world put together. That contributes significantly to "recognizable", which is an important criterion.
And neither the "protocol parameter" argument, nor the argument that more characters would lead to more opportunities for confusion, came as a surprise to the IETF community within the last week or two. Both arguments were raised, passionately and at great length, when the IDN effort was first coming together. They were raised on the IETF list, on more than one WG list, in BOFs, etc.
There is a second argument that can be made with equal strength. People like to write their names correctly. Inability to do that is a profound source of irritation (at least) and was important enough, even in the 60s, to influence the way characters are handled in important operating systems to this day. More generally, people prefer that the identifiers they pick have mnemonic value to them, and that means the ability to pick those identifiers based on their languages and scripts. Please note that argument applies at the geek interface level; we don't need to get up to the user interface one to make it. When we do get to the user interface and start worrying about non-expert would-be users of the Internet, we immediately encounter some very passionate, and almost certainly correct, arguments that users should be able to deal with, and navigate, the Internet and do so completely in their own languages and scripts.
The problems with that argument, including opportunities for deliberate or accidental confusion among similar-looking characters, also come as no surprise to the IETF. Like the "protocol parameter" position, they were discussed openly and at great length, with examples, many years ago.
With both of those arguments in hand, and with the problems with each at least moderately well understood, the IETF (or at least everyone who could be persuaded to pay attention) made a decision. That decision, made years ago and under considerable marketplace pressure, was that, for the particular set of issue areas that included DNS names, the second set of arguments -- that accessibility in "native scripts" (and Unicode in particular) was more important than the "protocol identifier" argument -- was the dominant one, and that we needed to do this. By implication at least, we decided that we would need to accept and understand the problems that decision caused and deal with them.
There was another group of questions, which is the more complicated piece of the issue. The obvious way to get the right functionality is not necessarily the best one. There is a nasty tradeoff between techniques that can, at least in theory, be deployed quickly and ones that are likely to take longer but might be more satisfactory in the long term. There is another nasty tradeoff between making something work well for the people who know that they need it and are willing to make an investment in conversion and upgrading of systems to get it, versus making it work reasonably well (and perhaps more quickly) for everyone.
Again, the IETF made decisions on those points. My personal view is that some of those decisions were not especially well-informed and may even have been wrong, but they were decisions made in the community and made after the dissenting views were strongly expressed.
So, today, we've got IDNs and IDNA. Even if one believes that the _only_ reason for standardizing them is to provide a common, interoperable, way of doing something that people will clearly do somehow, the standards seem justified. (For the record, I do not subscribe to the "that is the only reason for a standard" position in this case.) I see no way to go back, even if we wanted to, and reestablish the "protocol parameter" argument for the DNS.
So we are down to some serious and important questions -- but, again, ones that are neither new nor surprising. In particular, since you and others have picked up bits from my earlier notes and interpreted them (I'm sure unintentionally) differently from what I intended:
(i) The observation about YAH00 versus yah00 wasn't intended to say that a lower case test would solve very many problems. It was only to point out that the particular YAH00 example wasn't a particularly good one, since it could be detected by the most trivial of tests. I agree that test is not likely to be effective against a determined attacker or more clever examples.
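For illustration, the "most trivial of tests" amounts to this (a sketch of mine): after lowercasing, a digit zero still reads '0' while the letter O becomes 'o', so the spoof is easier to spot by eyeball.

    # Upper-case O and digit 0 look alike in many fonts; after
    # str.lower() the digit still reads '0' next to lower-case letters.
    for label in ("YAHOO", "YAH00"):
        print(label, "->", label.lower())
    # YAHOO -> yahoo
    # YAH00 -> yah00   (the zeros now stand out)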
(ii) I have never argued that the "one label one script" requirement that Mark Davis and others have suggested is without value. My comment was only that a requirement of that type was going to be a little harder to apply --in many cases and consistently-- than a casual reader might assume. None of this is easy. Life is hard.
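As a rough sketch of why a "one label, one script" check is less mechanical than it looks: Python's standard library exposes no Unicode script property, so this quick approximation (mine) infers a script from the first word of each character's Unicode name; the well-known paypal homograph shows what it catches.

    import unicodedata

    # Approximate each alphabetic character's script from its Unicode
    # name (e.g. 'LATIN SMALL LETTER P', 'CYRILLIC SMALL LETTER A').
    def apparent_scripts(label: str) -> set:
        return {unicodedata.name(ch).split()[0] for ch in label if ch.isalpha()}

    print(apparent_scripts("paypal"))   # {'LATIN'}
    print(apparent_scripts("pаypal"))   # {'LATIN', 'CYRILLIC'} in some order:
                                        # the second character is Cyrillic U+0430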
(iii) The observation about "...easy input of every Unicode character" is not, in any respect, an attempt to get us back to protocol identifiers. It was, instead, about one of the more subtle questions associated with the IDNA story. IDNA's most passionate advocates are convinced that, once a sufficient deployment level is achieved, no one will need to look at the internal, "punycode" form of IDNs, but will see only the "native character" form. Others of us are convinced that user-visible punycode will be around forever, just as user-visible URLs will be. We believe that will be driven partially by security concerns (I can more accurately compare two punycode strings by eyeball than I can a pair of arbitrary "native character" strings). We believe that the difficulties you might have reading an IRI that contains an unfamiliar script out of a printed article or sign and typing it into a computer will cause you to wish that the punycode representation were readily available, because "recognize the character and then figure out how to key it in" is likely to be an insurmountable pair of problems. The issue isn't one of the expansion of Unicode or how many keystrokes are needed: if you can identify the character, any BMP Unicode character can be keyed in with a little over four keystrokes, and non-BMP characters don't take many more (the "little" is determined by whatever you need to do to indicate that characters are being specified by offset). The issue is recognizing the character accurately in the first place. The cell phone story is equally unimportant, because the first step in that story is identifying the right language so as to permit you to pick up the right phone (or switch it into the right state). Language identification may or may not be harder than character identification, but it isn't likely to be easy in the general case. Without language identification, you are back to character identification and four- (or five- or six-) digit offsets.
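Concretely (my illustration of the arithmetic above, not anything from the IDNA documents): once the code point is known, entering the character is a mechanical four-hex-digit exercise within the BMP, and five or six digits beyond it.

    # A BMP character is at most four hex digits of offset; characters
    # beyond the BMP take five or six. The hard part, as argued above,
    # is knowing the offset in the first place.
    print(chr(int("0430", 16)))   # 'а'  (CYRILLIC SMALL LETTER A, 4 digits)
    print(chr(int("20000", 16)))  # '𠀀' (a CJK Extension B character, 5 digits)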
(iv) The TLD managers worldwide are not crying "please protect us from IDNs", and this latest "discovery" is unlikely to change that. What they are saying is "we want and need to implement IDNs, please help us understand how to do that safely". The answer to that question doesn't require "regulation" from on high. It does require getting and sharing a much more subtle understanding of the issues, options, and tools than we have so far been successful in communicating. IMO, the IETF should be putting energy into those issues and tools -- and into alternatives to the use of DNS names (with IDNA) when that is appropriate. But efforts to move in those directions have gotten zero traction. _That_ is, IMO, our problem, not whether we can turn back the clock and make a "protocol parameter" decision (or turn it back even further and reduce the number of scripts and characters in the world by several orders of magnitude).
This isn't easy. It is never going to be easy. It poses opportunities for various kinds of nasty behavior that are harder to detect and defeat than in a hostname/LDH-only world. The easiest way to get ourselves into trouble is probably to pretend it is easy and to ignore the hard, risky, or edge cases. We need to learn to cope: wishing for an easier and more homogeneous world, or for easier times generally, or wishing that an irreversible decision be reversed, won't get us much of anywhere, no matter how passionately those wishes are made. And, like it or not, we are at least as much at risk of fragmenting the Internet by appearing to say "no" to some languages or scripts as we are from confusion among characters in well-thought-out internationalization efforts.
john
_______________________________________________
Ietf@xxxxxxxx
https://www1.ietf.org/mailman/listinfo/ietf