On Mon, Jan 12, 2015 at 10:41:01PM -0500, John C Klensin wrote:
> --On Monday, January 12, 2015 18:08 -0600 Nico Williams
> <nico@xxxxxxxxxxxxxxxx> wrote:
> > Well alright.  I'd love to see a set of guidelines for I18N
> > activities.
>
> So would we all.  RFC 2277 was supposed to provide some guidance
> but is now badly obsolete in many different ways, including
> exhibiting how little we knew about some things at the time.  We
> have, I hope, learned a lot, but see below.
>
> > When should we try to support Unicode, and when should we not?
> > Is it one of those "I know it when I see it" kinds of
> > guidelines?  That wouldn't be useful enough :(
>
> Let me suggest a general way of thinking about things -- maybe
> not quite a "guideline".  Especially for security-type
> protocols, make sure there is a substantive reason, presumably
> connected to users and user experience, for it to be necessary
> to go beyond ASCII.  I really do mean "necessary": if it is just
> a good idea in principle or a maybe-nice-to-have or "maybe
> someone will want this some day", skip it because adding i18n
> capabilities _will_ make correct and predictable implementations
> more difficult and _will_ increase the number and range of
> attack opportunities.

Yes, I18N is all about UIs and the UX.

Clearly, if a character string isn't a UI element, and is never a
visible aspect of the UX, then it is a great candidate for being made
US-ASCII only.  Indeed, we *should* make all such strings US-ASCII only.
That much is obvious, and whether or not something is part of the UI is
an objective measure with relatively little room for doubt.

But there are UI elements that could reasonably be constrained to
US-ASCII (because the world over, people manage to deal with US-ASCII
character strings in various parts of their UIs).  The tricky part is
deciding what UI elements (or things leaking into them) qualify.

For example, a "manufacturer" name in PKCS#11 could reasonably be
constrained to US-ASCII only.
Right?  Well, maybe a French -say- manufacturer might object.

An interesting distinction here might be: name or identifier?
Identifiers (appearing in UIs) -> US-ASCII.  Names -> Unicode.

Token and object labels seem a lot like identifiers in the use cases I
expect.  But I can't be certain that they would never be expected to
contain names.  Manufacturer names really are names, no?

These are decisions that we can make that can anger people who are not
participating here today.

> > Mind you, IIRC PKCS#11 didn't even say anything about ASCII
> > before.  Token labels and such used to be fixed-sized octet
> > strings containing character data.  Jan can correct me if I'm
> > wrong.  I'm not sure even saying "ASCII-only" would
> > necessarily be safe in that case...
>
> And that reinforces my view that the real, underlying, problem
> here has to be fixed in PKCS#11, not in anything the IETF puts
> on top of it.  Only they can fix the problems; we can, at best,
> mitigate the damage.

Yes.  But look, PKCS#11 is a thing with a low count of character
strings.  Mostly things will be looked for with equivalence semantics,
and form-insensitive Unicode string comparison will do for that (at the
expense of having the code for it), as will plain old octet string
comparison (because we can expect happy input method output form
agreement accidents).

I think Jan's text is fine.  I don't mean to belabor this thread.  I'm
now only commenting on the more general matter of when we should be
happy to settle for less than the full I18N treatment.

> > Fortunately the OASIS PKCS11 TC has clarified that these are
> > UTF-8; unfortunately they left other I18N details out.
>
> It appears to me that what they have said puts their level of
> understanding of the various issues somewhat behind where we
> were when RFC 2277 was written in 1997.

Yes, but it's also fair to note the above, that this is the sort of case
where a low-effort I18N ("say UTF-8; say nothing about anything else")
seems likely to be good enough for most implementors and users.

Nico
--
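
To make the two comparison options mentioned in this message concrete
(plain octet string comparison versus form-insensitive Unicode string
comparison), here is a minimal Python sketch.  It is only an
illustration under stated assumptions: the function names octet_equal
and form_insensitive_equal and the sample label are hypothetical and
come from neither PKCS#11 nor any library; labels are assumed to be
valid UTF-8; and NFC is an arbitrary choice of normalization form for
the comparison.  Only unicodedata.normalize is a real standard-library
call.

```python
import unicodedata


def octet_equal(a: bytes, b: bytes) -> bool:
    """Plain octet string comparison: cheap, but sensitive to Unicode form."""
    return a == b


def form_insensitive_equal(a: bytes, b: bytes, form: str = "NFC") -> bool:
    """Compare two UTF-8 strings after normalizing both to the same form.

    Assumption: both inputs are valid UTF-8 (decode raises otherwise).
    """
    sa = unicodedata.normalize(form, a.decode("utf-8"))
    sb = unicodedata.normalize(form, b.decode("utf-8"))
    return sa == sb


# Hypothetical label, entered once in precomposed form and once with
# combining accents, as two different input methods might produce it.
composed = "Soci\u00e9t\u00e9".encode("utf-8")
decomposed = "Socie\u0301te\u0301".encode("utf-8")

print(octet_equal(composed, decomposed))             # False
print(form_insensitive_equal(composed, decomposed))  # True
```

Octet comparison only succeeds when both sides happen to use the same
form, which is the "happy accident" case; the normalizing comparison
does not depend on that, at the cost of carrying the normalization
code.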