Re: I-D file formats and internationalization

Paul Hoffman <paul.hoffman@xxxxxxxx> · Wed, 30 Nov 2005 18:29:46 -0800

At 5:59 PM -0800 11/30/05, Douglas Otis wrote:
On Nov 30, 2005, at 2:23 PM, Paul Hoffman wrote:
At 1:54 PM -0800 11/30/05, Douglas Otis wrote:

Rather than opening RFCs to text utilizing any character-set 
anywhere, as this draft suggests,

That is not what the RFC suggests at all. The character set is 
Unicode. The encoding is UTF-8. That's it.
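
To make the distinction between character set and encoding concrete,
here is a minimal Python sketch. The sample string and the helper
names are assumptions chosen for illustration; nothing here comes
from the draft itself. It shows that Unicode assigns each character a
code point, while UTF-8 is only the byte serialization of those code
points.

```python
# Illustrative only: the sample text and helper names are assumptions.

def code_points(text):
    """Return the Unicode code point of each character, e.g. 'U+00E9'."""
    return [f"U+{ord(ch):04X}" for ch in text]


def utf8_bytes(text):
    """Return the UTF-8 encoding of the text as bytes."""
    return text.encode("utf-8")


sample = "caf\u00e9"  # "cafe" with U+00E9 LATIN SMALL LETTER E WITH ACUTE

print(code_points(sample))  # ['U+0063', 'U+0061', 'U+0066', 'U+00E9']
print(utf8_bytes(sample))   # b'caf\xc3\xa9'
```

ASCII characters keep their one-byte values in UTF-8, so only the
last character here occupies two bytes.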

Unicode provides a unique number for every possible character within 
a current range of about 97,000 characters.  These characters 
include punctuation marks, diacritics, mathematical and technical 
symbols, arrows, dingbats, etc.  Displaying one of these characters 
requires a character-set (synonymous with a display system's 
font-set or character-repertoire), or using the Unicode vernacular, 
a script.  It is not just a matter of which character is displayed 
or which character-repertoire is used; there are also Middle 
Eastern right-to-left issues.

It may be better to use a single vocabulary for discussing things 
such as internationalization and character sets. That's the purpose 
of RFC 3536.

Being able to review the ID as it would appear as an RFC would 
also seem to be a requirement.

That means changing the Internet Drafts process as well. Certainly 
possible, but more daunting than changing one process at a time.

As an ID becomes an RFC, expecting last-minute changes to the 
document would seem even more daunting.

Yep, that's the tradeoff. We already make some automatic changes 
after an Internet Draft is approved by the IESG, and we allow others 
without IESG oversight. This would be another class. That scares some 
people, and not others. Having Internet Drafts use Unicode in UTF-8 
instead of ASCII scares some people, and not others.

It seems problematic for protocol examples to use non-ASCII 
characters, since no character-set is ubiquitously displayable.

Unicode is universally displayable if you have the right font(s). 
Regardless of that, however, any sane document author would not 
assume that every person reading the document could display it. 
They would put a legend or explanation near the example.

Assume such characters cannot be displayed, at least not with the 
ASCII version that excludes the extended character-set allowed by 
Unicode.  An escape mechanism would be needed to accommodate 
alternative text, where displaying '?' for the Unicode characters 
that extend beyond ASCII would not be a very satisfactory solution, 
as this would make the ASCII version less authoritative, to say the 
least, and break the way many use the RFC text files.
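
The '?' substitution described above can be sketched in Python. This
is a hypothetical illustration, not a description of any RFC Editor
tooling; the sample string and function name are assumptions. It
shows that forcing text into ASCII with '?' placeholders loses
information, whereas the same text survives a UTF-8 round trip.

```python
# Illustrative only: sample text and function name are assumptions.

def to_ascii_lossy(text):
    """Force text into ASCII, replacing each non-ASCII character with '?'."""
    return text.encode("ascii", errors="replace").decode("ascii")


original = "caf\u00e9 \u2192 na\u00efve"  # e-acute, rightwards arrow, i-diaeresis

lossy = to_ascii_lossy(original)
print(lossy)  # caf? ? na?ve

# The UTF-8 bytes decode back to exactly the original text.
round_trip = original.encode("utf-8").decode("utf-8")
print(round_trip == original)  # True
```

Three different characters collapse to the same '?', so the original
cannot be recovered from the ASCII rendering.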

No escape mechanism is needed. Non-displayable characters are still 
in the RFC, they simply can't be displayed by everyone (but they can 
be displayed by many). This is infinitely simpler, and a much better 
long-term solution, than "an escape mechanism". Further, there would 
be no more "ASCII version" to be authoritative. The Internet Draft 
clearly says that there is a single RFC, and it has a single encoding.

I liked the idea that Frank suggested: use the HTML escape 
sequence to declare the Unicode character.  This allows the ASCII 
version to remain authoritative.

... as well as unreadable and unsearchable using normal search 
mechanisms. The purpose of the proposal is to allow RFCs to be 
readable and searchable using the encoding that is common on the 
Internet, without resorting to sorta-kinda-HTML or an "escape 
mechanism". Remaining with plain ASCII would be better than either of 
the latter.
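
A small Python sketch of the searchability point, under the
assumption that "HTML escape sequence" means numeric character
references such as &#233; (the thread does not pin down the exact
form). The sample word and function name are hypothetical.

```python
# Illustrative only: assumes numeric character references as the escape form.
import html


def to_html_escaped(text):
    """Keep ASCII as-is; replace other characters with &#NNN; references."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


word = "caf\u00e9"  # "cafe" with U+00E9 as the final letter

escaped = to_html_escaped(word)
print(escaped)           # caf&#233;

# A plain substring search for the real word fails on the escaped text...
print(word in escaped)   # False

# ...and succeeds only after the reader's tool undoes the escaping.
print(word in html.unescape(escaped))  # True
```

The escaped form stays pure ASCII, but ordinary search tools would
need an extra unescaping step before the text matches what a reader
types.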

--Paul Hoffman, Director
--VPN Consortium
