> Folks who haven't been involved in multi-lingual
> computing might not realise quite what a minefield this whole
> area is. Approximately speaking, choices about how to
> represent glyphs/letters and how to encode them usually have
> some inherent linguistic bias.

In addition, many applications of encoding glyphs need to consider the
direction of the mapping. It is one thing to transliterate another writing
system into ASCII for the purposes of providing a readable format. It is
another thing to have an encoding that works in the other direction, so that
some form of ASCII input can be accurately translated into a set of Unicode
glyphs.

The former (output transliteration) is not so hard to achieve, e.g. any
Russian speaker will understand "Ya ne znayu chto skazat'" and any Chinese
speaker who has learned a Roman-alphabet language should be able to
understand "Wo mi-lu le". Of course "Ja nye znaju shto skazat" would be just
as readable to most Russians, and most Chinese speakers would struggle with
long texts that lack the accents that official Hanyu Pinyin uses to mark the
four tones. I think it is conceivable to supply an official output
transliteration to ASCII for all Unicode glyphs, and much of this work has
already been done by language groups; for instance, TSCII transliterates
Tamil letters into ASCII. Translating from ASCII into Unicode is far more
complex and probably impossible without introducing inter-glyph gaps and
weird accent codes, as RFC 1345 did.

Many of the input methods that are used for typing foreign-language glyphs
into a computer are actually "Input Method Editors" that have built-in
dictionaries and help users choose one of a series of possible glyphs. For
instance, Japanese can represent the single syllable "fu" with three or more
glyphs. To choose the right one you need to know whether the entire word is
borrowed from a foreign language, and if not, there is still the choice of
whether to use the Hiragana glyph or one of the Kanji glyphs borrowed from
Chinese; there are at least two Kanji glyphs with an ON-reading of "fu". In
spite of all this complexity on input, there is a standard transliteration
for output.

> People who use languages that
> aren't optimally encoded in some representation tend not to
> be very happy with the non-optimal encodings their language
> might have been given.

That is the key to this whole thing. What is the use case for these ASCII
encodings? If an encoding is not usable by native speakers of the language
using the original glyphs, then I can't see any worthwhile use case at all.
UTF-8 works for any machine-to-machine communication. Existing input methods
work for converting from ASCII to glyphs, but these are not simple mapping
tables. And existing transliteration systems work for generating readable
ASCII text from glyphs, although they may not be fully standardised. For
instance, a Russian named "Yuri" (transliterated according to English
pronunciation rules) will have his name written as "Iouri" on his passport,
because that is the transliteration according to French pronunciation rules,
French being the former lingua franca of international diplomacy.

RFC 1345 should be deprecated because it misleads application developers
(e.g. the Lynx case), and the work on transliteration and input methods is
being done more effectively outside the IETF.

--Michael Dillon
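
To make the point about output transliteration concrete: the glyph-to-ASCII
direction really can be a simple per-glyph lookup. The sketch below uses a
rough, partial Cyrillic-to-Latin table invented for the example (not any
official romanisation standard), and the function name transliterate is
likewise just illustrative.

```python
# Sketch only: glyph-to-ASCII output transliteration as a plain mapping.
# The table is a partial, unofficial Cyrillic-to-Latin example, enough to
# cover the phrase used in the message above.
CYRILLIC_TO_ASCII = {
    "я": "ya", "н": "n", "е": "e", "з": "z", "а": "a", "ю": "yu",
    "ч": "ch", "т": "t", "о": "o", "с": "s", "к": "k", "ь": "'",
    " ": " ",
}

def transliterate(text: str) -> str:
    """Map each glyph to its ASCII rendering; pass unknown characters through."""
    return "".join(CYRILLIC_TO_ASCII.get(ch.lower(), ch) for ch in text)

print(transliterate("я не знаю что сказать"))  # -> "ya ne znayu chto skazat'"
```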
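
By contrast, the input direction is not a lookup table at all, which is why
Input Method Editors exist. A minimal sketch, using made-up candidate lists
rather than a real IME dictionary, of how an ASCII-spelled syllable such as
"fu" fans out into several possible glyphs that the user (or a dictionary)
must choose between:

```python
# Sketch only: romaji-to-glyph input is one-to-many, so an IME presents
# candidates instead of applying a fixed mapping. The candidate lists here
# are illustrative, not a complete or authoritative IME dictionary.
CANDIDATES = {
    "fu": ["ふ", "フ", "不", "府", "布", "婦"],  # hiragana, katakana, several kanji read "fu"
    "wo": ["を", "ヲ"],
}

def candidates_for(romaji: str) -> list[str]:
    """Return the possible glyphs for an ASCII-spelled syllable."""
    return CANDIDATES.get(romaji, [])

# A real IME would rank these using context and a dictionary; here we just
# list them so the user could pick one.
for glyph in candidates_for("fu"):
    print(glyph)
```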