Re: [Last-Call] [I18ndir] last call reviews of draft-ietf-regext-epp-eai-12 (and -15)

On 9/26/2022 12:31 AM, Martin J. Dürst wrote:
Very sorry to be late with my reply, and for not replying to the latest posting from John Klensin in this thread.

On 2022-09-14 04:03, John C Klensin wrote:
James,

My apologies for not having responded to your note sooner.
I've been preoccupied with several unrelated things.

I greatly appreciate the changes to use an existing EPP
extension framework and to correct the terminology error of EAI
-> SMTPUTF8.   I agree that the more substantive SMTPUTF8
technical issues should go back to the WG.

However, in order that the discussion you suggest for IETF 115
be useful and not just lead to another round of heated Last Call
discussions, I think that, for the benefit of those who have
been following the discussion closely and those who should have
been, it is important to be clear about what the disagreement is
about.  When you characterize the issue as "e-mail cardinality",
it makes it sound, at least to me (maybe everyone in the WG has
a better understanding) like this is some subtle technical
matter.

It really isn't.  The EAI WG was very clear during the
development of the SMTPUTF8 standards that the biggest problems
with non-ASCII email addresses were going to be with user agents
(MUAs) (and, to some degree, with IMAP and POP servers that are
often modeled as part of MUAs) and not with SMTP transport over
the Internet.  Making an MUA tailored to one particular language
and script (in addition to ASCII), or even a handful of them, is
fairly easy.  Making one that can deal well with all possible
SMTPUTF8 addresses is very difficult (some would claim
impossible, at least without per-language, or
per-language-group, plugins or equivalent).

I very strongly think that "an MUA that can deal well with all possible SMTPUTF8 addresses" is a red herring.

First, as far as the backing store (in-memory representation) is concerned, any implementation that is able to handle full Unicode and SMTPUTF8 will be fine; there is no dependency there on natural languages or scripts. And because these days most MUAs use user-interface toolkits or OS components that support Unicode, that part comes essentially for free for most MUAs. This leaves the logic of "if non-ASCII in the LHS of the email address, then use SMTPUTF8, otherwise not" and the transcoding from the internal Unicode representation (possibly UTF-16) to and from UTF-8 (available as a library function). So on this level, an MUA that is able to deal with SMTPUTF8 at all is able to deal with all possible SMTPUTF8 addresses, or otherwise it's very badly written.
Thank you for putting this so clearly. I had assumed it to be true, but didn't want to say anything because I'm not specifically conversant with the e-mail protocols in particular. The situation you describe is now pretty much the standard for any type of application that "supports Unicode", for whatever purpose. That makes exceptions for some types of Unicode strings rather less well motivated.
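To make that concrete: here is a minimal sketch (mine, not from the draft or from any particular MUA) of the decision and transcoding logic described above, using only the Python standard library:

# Decide per address whether SMTPUTF8 is needed, and transcode the internal
# Unicode string (often UTF-16 inside a toolkit) to UTF-8 for the wire.

def needs_smtputf8(address: str) -> bool:
    # True if the local part (left-hand side) contains non-ASCII characters.
    local_part, _, _domain = address.rpartition("@")
    return not local_part.isascii()

def to_wire(address: str) -> bytes:
    # The UTF-8 transcoding is a one-line library call.
    return address.encode("utf-8")

for addr in ("user@example.com", "用户@example.com"):
    print(addr, "SMTPUTF8 needed:", needs_smtputf8(addr), "wire form:", to_wire(addr))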

Second is the level of display. Here again, it's important to understand that MUA implementers will just use a toolkit, which includes a rendering library (such as harfbuzz) that takes care of all the glyph selection and shaping details. And it will use (via that library) the fonts available on the OS. If the necessary font is not available (e.g. for scripts just recently added to Unicode), then square boxes or question marks or something similar will be displayed, but it should still be possible to copy an address from a browser to an (SMTPUTF8-capable) MUA and send the mail. The same goes for rendering variations; the browser may show a frog with a tongue, but the MUA may show a frog followed by a tongue. If that's the result of copy-paste, the mail should still be delivered correctly.
This is the crux. These kinds of toolkits and platform support are widely available (except for scripts that Unicode explicitly recommends excluding from IDNs). (But I see that you are getting to that part of the argument below.)


[It is important to note here that these days, the number of email addresses that get copied by hand from a napkin or business card into an MUA is way down, and copying from one application (e.g. a browser) to another is the mainstream.]
Transcoding to ASCII (or an alternate address) solves only the issue of guaranteeing that an operator can distinguish two strings from each other (having had to learn only one small set of symbols, in case ASCII isn't part of their native writing system), and it's a nice fallback way of keying in data - again, for trained operators. (We are informed that there are scripts whose native users see ASCII as a barrier.)
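A side note (my illustration, not something from this thread or from the draft): the domain half of an address already has a standardized all-ASCII form via IDNA A-labels, but there is no comparable standard downgrade for a non-ASCII local part, which is why a usable ASCII fallback has to be a separately registered alternate address. In Python terms:

# The domain can always be rendered in ASCII (IDNA/Punycode A-labels) ...
print("bücher.example".encode("idna").decode("ascii"))  # -> xn--bcher-kva.example

# ... but there is no standard ASCII transformation for a non-ASCII local part;
# any all-ASCII alias is a separate, independently chosen address.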

Third, there's a saying that "the better is the enemy of the good". It can be abused to justify sloppiness, but in the area of internationalization, it's very important. If somebody wants to use a Cyrillic or Devanagari or Han (Chinese/Japanese) or Greek, ... email address, they don't care whether a script such as Nag Mundari (new in Unicode 15.0.0, out on September 13) or some Egyptian hieroglyph format controls (also new in Unicode 15.0.0) or even some Devanagari characters used to represent auspicious signs found in inscriptions and manuscripts (ditto) are available. Because of the very long tail of languages, scripts, and characters, a requirement that "all possible SMTPUTF8 addresses" be covered is very counterproductive. It denies the huge majority of people interested in such addresses something because there may be others who aren't yet able to get it, and in turn only causes additional delay for everybody.
Realistically, there is little use case for anything not in the ~30 or so recommended scripts (for identifiers). There may be fewer than a dozen of the "limited use" scripts for which there is detectable online use of the kind that would correlate with those scripts being used for any type of identifier (IDNs, email names, user handles in social media).

I did an informal study on that a while back. That leaves between half and two thirds of all scripts being used only in very constrained settings (digitally archiving ancient texts, text examples in scholarly discourse, and what have you, including scripts for moribund languages or obsolescent writing systems).

Achieving solutions that "perfectly" cover these cases, and holding up specifications on their account, makes them the "enemy of the good".
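(Coming back to the "recommended scripts" point for a moment, and purely as my own illustration, not anything from the draft: an implementation that wanted to limit identifiers to the recommended scripts can test script membership directly. The third-party Python "regex" module exposes Unicode script properties, which the standard "re" module does not; the list below is a small subset, not the full UAX #31 table.)

import regex  # third-party: pip install regex

# Accept only characters from a handful of recommended scripts (subset shown).
RECOMMENDED = regex.compile(
    r"[\p{Latin}\p{Greek}\p{Cyrillic}\p{Devanagari}\p{Han}\p{Hangul}\p{Arabic}\p{Common}]+"
)

def in_recommended_scripts(local_part: str) -> bool:
    return RECOMMENDED.fullmatch(local_part) is not None

print(in_recommended_scripts("पता"))  # Devanagari -> True
print(in_recommended_scripts("ᚠ"))    # Runic (a historic, excluded script) -> False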

This is a bit of a change for Unicode. Up until about 10 versions ago, there were still significant additions or improvements to the modern repertoires of modern scripts. That has basically stopped. The only modern writing system with a modern repertoire that is still being actively added to is emoji.

All other additions to "modern" scripts are characters used for historic documents, for scholarly purposes, or to capture smaller languages and dialects, many of which are falling out of use or are used only orally except when documented.

Those cases realistically aren't required to "work": if you insist on using them, you might as well put a box into your alias, because nobody other than you is likely to understand what you are trying to do.

On the other hand, Hindi, Ukrainian, Greek, Farsi and Korean should be easy to support as is and thus present no great impediments to use.


So my conclusion for the draft in question is that allowing more than one email address won't hurt, that saying one of them can be an all-ASCII fallback won't hurt, but that holding up the draft if these changes are not made isn't really justified.

I wholeheartedly concur!
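(For readers less familiar with the draft's context: the cardinality conclusion above amounts to something like the following data model. This is purely my own illustration; the field names are hypothetical and not taken from the draft.)

from dataclasses import dataclass
from typing import Optional

@dataclass
class ContactEmail:
    # Primary address as registered; may be a non-ASCII (SMTPUTF8) address.
    primary: str
    # Optional all-ASCII alternate, usable as a fallback by ASCII-only systems.
    ascii_alternate: Optional[str] = None

    def address_for(self, peer_supports_smtputf8: bool) -> str:
        # Prefer the primary address; fall back to the ASCII alternate only
        # when the receiving side cannot handle SMTPUTF8.
        if peer_supports_smtputf8 or self.primary.isascii():
            return self.primary
        return self.ascii_alternate or self.primary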

A./


Regards,   Martin.


-- 
last-call mailing list
last-call@xxxxxxxx
https://www.ietf.org/mailman/listinfo/last-call
