Re: [Last-Call] New Version Notification for draft-crocker-inreply-react-07.txt

John C Klensin <john-ietf@xxxxxxx> · Thu, 25 Feb 2021 22:09:25 -0500

Kjetil,

Disclaimer: I read your note and started this response just
after sending my long one but was interrupted by a meeting/call
I had to take.  So, if there are messages that have arrived
since mine was sent, I have not read them yet.  If they bear on
this part of the conversation, my apologies.

The much longer note I just posted covers some of this and I
will try to not repeat myself, but a few comments: 

First and most important, thanks for the reminder that issues of
this general family were raised much earlier and not really
addressed and this is not just Patrik and myself having late
attacks of Unicode- or i18n-anxiety.

More specific comments in line below (with some trimming)...

--On Thursday, February 25, 2021 22:05 +0100 Kjetil Torgrim
Homme <kjetilho@xxxxxxxxxx> wrote:

>...
> This was partially one of the points I made earlier.  The
> draft is eerily silent on what to do with a reaction like "J".
> It is not an unallocated code point, but it is not a valid
> emoji either.  When I brought it up, Dave seemed to expect it
> to be presented as the plain "J" it is.  (Also consider the
> draft explicitly accepts single-byte emojis, even though this
> is at odds with Emoji-Seq.) 

Actually, that is an odd near-ambiguity in the document that I
think should be corrected as other things are sorted out (and,
if needed, flagged to the RFC Editor before publication).  If
one puts the URL aside as a convenience, the reference for
[Emoji-Seq] points to UTS#51, which clearly allows single-code
point [1] emoji and even some traditional symbols.  On the other
hand, the link is to
http://www.unicode.org/reports/tr51/#def_emoji_sequence, which,
at least today, is rule ED-17 of the Version 13.1, 2020-09-18
version of UTS#51.  It points back to ED-15 for
<emoji_core_sequence> which rather clearly (at least IMO) allows
a single emoji character, which leads to ED-3 and single code
point emoji.  So, probably that is consistent in the document.
Whether it is reasonable to expect anyone implementing the I-D
to search through that is another question but is part of my
concern about incorporating UTS#51 by reference and moving on.

On the other hand, because "J" (I'm getting U+004A from your
message -- see below -- and not a special symbol) does not appear
to have the emoji property, I believe the I-D forbids it
entirely and having it appear as "J" would violate the spec.  If
the intention is that non-emoji appear as themselves, then
either the <part-content> in the spec is wrong or, IMO, the spec
needs some words about how receivers are expected to handle
content that lies outside the specification.
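
To make the "emoji property" point concrete, here is a minimal
sketch of the lookup an implementer would need.  The data below is
a hand-written excerpt in the format of emoji-data.txt, not a
verbatim copy of the file, and the function names are my own
invention for illustration; a real check would load the published
file for the Unicode version in use.

```python
# Illustrative excerpt in the format of emoji-data.txt.
# NOT a verbatim copy; a real check should load the published file.
SAMPLE_EMOJI_DATA = """\
0023          ; Emoji   # number sign
002A          ; Emoji   # asterisk
0030..0039    ; Emoji   # digit zero..digit nine
00A9          ; Emoji   # copyright
00AE          ; Emoji   # registered
263A          ; Emoji   # white smiling face
1F44D         ; Emoji   # thumbs up
"""


def parse_emoji_ranges(text):
    """Return (first, last) code point ranges that carry the Emoji property."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        field, prop = (part.strip() for part in line.split(";"))
        if prop != "Emoji":
            continue
        first, _, last = field.partition("..")
        ranges.append((int(first, 16), int(last or first, 16)))
    return ranges


def has_emoji_property(ch, ranges):
    """True if the single code point ch falls inside any listed range."""
    cp = ord(ch)
    return any(first <= cp <= last for first, last in ranges)


RANGES = parse_emoji_ranges(SAMPLE_EMOJI_DATA)
for ch in ("J", "#", "\U0001F44D"):
    print(f"U+{ord(ch):04X} emoji property: {has_emoji_property(ch, RANGES)}")
```

Under this excerpt, U+004A is rejected while "#" and U+1F44D pass,
which is the per-code-point half of the question only; it says
nothing about whether a sequence of such code points is a valid
emoji sequence.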

> I do not want some clients presenting the "J" as a "J" and
> some as a smiley (think Wingdings) and some as a Unicode
> replacement character.

If the "J" you are using is actually a character outside the
ASCII repertoire, then either your mail system, the IETF's, or
mine did use a replacement character and that should be a
warning to all of us (and not only in this particular context).  
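
For anyone who wants to check what actually arrived in their own
copy of a message, a small sketch along these lines will list the
code points in a string.  The helper name is mine and purely
illustrative; it uses only the standard unicodedata module.

```python
import unicodedata


def describe(text):
    """Return one 'U+XXXX NAME' entry per code point in text."""
    entries = []
    for ch in text:
        name = unicodedata.name(ch, "<unnamed>")
        entries.append(f"U+{ord(ch):04X} {name}")
    return entries


# A plain ASCII J, the Unicode replacement character, and a smiley.
for sample in ("J", "\uFFFD", "\u263A"):
    print(describe(sample))
```

If the output for the character in question is U+FFFD, something
in the path substituted a replacement character; if it is U+004A,
the sender (or the sender's software) really did send a plain "J".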

However, even for "normal" emoji, there are risks if you, as the
sender, are expecting a particular grapheme to be delivered.
For example, if one has "thumbs-up" (U+1F44D) (cited several
times in discussion in the I-D and on the <base-emojis> list),
there is no way to guarantee whether you will get something that
looks like the left hand with the thumb up, the right hand with
the thumb up, or neither.  And, further to Patrik's point, at
least one of those is an obscenity in some cultures.

> The easy way out is to not restrict the allowable set of
> codepoints, which means allowing the shrug sequence and the
> table rage sequence above.  I will note that the draft's
> grammar allows whitespace between each "emoji", or let's call
> them individual emotions, which means "Great Job" could parse
> as two emotions.  Or not.  My preferred solution is still that
> all non-emoji (according to TR51) should be presented as if
> they were unallocated code points.

Well, that takes out Adam's example and your "J".  It (and the
current spec) also allow some interesting sequences, such as the
"police shoot crocodile" (or vice versa) example in UTS#51.
But I have a different concern that I have not raised
(recently):  my expectation is that, at least unless the spec
warns them to do something differently (and I don't think it
should or that they would if we told them to), the typical MUA
is going to take whatever appears in <part-content> and pass it
off as a string or unexamined Unicode code points to whatever it
uses to render strings and put them on the screen (or whatever).
Now suppose that the rendering routine receives an emoji
sequence or emoji-containing sequence that is not allowed by
UTS#51.  Does it not even bother with UTS#51's rule and try
rendering whatever it gets anyway (as it might if it didn't know
that emoji rendering was in any way special)?  Does it guess at
what was intended, perhaps more or less the way font
substitution works?  Does it get confused and display
semi-random garbage?  Or does it give the user, or arrange for
the user to be given, a clear error message about being passed
an invalid or nonsense string?
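
As a rough sketch of the last of those options (a clear error
rather than pass-through), a receiver could screen the content
before handing it to the renderer.  Everything here is an
assumption for illustration: the tiny KNOWN_EMOJI set stands in
for a real Emoji-property table, the function name is hypothetical,
and the check works one code point at a time without implementing
the emoji sequence grammar of UTS#51.

```python
# Hypothetical, deliberately tiny stand-in for a real Emoji-property table.
KNOWN_EMOJI = {0x1F44D, 0x1F44E, 0x263A}
# Code points that may join or modify emoji inside a sequence.
JOINERS = {0x200D, 0xFE0F}  # ZERO WIDTH JOINER, VARIATION SELECTOR-16


def check_reaction(content):
    """Return (ok, message) for a received reaction string.

    Per-code-point screening only; sequence validity is not checked.
    """
    stripped = content.strip()
    if not stripped:
        return False, "empty reaction"
    for ch in stripped:
        cp = ord(ch)
        if ch in " \t" or cp in KNOWN_EMOJI or cp in JOINERS:
            continue
        return False, f"unexpected code point U+{cp:04X} in reaction"
    return True, stripped


print(check_reaction("\U0001F44D"))
print(check_reaction("Great Job"))
```

With this policy "Great Job" is reported as an error at its first
letter instead of being rendered as two "emotions"; whether the I-D
should recommend that behavior is exactly the open question.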

And, in terms of the I-D --whether your proposal is adopted or
we stay with UTS#51's definition of emoji sequence-- do we give
any advice to receiving systems about those cases, strengthen
the wording about "operational problems" in Section 7, or just
assume that it, like other bad stuff, will be reported?
Personally, I'd prefer to see some advice or warning because, if
the complex or silly cases blow up in the faces of users, it
could get this whole idea an undeserved bad reputation.

> Speaking of whitespace, the grammar uses LWSP = *(WSP / CRLF
> WSP) This is IMHO at odds with "The content of this part is
> restricted to single line of emoji."  Why allow CRLF if only a
> single line is allowed?  Why restrict to a single line?
>...

I'm going to confine myself to Unicode and i18n issues (and
disclaim ABNF expertise as a matter of habit) and let the
authors respond to that, and to your remaining comments -- I hope
not just by telling you it's very late to be raising those issues.

best,
   john

[1] I'm avoiding "byte" because the vast number of code points
people think of as emoji, even the assorted faces and other
symbols in the "Miscellaneous Symbols" block at U+2600-U+26FF,
cannot be represented in a single octet in any Unicode encoding
scheme.  The exceptions (#, *, 0-9, ©, and ®) are, I believe,
not what most people think of when they think of emoji.  See
https://www.unicode.org/Public/UCD/latest/ucd/emoji/emoji-data.txt.
I assume you know that already, but some people reading this may
not and it emphasizes the importance of being very precise when
talking about these things.
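
A quick way to see the octet counts for oneself is sketched below.
The helper name is mine; the three encodings are the standard
Python codec names for UTF-8, UTF-16 (big-endian), and UTF-32
(big-endian), chosen so that no byte order mark is added to the
count.

```python
def octet_counts(ch):
    """Map each encoding name to the number of octets needed for ch."""
    encodings = ("utf-8", "utf-16-be", "utf-32-be")
    return {enc: len(ch.encode(enc)) for enc in encodings}


# NUMBER SIGN, WHITE SMILING FACE (Miscellaneous Symbols), THUMBS UP SIGN
for ch in ("#", "\u263A", "\U0001F44D"):
    print(f"U+{ord(ch):04X}", octet_counts(ch))
```

Only the ASCII "#" fits in one octet, and then only in UTF-8; the
smiley needs three octets in UTF-8 and thumbs-up needs four in
every one of the three.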
