On Thu, 2021-02-25 at 22:09 -0500, John C Klensin wrote:
> --On Thursday, February 25, 2021 22:05 +0100 Kjetil Torgrim
> Homme <kjetilho@xxxxxxxxxx> wrote:
>
> > ...
> > This was partially one of the points I made earlier. The
> > draft is eerily silent on what to do with a reaction like "J".
> > It is not an unallocated code point, but it is not a valid
> > emoji either. When I brought it up, Dave seemed to expect it
> > to be presented as the plain "J" it is. (Also consider the
> > draft explicitly accepts single-byte emojis, even though this
> > is at odds with Emoji-Seq.)
>
> Actually, that is an odd near-ambiguity in the document that I
> think should be corrected as other things are sorted out (and,
> if needed, flagged to the RFC Editor before publication). If
> one puts the URL aside as a convenience, the reference for
> [Emoji-Seq] points to UTS#51, which clearly allows single-code
> point [1] emoji and even some traditional symbols. On the other
> hand, the link is to
> http://www.unicode.org/reports/tr51/#def_emoji_sequence, which,
> at least today, is rule ED-17 of the Version 13.1, 2020-09-18
> version of UTS#51. It points back to ED-15 for
> <emoji_core_sequence> which rather clearly (at least IMO) allows
> a single emoji character, which leads to ED-3 and single code
> point emoji. So, probably that is consistent in the document.
> Whether it is reasonable to expect anyone implementing the I-D
> to search through that is another question but is part of my
> concern about incorporating UTS#51 by reference and moving on.

I think that is entirely reasonable. The rule references are
collected in one place (except the \P{} mechanism) and the syntax
used is close enough to ABNF to be easy to follow for anyone who has
worked with IETF standards before.
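The distinction drawn above, between a code point that is unallocated
and one like "J" that is allocated but simply not an emoji, and
between "single byte" and "single code point", can be made concrete
with a minimal sketch. It assumes Python's standard unicodedata
module, which exposes the general category but not the Emoji property
from UTS#51, so it deliberately does not decide emoji-ness. The
function name classify and the sample set are hypothetical, and the
result for U+0378 assumes that code point is still unassigned in the
Unicode version your Python ships with.

```python
import unicodedata


def classify(ch: str) -> str:
    """Coarsely classify one code point by its Unicode general category."""
    category = unicodedata.category(ch)
    if category == "Cn":  # unassigned (this category also covers noncharacters)
        return "unallocated"
    if category == "Co":  # private use
        return "private-use"
    return "allocated"  # says nothing about whether it is an emoji


# Illustrative samples only; the notes are descriptions, not spec text.
samples = {
    "J": "LATIN CAPITAL LETTER J",
    "#": "NUMBER SIGN",
    "\U0001F44D": "THUMBS UP SIGN",
    "\uE000": "first private use code point",
    "\u0378": "assumed unassigned",
}

for ch, note in samples.items():
    octets = len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X}  {classify(ch):<12} {octets} octet(s) in UTF-8  ({note})")
```

A check against the Emoji property itself would need the data in
emoji-data.txt or a library that ships it; that part is left out here
rather than guessed at.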
> On the other hand, because "J" (I'm getting U+0022 from your
> message -- see below -- and not a special symbol) does not appear
> to have the emoji property, I believe the I-D forbids it
> entirely and having it appear as "J" would violate the spec.

Yes, it is indeed a plain "LATIN CAPITAL LETTER J", and yes it
violates the spec, but so do any of the unallocated code points. Why
spell out how to handle one illegal character sequence and not the
other?

> If
> the intention is that non-emoji appear as themselves, then
> either the <part-content> in the spec is wrong or, IMO, the spec
> needs some words about how receivers are expected to handle
> content that lies outside the specification.

Exactly. The Postel Principle is both a blessing and a curse, but
more often than not the latter, IMHO. I want more strictness in
general.

Existing text + suggested clarification:

   Reference to unallocated code points SHOULD NOT be treated as an
   error; associated bytes SHOULD be processed using the system
   default method for denoting an unallocated or undisplayable code
   point.
+  Code points from the private use area MUST NOT be used.
+  Other violations of the grammar SHOULD cause the part to be
+  discarded.

and in section 3, inject a new step 4 like:

+  4. Reaction parts which are not associated with a valid In-Reply-To
+     SHOULD be discarded.

> > I do not want some clients presenting the "J" as a "J" and
> > some as a smiley (think Wingdings) and some as a Unicode
> > replacement character.
>
> If the "J" you are using is actually a character outside the
> ASCII repertoire, then either your mail system, the IETF's, or
> mine did use a replacement character and that should be a
> warning to all of us (and not only in this particular context).

It used to be very common to get mail from Microsoft users with "J"
strewn across the message - in old Wingdings, the code point "J" was
replaced by a smiley face.
For users without the Wingdings font, a normal font was substituted,
revealing its true code point. See? J

> However, even for "normal" emoji, there are risks if you, as the
> sender, are expecting a particular grapheme to be delivered.
> For example, if one has "thumbs-up" (U+1F44D) (cited several
> times in discussion in the I-D and on the <base-emojis> list),
> there is no way to guarantee whether you will get something that
> looks like the left hand with the thumb up, the right hand with
> the thumb up, or neither. And, further to Patrik's point, at
> least one of those is an obscenity in some cultures.

Sure, but this is impossible to safeguard against. E.g., I myself
like to use U+1F64F 🙏 PERSON WITH FOLDED HANDS as a symbol for
thanks, since it looks like the gesture of namaste. I was mulling
over using textual labels to reduce the chance of this happening, but
in reality there *is* a textual label already, and a ":pray:" textual
label (like in Slack or Mattermost) would not be visible to end users
any more than the Unicode label is.

> > The easy way out is to not restrict the allowable set of
> > codepoints, which means allowing the shrug sequence and the
> > table rage sequence above. I will note that the draft's
> > grammar allows whitespace between each "emoji", or let's call
> > them individual emotions, which means "Great Job" could parse
> > as two emotions. Or not. My preferred solution is still that
> > all non-emoji (according to TR51) should be presented as if
> > they were unallocated code points.
>
> Well, that takes out Adam's example and your "J".

I don't actually have a strong preference for disallowing them or
presenting them as undefined. But yes. If we are going to be
lenient, let's be explicitly lenient and allow any single line
response, e.g., "Bravo!"
as well as "<U+1F44D>"

> I'm going to confine myself to Unicode and i18n issues (and
> disclaim ABNF expertise as a matter of habit) and let the
> authors respond to that, and your remaining comments -- I hope
> not just by telling you it's very late to be raising those issues.

I understand where he's coming from. I am following this from
"last-call" only, for good or for bad. Perhaps there should be
pointers to where to go for archived discussion when posting to
last-call. I don't actually know what mailing list this draft was
discussed on.

> [1] I'm avoiding "byte" because the vast number of code points
> people think of as emoji, even the assorted faces in other
> symbols in the "Miscellaneous Symbols" block at U+2600-U+26FF,
> cannot be represented in a single octet in any Unicode encoding
> scheme. The exceptions (#, *, 0-9, ©, and ®) are, I believe,
> not what most people think of when they think of emoji. See
> https://www.unicode.org/Public/UCD/latest/ucd/emoji/emoji-data.txt.
> I assume you know that already, but some people reading this may
> not and it emphasizes the importance of being very precise when
> talking about these things.

Actually I did not know or glossed over it, since I consulted
"full-emoji-list" instead. So yes, #*0123456789 are all plain ASCII
characters that have the Emoji property set. In other words, I was
wrong when I said a single byte emoji was impossible. I do however
still think it is unfortunate terminology. Better to use "character"
or "code point" as you too indicated elsewhere.

--
venleg helsing,
Kjetil T.

--
last-call mailing list
last-call@xxxxxxxx
https://www.ietf.org/mailman/listinfo/last-call