Kjetil,

Disclaimer: I read your note and started this response just after sending my long one, but was interrupted by a meeting/call I had to take. So, if messages have arrived since mine was sent, I have not read them yet. If they bear on this part of the conversation, my apologies.

The much longer note I just posted covers some of this and I will try not to repeat myself, but a few comments:

First and most important, thanks for the reminder that issues of this general family were raised much earlier and not really addressed, and that this is not just Patrik and myself having late attacks of Unicode- or i18n-anxiety.

More specific comments inline below (with some trimming)...

--On Thursday, February 25, 2021 22:05 +0100 Kjetil Torgrim Homme <kjetilho@xxxxxxxxxx> wrote:

>...
> This was partially one of the points I made earlier. The
> draft is eerily silent on what to do with a reaction like "J".
> It is not an unallocated code point, but it is not a valid
> emoji either. When I brought it up, Dave seemed to expect it
> to be presented as the plain "J" it is. (Also consider the
> draft explicitly accepts single-byte emojis, even though this
> is at odds with Emoji-Seq.)

Actually, that is an odd near-ambiguity in the document that I think should be corrected as other things are sorted out (and, if needed, flagged to the RFC Editor before publication). If one puts the URL aside as a convenience, the reference for [Emoji-Seq] points to UTS#51, which clearly allows single-code-point [1] emoji and even some traditional symbols. On the other hand, the link is to http://www.unicode.org/reports/tr51/#def_emoji_sequence, which, at least today, is rule ED-17 of Version 13.1 (2020-09-18) of UTS#51. It points back to ED-15 for <emoji_core_sequence>, which rather clearly (at least IMO) allows a single emoji character, which leads to ED-3 and single code point emoji. So that is probably consistent in the document.
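To make the "single code point" versus "single byte" distinction concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not anything the I-D specifies: the embedded table is a tiny hand-abbreviated excerpt in the line format used by the Unicode emoji-data.txt file (https://www.unicode.org/Public/UCD/latest/ucd/emoji/emoji-data.txt), and the names load_property and describe are hypothetical helpers invented for this example. A real check would load the full file.

```python
import re

# ASSUMPTION: a hand-abbreviated excerpt in the emoji-data.txt line format
# ("codepoint-or-range ; Property # comment"). The real file is far longer.
SAMPLE_EMOJI_DATA = """\
# abbreviated, illustrative excerpt only
0023          ; Emoji    # number sign
002A          ; Emoji    # asterisk
0030..0039    ; Emoji    # digit zero..digit nine
00A9          ; Emoji    # copyright
00AE          ; Emoji    # registered
1F44D         ; Emoji    # thumbs up
"""

# One code point, or a "first..last" range, followed by the property name.
LINE_RE = re.compile(r"^([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s*(\w+)")


def load_property(text, prop="Emoji"):
    """Return the set of code points that carry `prop` in the given text."""
    code_points = set()
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match or match.group(3) != prop:
            continue  # comment line, blank line, or a different property
        first = int(match.group(1), 16)
        last = int(match.group(2) or match.group(1), 16)
        code_points.update(range(first, last + 1))
    return code_points


def describe(char, emoji_set):
    """Return (code point label, has Emoji property, UTF-8 octet count)."""
    cp = ord(char)
    return (f"U+{cp:04X}", cp in emoji_set, len(char.encode("utf-8")))


EMOJI = load_property(SAMPLE_EMOJI_DATA)
for ch in ["J", "#", "\u00a9", "\U0001F44D"]:
    print(describe(ch, EMOJI))
```

With this excerpt, "J" (U+004A) is one octet but lacks the property, "#" has the property in one octet, U+00A9 has it but takes two octets in UTF-8, and U+1F44D has it and takes four. Octet count and emoji status are independent questions.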
Whether it is reasonable to expect anyone implementing the I-D to search through that chain of definitions is another question, but it is part of my concern about incorporating UTS#51 by reference and moving on.

On the other hand, because "J" (I'm getting U+004A from your message -- see below -- and not a special symbol) does not appear to have the emoji property, I believe the I-D forbids it entirely and having it appear as "J" would violate the spec. If the intention is that non-emoji appear as themselves, then either the <part-content> in the spec is wrong or, IMO, the spec needs some words about how receivers are expected to handle content that lies outside the specification.

> I do not want some clients presenting the "J" as a "J" and
> some as a smiley (think Wingdings) and some as a Unicode
> replacement character.

If the "J" you are using is actually a character outside the ASCII repertoire, then either your mail system, the IETF's, or mine did use a replacement character, and that should be a warning to all of us (and not only in this particular context). However, even for "normal" emoji, there are risks if you, as the sender, are expecting a particular grapheme to be delivered. For example, if one has "thumbs-up" (U+1F44D) (cited several times in discussion in the I-D and on the <base-emojis> list), there is no way to guarantee whether you will get something that looks like the left hand with the thumb up, the right hand with the thumb up, or neither. And, further to Patrik's point, at least one of those is an obscenity in some cultures.

> The easy way out is to not restrict the allowable set of
> codepoints, which means allowing the shrug sequence and the
> table rage sequence above. I will note that the draft's
> grammar allows whitespace between each "emoji", or let's call
> them individual emotions, which means "Great Job" could parse
> as two emotions. Or not. My preferred solution is still that
> all non-emoji (according to TR51) should be presented as if
> they were unallocated code points.

Well, that takes out Adam's example and your "J". It (and the current spec) also allows some interesting sequences, such as the "police shoot crocodile" (or vice versa) example in UTS#51.

But I have a different concern that I have not raised (recently): my expectation is that, at least unless the spec warns them to do something differently (and I don't think it should, or that they would if we told them to), the typical MUA is going to take whatever appears in <part-content> and pass it off as a string or unexamined Unicode code points to whatever it uses to render strings and put them on the screen (or wherever). Now suppose that the rendering routine receives an emoji sequence or emoji-containing sequence that is not allowed by UTS#51. Does it not even bother with UTS#51's rules and try rendering whatever it gets anyway (as it might if it didn't know that emoji rendering was in any way special)? Does it guess at what was intended, perhaps more or less the way font substitution works? Does it get confused and display semi-random garbage? Or does it give the user, or arrange for the user to be given, a clear error message about being passed an invalid or nonsense string?

And, in terms of the I-D -- whether your proposal is adopted or we stay with UTS#51's definition of emoji sequence -- do we give any advice to receiving systems about those cases, strengthen the wording about "operational problems" in Section 7, or just assume that it, like other bad stuff, will be reported? Personally, I'd prefer to see some advice or warning because, if the complex or silly cases blow up in the faces of users, it could give this whole idea an undeserved bad reputation.

> Speaking of whitespace, the grammar uses LWSP = *(WSP / CRLF
> WSP) This is IMHO at odds with "The content of this part is
> restricted to single line of emoji." Why allow CRLF if only a
> single line is allowed? Why restrict to a single line?
>...

I'm going to confine myself to Unicode and i18n issues (and disclaim ABNF expertise as a matter of habit) and let the authors respond to that, and to your remaining comments -- I hope not just by telling you it's very late to be raising those issues.

best,
   john

[1] I'm avoiding "byte" because the vast number of code points people think of as emoji, even the assorted faces and other symbols in the "Miscellaneous Symbols" block at U+2600-U+26FF, cannot be represented in a single octet in any Unicode encoding scheme. The exceptions (#, *, and 0-9, plus © and ® if one counts ISO 8859-1, since those two need two octets in UTF-8) are, I believe, not what most people think of when they think of emoji. See https://www.unicode.org/Public/UCD/latest/ucd/emoji/emoji-data.txt. I assume you know that already, but some people reading this may not, and it emphasizes the importance of being very precise when talking about these things.

--
last-call mailing list
last-call@xxxxxxxx
https://www.ietf.org/mailman/listinfo/last-call