Re: [Last-Call] New Version Notification for draft-crocker-inreply-react-07.txt

Martin J. Dürst <duerst@xxxxxxxxxxxxxxx> · Wed, 3 Mar 2021 08:40:12 +0900

Hello Ned, others,

Thanks for your changes, they look great.

For the record, I do not think that "octet" or "byte" are in any way
distasteful. It's just that they are appropriate in some contexts but
not in others, and your proposed texts fix that.

Regards,   Martin.

On 03/03/2021 01:27, Ned Freed wrote:
On Tue, Mar 2, 2021, at 9:15 AM, John C Klensin wrote:
I don't know whose concern was to make that particular switch
and why, but my concern about either (and, I'm guessing,
Martin's) is that almost all Unicode code points (those outside
the ASCII range) require more than one octet to represent in any
encoding scheme.  For UTF-8, which the I-D requires, the number
of octets is variable.  So using "octet" as a unit of --well,
much of anything--is, at best, confusing.
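
As a rough illustration of the variable-length point, here is a minimal Python
sketch. The sample characters are arbitrary choices for this example and are
not taken from the I-D; the helper name utf8_octet_count is likewise made up
here.

```python
# Illustrative only: sample code points chosen arbitrarily for this sketch.
samples = ["A", "\u00E9", "\u20AC", "\U0001F44D"]  # U+0041, U+00E9, U+20AC, U+1F44D


def utf8_octet_count(ch: str) -> int:
    """Return how many octets one code point occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


# Map each code point (in U+XXXX notation) to its UTF-8 length in octets.
octet_counts = {f"U+{ord(ch):04X}": utf8_octet_count(ch) for ch in samples}

for code_point, count in octet_counts.items():
    print(code_point, count)
```

Only the ASCII character fits in a single octet; the others take two, three,
and four octets respectively.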

"octet" appears in two places.

One:

    The rule emoji_sequence is inherited from [Emoji-Seq].  It defines a
    set of octet sequences, each of which forms a single pictograph.

I would replace "octet" with "code point".  The referenced document only
describes sequences of code points.  The encoding of those into octets is
orthogonal, and will be described by the content-type and
content-transfer-encoding jointly.  So, I think this change is a definite
improvement to accuracy, and is worth making.

Sigh. I noticed but then completely forgot about this. emoji_seq only goes as
far as code points, which leaves out the subsequent UTF-8 encoding.

More specifically, the base productions at the bottom of the ABNF are things
like:

   emoji_character := \p{Emoji}
   default_emoji_presentation_character := \p{Emoji_Presentation}
   default_text_presentation_character := \P{Emoji_Presentation}
   emoji_modifier := \p{Emoji_Modifier}

A definition in terms of a regexp that can apply to any encoding of Unicode.
Which of course makes sense in the context of this standard. Why get into
encoding when you don't have to?

For us the problem is that, when this is used as a production, many sets of
octet sequences are possible, and we need to select the right one. So the text
is actually incorrect, and needs to be fixed. I suggest:

     The rule emoji_sequence is inherited from [Emoji-Seq].  It defines a
     set of Unicode code point sequences, which must then be encoded as UTF-8.
     Each such sequence forms a single pictograph.
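
To make the code point / octet distinction concrete, a small Python sketch
follows. It assumes one particular emoji modifier sequence (U+1F44D followed by
U+1F3FD) purely as an example; nothing here is taken from the I-D or from
[Emoji-Seq].

```python
# Assumed example: a single pictograph made of two code points
# (THUMBS UP SIGN followed by an emoji skin-tone modifier).
sequence = [0x1F44D, 0x1F3FD]

# Step 1: the code point sequence, independent of any encoding.
text = "".join(chr(cp) for cp in sequence)

# Step 2: the UTF-8 encoding of that sequence, which is what goes on the wire.
octets = text.encode("utf-8")

print(len(sequence), "code points ->", len(octets), "octets")
print(octets.hex(" "))
```

One pictograph, two code points, eight octets: the octet sequence only exists
once an encoding has been chosen.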

Two:

    Reference to unallocated code points SHOULD NOT be treated as an
    error; the corresponding octets SHOULD be processed using the system
    default method for denoting an unallocated or undisplayable code
    point.

I suggest the same change.  It's -maybe- more debatable.  But this document
is describing what to do with the decoded content, because it doesn't describe
anything about C-T-E or charset decoding.  We must assume that the decoding
layer has done its job and now we either have a total error or a codepoint
sequence.  (Some decode layers will have been instructed to hand back
REPLACEMENT CHARACTER when the octet sequence was mangled, which will not be a
valid emoji sequence, and everything works out.)
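
A minimal sketch of that decode-layer behaviour, using Python's built-in UTF-8
codec with errors="replace" as a stand-in for "a decode layer instructed to
hand back REPLACEMENT CHARACTER". The function name decode_lenient and the
truncated input are assumptions made for this example.

```python
REPLACEMENT_CHARACTER = "\uFFFD"


def decode_lenient(octets: bytes) -> str:
    """Decode UTF-8, substituting U+FFFD for any malformed octets."""
    return octets.decode("utf-8", errors="replace")


good = "\U0001F44D".encode("utf-8")  # a well-formed four-octet sequence
mangled = good[:-1]                  # last octet dropped, so no longer valid UTF-8

decoded_good = decode_lenient(good)
decoded_bad = decode_lenient(mangled)

print(decoded_good == "\U0001F44D")          # the original code point survives
print(REPLACEMENT_CHARACTER in decoded_bad)  # the mangled input yields U+FFFD
```

The layer above then sees either the intended code point or U+FFFD, and never
has to look at octets at all.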

Given the clarification above I don't think a change here is strictly required,
but it wouldn't hurt to reemphasize the point about UTF-8. So:

    ... the corresponding UTF-8 encoded code points ...

would work. And it has the added benefit of not using the words "octet" or
"byte" that some find distasteful. (The production of <whatever> is implicit in
the UTF-8 reference.)

Sorry for not catching this sooner.

				Ned
