Re: [Last-Call] New Version Notification for draft-crocker-inreply-react-07.txt

> On Tue, Mar 2, 2021, at 9:15 AM, John C Klensin wrote:
> > I don't know whose concern was to make that particular switch
> > and why, but my concern about either (and, I'm guessing,
> > Martin's) is that almost all Unicode code points (those outside
> > the ASCII range) require more than one octet to represent in any
> > encoding scheme.  For UTF-8, which the I-D requires, the number
> > of octets is variable.  So using "octet" as a unit of --well,
> > much of anything--is, at best, confusing.

> "octet" appears in two places.

> One:

>    The rule emoji_sequence is inherited from [Emoji-Seq].  It defines a
>    set of octet sequences, each of which forms a single pictograph.

> I would replace "octet" with "code point".  The referenced document only
> describes sequences of code points.  The encoding of those into octets is
> orthogonal, and will be described by the content-type and
> content-transfer-encoding jointly.  So, I think this change is a definite
> improvement to accuracy, and is worth making.

Sigh. I noticed but then completely forgot about this. emoji_seq only goes as
far as code points, which leaves out the subsequent UTF-8 encoding.

More specifically, the base productions at the bottom of the ABNF are things
like:

  emoji_character := \p{Emoji}
  default_emoji_presentation_character := \p{Emoji_Presentation}
  default_text_presentation_character := \P{Emoji_Presentation}
  emoji_modifier := \p{Emoji_Modifier}

These are definitions in terms of regexp properties, which can apply to any
encoding of Unicode. That of course makes sense in the context of this
standard. Why get into encoding when you don't have to?

For us the problem is that when this is used as a production, many different
sets of octet sequences are possible, and we need to select the right one. So
the text is actually incorrect, and needs to be fixed. I suggest:

    The rule emoji_sequence is inherited from [Emoji-Seq].  It defines a
    set of Unicode code point sequences, which must then be encoded as UTF-8.
    Each such sequence forms a single pictograph.
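
To illustrate why the distinction matters, here is a minimal sketch (not from
the draft) showing one code point sequence producing different octet sequences
under different encodings. The choice of RED HEART plus VARIATION SELECTOR-16
is only an assumed example, and the variable names are hypothetical.

```python
# Assumed example: RED HEART (U+2764) followed by VARIATION SELECTOR-16
# (U+FE0F), two code points that render as a single pictograph.
code_points = [0x2764, 0xFE0F]
text = "".join(chr(cp) for cp in code_points)

# The same code point sequence maps to different octet sequences
# depending on which encoding is applied afterwards.
utf8_octets = text.encode("utf-8")
utf16_octets = text.encode("utf-16-be")

print(len(code_points))        # 2
print(utf8_octets.hex(" "))    # e2 9d a4 ef b8 8f
print(utf16_octets.hex(" "))   # 27 64 fe 0f
```

Only the first of these octet sequences is what the suggested text selects,
since it names UTF-8 explicitly.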

> Two:

>    Reference to unallocated code points SHOULD NOT be treated as an
>    error; the corresponding octets SHOULD be processed using the system
>    default method for denoting an unallocated or undisplayable code
>    point.

> I suggest the same change.  It's -maybe- more debatable.  But this document
> is describing what to do with the decoded content, because it doesn't describe
> anything about C-T-E or charset decoding.  We must assume that the decoding
> layer has done its job and now we either have a total error or a codepoint
> sequence.  (Some decode layers will have been instructed to hand back
> REPLACEMENT CHARACTER when the octet sequence was mangled, which will not be a
> valid emoji sequence, and everything works out.)

Given the clarification above I don't think a change here is strictly required,
but it wouldn't hurt to reemphasize the point about UTF-8. So:

   ... the corresponding UTF-8 encoded code points ...

would work. And it has the added benefit of not using the words "octet" or
"byte", which some find distasteful. (The production of <whatever> is implicit
in the UTF-8 reference.)
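
As a rough sketch of the layering discussed above (an illustration under
stated assumptions, not anything the draft specifies): a decode layer can be
told to hand back REPLACEMENT CHARACTER for mangled octets, and the layer
above then only ever sees code points. The helper names are hypothetical, and
testing General_Category "Cn" is just one assumed way to spot an unallocated
code point; that category also covers noncharacters and depends on the Unicode
version the runtime ships with.

```python
import unicodedata

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def decode_utf8_lenient(octets: bytes) -> str:
    """Hypothetical decode layer: mangled input becomes U+FFFD, not an error."""
    return octets.decode("utf-8", errors="replace")


def is_unassigned(ch: str) -> bool:
    """True when the runtime's Unicode data reports General_Category Cn."""
    return unicodedata.category(ch) == "Cn"


# A truncated UTF-8 sequence: the last octet of U+1F44D is missing.
mangled = b"\xf0\x9f\x91"
decoded = decode_utf8_lenient(mangled)

print(REPLACEMENT in decoded)  # True
```

The consumer of `decoded` never deals with octets; it either gets a code point
sequence that matches emoji_sequence or one that does not.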

Sorry for not catching this sooner.

				Ned

-- 
last-call mailing list
last-call@xxxxxxxx
https://www.ietf.org/mailman/listinfo/last-call


