I think Ned's proposed fix here is exactly right.

Barry

On Tue, Mar 2, 2021 at 11:40 AM Ned Freed <ned.freed@xxxxxxxxxxx> wrote:
>
> > On Tue, Mar 2, 2021, at 9:15 AM, John C Klensin wrote:
> > > I don't know whose concern was to make that particular switch
> > > and why, but my concern about either (and, I'm guessing,
> > > Martin's) is that almost all Unicode code points (those outside
> > > the ASCII range) require more than one octet to represent in any
> > > encoding scheme. For UTF-8, which the I-D requires, the number
> > > of octets is variable. So using "octet" as a unit of --well,
> > > much of anything--is, at best, confusing.
>
> > "octet" appears in two places.
>
> > One:
>
> >    The rule emoji_sequence is inherited from [Emoji-Seq]. It defines a
> >    set of octet sequences, each of which forms a single pictograph.
>
> > I would replace "octet" with "code point". The referenced document only
> > describes sequences of code points. The encoding of those into octets is
> > orthogonal, and will be described by the content-type and
> > content-transfer-encoding jointly. So, I think this change is a definite
> > improvement to accuracy, and is worth making.
>
> Sigh. I noticed but then completely forgot about this. emoji_seq only goes
> as far as code points, which leaves out the subsequent UTF-8 encoding.
>
> More specifically, the base productions at the bottom of the ABNF are
> things like:
>
>    emoji_character := \p{Emoji}
>    default_emoji_presentation_character := \p{Emoji_Presentation}
>    default_text_presentation_character := \P{Emoji_Presentation}
>    emoji_modifier := \p{Emoji_Modifier}
>
> A definition in terms of a regexp that can apply to any encoding of
> Unicode. Which of course makes sense in the context of this standard.
> Why get into encoding when you don't have to?
>
> For us the problem is that, when used as a production, many possible sets
> of octet sequences are possible, and we need to select the right one.
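[Editorial note: Klensin's point about variable-length encoding can be made concrete with a short Python sketch, not part of the original thread. The same two-code-point sequence occupies a different number of octets depending on the encoding, which is why "octet" makes a poor unit for the [Emoji-Seq] productions; the specific characters below are just examples.]

```python
# Two code points: U+2764 HEAVY BLACK HEART and U+1F44D THUMBS UP SIGN.
seq = "\u2764\U0001F44D"

assert len(seq) == 2                      # 2 code points, regardless of encoding
assert len(seq.encode("utf-8")) == 7      # 3 + 4 octets in UTF-8
assert len(seq.encode("utf-16-le")) == 6  # 2 + 4 octets in UTF-16 (surrogate pair)
assert len("A".encode("utf-8")) == 1      # only ASCII is single-octet in UTF-8
```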
> So the text is actually incorrect, and needs to be fixed. I suggest:
>
>    The rule emoji_sequence is inherited from [Emoji-Seq]. It defines a
>    set of Unicode code point sequences, which must then be encoded as
>    UTF-8. Each such sequence forms a single pictograph.
>
> > Two:
>
> >    Reference to unallocated code points SHOULD NOT be treated as an
> >    error; the corresponding octets SHOULD be processed using the system
> >    default method for denoting an unallocated or undisplayable code
> >    point.
>
> > I suggest the same change. It's -maybe- more debatable. But this
> > document is describing what to do with the decoded content, because it
> > doesn't describe anything about C-T-E or charset decoding. We must
> > assume that the decoding layer has done its job and now we either have
> > a total error or a code point sequence. (Some decode layers will have
> > been instructed to hand back REPLACEMENT CHARACTER when the octet
> > sequence was mangled, which will not be a valid emoji sequence, and
> > everything works out.)
>
> Given the clarification above I don't think a change here is strictly
> required, but it wouldn't hurt to reemphasize the point about UTF-8. So:
>
>    ... the corresponding UTF-8 encoded code points ...
>
> would work. And it has the added benefit of not using the words "octet"
> or "byte" that some find distasteful. (The production of <whatever> is
> implicit in the UTF-8 reference.)
>
> Sorry for not catching this sooner.
>
> Ned
>
> --
> last-call mailing list
> last-call@xxxxxxxx
> https://www.ietf.org/mailman/listinfo/last-call
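[Editorial note: the parenthetical about decode layers handing back REPLACEMENT CHARACTER can be sketched as follows, assuming a decoder with a Python-style "replace" error policy; this sketch is not part of the original thread.]

```python
# A truncated UTF-8 encoding of U+1F44D THUMBS UP SIGN.
good = "\U0001F44D".encode("utf-8")  # b'\xf0\x9f\x91\x8d'
mangled = good[:3]                   # drop the final octet

# A decoder configured to replace bad input yields U+FFFD REPLACEMENT
# CHARACTER, which is not a valid emoji sequence, so validation at the
# code point layer still rejects the content cleanly.
decoded = mangled.decode("utf-8", errors="replace")
assert decoded == "\ufffd"
```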