Re: [Last-Call] New Version Notification for draft-crocker-inreply-react-07.txt

John C Klensin <john@xxxxxxx> · Tue, 02 Mar 2021 14:33:09 -0500

Ned,

This works, as would almost anything else that gets rid of
things that imply specific-length objects like "octet" or "byte".

However (consider this background/informational if you don't
want to make further changes)... My preference, given the
heavy dependency on both UTS#51 in particular and the Unicode
specs in general, would be to stick with Unicode terminology unless
we have a substantive need to depart from it.  That would imply
using "code point" throughout, partially to avoid getting tied
up with encodings or local representations.  In particular,
while the spec requires UTF-8 on the wire, these strings will
actually be processed by MUAs and libraries, and many operating
systems do not store or use UTF-8 internally, making the UTF-8
emphasis confusing.
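
To make the distinction between the units concrete, here is a minimal
sketch (illustration only; the `measure` helper and the sample
sequence are my assumptions, not anything taken from the I-D or
UTS#51).  It counts the same string in code points, in UTF-8 octets,
and in UTF-16 code units, the latter being what several platforms use
internally:

```python
def measure(text: str) -> dict:
    """Size of `text` in three different units (hypothetical helper)."""
    return {
        # Python strings are sequences of code points.
        "code_points": len(text),
        # What the spec requires on the wire.
        "utf8_octets": len(text.encode("utf-8")),
        # Two octets per UTF-16 code unit; "-le" avoids emitting a BOM.
        "utf16_units": len(text.encode("utf-16-le")) // 2,
    }


# U+1F44D THUMBS UP SIGN followed by U+1F3FD (a skin-tone modifier).
THUMBS_UP_MEDIUM = "\U0001F44D\U0001F3FD"

print(measure("A"))               # ASCII: one of each
print(measure(THUMBS_UP_MEDIUM))  # 2 code points, 8 octets, 4 units
```

Only the code point count is independent of the representation,
which is the reason for preferring that term.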

But I do not feel strongly about the above in this context, so,
again, your changes work for me.

    john

--On Tuesday, 02 March, 2021 08:27 -0800 Ned Freed
<ned.freed@xxxxxxxxxxx> wrote:

>> On Tue, Mar 2, 2021, at 9:15 AM, John C Klensin wrote:
>> > I don't know whose concern was to make that particular
>> > switch and why, but my concern about either (and, I'm
>> > guessing, Martin's) is that almost all Unicode code points
>> > (those outside the ASCII range) require more than one octet
>> > to represent in any encoding scheme.  For UTF-8, which the
>> > I-D requires, the number of octets is variable.  So using
>> > "octet" as a unit of --well, much of anything--is, at best,
>> > confusing.
> 
>> "octet" appears in two places.
> 
>> One:
> 
>>    The rule emoji_sequence is inherited from [Emoji-Seq].  It
>>    defines a set of octet sequences, each of which forms a
>>    single pictograph.
> 
>> I would replace "octet" with "code point".  The referenced
>> document only describes sequences of code points.  The
>> encoding of those into octets is orthogonal, and will be
>> described by the content-type and content-transfer-encoding
>> jointly.  So, I think this change is a definite improvement
>> to accuracy, and is worth making.
> 
> Sigh. I noticed but then completely forgot about this.
> emoji_seq only goes as far as code points, which leaves out
> the subsequent UTF-8 encoding.
> 
> More specifically, the base productions at the bottom of the
> ABNF are things like:
> 
>   emoji_character := \p{Emoji}
>   default_emoji_presentation_character := \p{Emoji_Presentation}
>   default_text_presentation_character := \P{Emoji_Presentation}
>   emoji_modifier := \p{Emoji_Modifier}
> 
> A definition in terms of a regexp that can apply to any
> encoding of Unicode. Which of course makes sense in the
> context of this standard. Why get into encoding when you don't
> have to?
> 
> For us the problem is that when this is used as a production,
> many sets of octet sequences are possible, and we need to select
> the right one. So the text is actually incorrect, and needs to be
> fixed. I suggest:
> 
>     The rule emoji_sequence is inherited from [Emoji-Seq].  It
>     defines a set of Unicode code point sequences, which must
>     then be encoded as UTF-8.  Each such sequence forms a
>     single pictograph.
> 
>> Two:
> 
>>    Reference to unallocated code points SHOULD NOT be treated
>>    as an error; the corresponding octets SHOULD be processed
>>    using the system default method for denoting an
>>    unallocated or undisplayable code point.
> 
>> I suggest the same change.  It's -maybe- more debatable.  But
>> this document is describing what to do with the decoded
>> content, because it doesn't describe anything about C-T-E or
>> charset decoding.  We must assume that the decoding layer has
>> done its job and now we either have a total error or a
>> codepoint sequence.  (Some decode layers will have been
>> instructed to hand back REPLACEMENT CHARACTER when the octet
>> sequence was mangled, which will not be a valid emoji
>> sequence, and everything works out.)
> 
> Given the clarification above I don't think a change here is
> strictly required, but it wouldn't hurt to reemphasize the
> point about UTF-8. So:
> 
>    ... the corresponding UTF-8 encoded code points ...
> 
> would work. And it has the added benefit of not using the
> words "octet" or "byte" that some find distasteful. (The
> production of <whatever> is implicit in the UTF-8 reference.)
> 
> Sorry for not catching this sooner.
> 
> 				Ned
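
On the decode-layer point in the quoted text above, a minimal sketch
of a decoder that hands back REPLACEMENT CHARACTER (U+FFFD) for a
mangled octet sequence rather than failing.  This is an illustration
only: `decode_reaction` is a hypothetical name, and the choice of
Python's "replace" error handler is an assumption about how such a
layer might be configured, not something the I-D specifies.

```python
def decode_reaction(octets: bytes) -> str:
    """Decode UTF-8 octets; malformed input becomes U+FFFD, not an error."""
    return octets.decode("utf-8", errors="replace")


good = "\U0001F44D".encode("utf-8")  # U+1F44D THUMBS UP SIGN, four octets
mangled = good[:-1]                  # drop the final octet

print(decode_reaction(good) == "\U0001F44D")  # well-formed input round-trips
print("\ufffd" in decode_reaction(mangled))   # mangled input contains U+FFFD
```

The result for the mangled input contains U+FFFD and not the
original emoji, so a later match against emoji_sequence would fail,
as described.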
