I think Ned's proposed fix here is exactly right.

Barry

On Tue, Mar 2, 2021 at 11:40 AM Ned Freed <ned.freed@xxxxxxxxxxx> wrote:
>
> > On Tue, Mar 2, 2021, at 9:15 AM, John C Klensin wrote:
> > > I don't know whose concern was to make that particular switch
> > > and why, but my concern about either (and, I'm guessing,
> > > Martin's) is that almost all Unicode code points (those outside
> > > the ASCII range) require more than one octet to represent in any
> > > encoding scheme. For UTF-8, which the I-D requires, the number
> > > of octets is variable. So using "octet" as a unit of --well,
> > > much of anything--is, at best, confusing.
>
> > "octet" appears in two places.
>
> > One:
>
> >    The rule emoji_sequence is inherited from [Emoji-Seq]. It defines a
> >    set of octet sequences, each of which forms a single pictograph.
>
> > I would replace "octet" with "code point". The referenced document only
> > describes sequences of code points. The encoding of those into octets is
> > orthogonal, and will be described by the content-type and
> > content-transfer-encoding jointly. So, I think this change is a definite
> > improvement to accuracy, and is worth making.
>
> Sigh. I noticed but then completely forgot about this. emoji_seq only goes
> as far as code points, which leaves out the subsequent UTF-8 encoding.
>
> More specifically, the base productions at the bottom of the ABNF are
> things like:
>
>    emoji_character := \p{Emoji}
>    default_emoji_presentation_character := \p{Emoji_Presentation}
>    default_text_presentation_character := \P{Emoji_Presentation}
>    emoji_modifier := \p{Emoji_Modifier}
>
> A definition in terms of a regexp that can apply to any encoding of
> Unicode. Which of course makes sense in the context of this standard.
> Why get into encoding when you don't have to?
>
> For us the problem is that, when used as a production, many possible sets
> of octet sequences are possible, and we need to select the right one.
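[Editorial note: Klensin's point about variable-length encoding can be made concrete with a short Python sketch, not part of the original thread. The same two-code-point sequence occupies a different number of octets depending on the encoding, which is why "octet" makes a poor unit for the [Emoji-Seq] productions; the specific characters below are just examples.]

```python
# Two code points: U+2764 HEAVY BLACK HEART and U+1F44D THUMBS UP SIGN.
seq = "\u2764\U0001F44D"

assert len(seq) == 2                      # 2 code points, regardless of encoding
assert len(seq.encode("utf-8")) == 7      # 3 + 4 octets in UTF-8
assert len(seq.encode("utf-16-le")) == 6  # 2 + 4 octets in UTF-16 (surrogate pair)
assert len("A".encode("utf-8")) == 1      # only ASCII is single-octet in UTF-8
```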
> So the text is actually incorrect, and needs to be fixed. I suggest:
>
>    The rule emoji_sequence is inherited from [Emoji-Seq]. It defines a
>    set of Unicode code point sequences, which must then be encoded as
>    UTF-8. Each such sequence forms a single pictograph.
>
> > Two:
>
> >    Reference to unallocated code points SHOULD NOT be treated as an
> >    error; the corresponding octets SHOULD be processed using the system
> >    default method for denoting an unallocated or undisplayable code
> >    point.
>
> > I suggest the same change. It's -maybe- more debatable. But this
> > document is describing what to do with the decoded content, because it
> > doesn't describe anything about C-T-E or charset decoding. We must
> > assume that the decoding layer has done its job and now we either have
> > a total error or a code point sequence. (Some decode layers will have
> > been instructed to hand back REPLACEMENT CHARACTER when the octet
> > sequence was mangled, which will not be a valid emoji sequence, and
> > everything works out.)
>
> Given the clarification above I don't think a change here is strictly
> required, but it wouldn't hurt to reemphasize the point about UTF-8. So:
>
>    ... the corresponding UTF-8 encoded code points ...
>
> would work. And it has the added benefit of not using the words "octet"
> or "byte" that some find distasteful. (The production of <whatever> is
> implicit in the UTF-8 reference.)
>
> Sorry for not catching this sooner.
>
> Ned
>
> --
> last-call mailing list
> last-call@xxxxxxxx
> https://www.ietf.org/mailman/listinfo/last-call
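[Editorial note: the parenthetical about decode layers handing back REPLACEMENT CHARACTER can be sketched as follows, assuming a decoder with a Python-style "replace" error policy; this sketch is not part of the original thread.]

```python
# A truncated UTF-8 encoding of U+1F44D THUMBS UP SIGN.
good = "\U0001F44D".encode("utf-8")  # b'\xf0\x9f\x91\x8d'
mangled = good[:3]                   # drop the final octet

# A decoder configured to replace bad input yields U+FFFD REPLACEMENT
# CHARACTER, which is not a valid emoji sequence, so validation at the
# code point layer still rejects the content cleanly.
decoded = mangled.decode("utf-8", errors="replace")
assert decoded == "\ufffd"
```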