On Tue, Mar 2, 2021, at 9:15 AM, John C Klensin wrote:
I don't know whose concern was to make that particular switch and why, but my concern about either (and, I'm guessing, Martin's) is that almost all Unicode code points (those outside the ASCII range) require more than one octet to represent in any encoding scheme. For UTF-8, which the I-D requires, the number of octets is variable. So using "octet" as a unit of -- well, much of anything -- is, at best, confusing.
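(To make that variable-width point concrete, here is a quick Python sketch; the sample characters are just ones I picked:)

    # Octets needed to encode a few code points in UTF-8:
    for ch in ["A", "é", "€", "😀"]:
        print(f"U+{ord(ch):04X} -> {len(ch.encode('utf-8'))} octet(s)")
    # U+0041 -> 1, U+00E9 -> 2, U+20AC -> 3, U+1F600 -> 4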
"octet" appears in two places.
One:
The rule emoji_sequence is inherited from [Emoji-Seq]. It defines a
set of octet sequences, each of which forms a single pictograph.
I would replace "octet" with "code point". The referenced document only describes sequences of code points. The encoding of those into octets is orthogonal, and will be described by the content-type and content-transfer-encoding jointly. So, I think this change is a definite improvement to accuracy, and is worth making.
Two:
Reference to unallocated code points SHOULD NOT be treated as an
error; the corresponding octets SHOULD be processed using the system
default method for denoting an unallocated or undisplayable code
point.
I suggest the same change. It's -maybe- more debatable. But this document describes what to do with the decoded content: it says nothing about C-T-E or charset decoding, so we must assume that the decoding layer has done its job and that we now have either a total error or a code point sequence. (Some decode layers will have been instructed to hand back REPLACEMENT CHARACTER when the octet sequence was mangled; that will not be a valid emoji sequence, and everything works out.)
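(Sketch of that last case, again in Python; the exact octets are just an example of a truncated UTF-8 sequence:)

    mangled = b"\xf0\x9f\x91"            # truncated UTF-8 (start of U+1F44D)
    decoded = mangled.decode("utf-8", errors="replace")
    print(repr(decoded))                 # contains U+FFFD, REPLACEMENT CHARACTER
    # U+FFFD is not a valid emoji sequence, so the "undisplayable code point"
    # handling above applies and everything still works out.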
--
rjbs