Re: [Json] Gen-ART and OPS-Dir review of draft-ietf-json-text-sequence-09

Patrik Fältström <paf@xxxxxxxxxx> · Mon, 8 Dec 2014 06:42:32 +0100

> On 7 dec 2014, at 22:07, John Cowan <cowan@xxxxxxxxxxxxxxxx> wrote:
> 
> Patrik Fältström scripsit:
> 
>> I.e. the way I read draft-ietf-json-text-sequence (and I might be
>> wrong), you have specific octet values that act as separators. That
>> only works if the encoding is UTF-8.
> 
> This is a binary representation which has embedded JSON texts represented
> in UTF-8.  Since the first character in a JSON text is necessarily in
> the ASCII repertoire, it is not possible to parse a UTF-16 or UTF-32
> JSON text as UTF-8 and come out with valid JSON.

My point is that if you talk about specific characters, or reference RFC20 or whatnot, then you only get RS as that octet if you use UTF-8 encoding. If you use UTF-16, then RS is not a single octet (0x1E), nor is RS the only character that has 0x1E as one of its octets.
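
A minimal Python sketch of this octet-level point. The sample character U+1E00 is an illustrative assumption, not something taken from the draft:

```python
RS = "\u001e"  # record separator, U+001E

# In UTF-8, RS is exactly the single octet 0x1E.
rs_utf8 = RS.encode("utf-8")

# In UTF-16 (big-endian, no BOM), RS takes two octets: 0x00 0x1E.
rs_utf16 = RS.encode("utf-16-be")

# Illustrative choice: U+1E00 is a different character whose UTF-16
# encoding also contains the octet 0x1E.
other_utf16 = "\u1e00".encode("utf-16-be")

# In UTF-8, that same character contains no 0x1E octet at all.
other_utf8 = "\u1e00".encode("utf-8")

print(rs_utf8.hex(), rs_utf16.hex(), other_utf16.hex(), other_utf8.hex())
```

So scanning for the octet 0x1E is only a reliable way to find RS when the surrounding text is known to be UTF-8 (or another ASCII-compatible encoding).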

I think the problem is that I do not know what an "octet string" is. You either have UTF-8 encoded Unicode strings, or... ;-) In this case, you have a series of UTF-8 encoded Unicode strings, right? Separated by the octet 0x1E, which also happens to be a correctly encoded Unicode character -- the Information Separator Two. This implies the whole thing is a UTF-8 encoded text that is to be parsed like this:

possible-JSON = 1*(not-RS); UTF-8-encoded JSON text
 ; (as specified in RFC7159, but only UTF-8 allowed)

I.e. the blob, to be compliant with this document, MUST be UTF-8 encoded JSON.

Right?
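
A rough sketch of that reading in Python, assuming the blob is split on the octet 0x1E and each piece is decoded strictly as UTF-8 before being parsed as JSON. The function name `parse_json_text_sequence` is hypothetical, skipping empty pieces is my assumption, and none of the draft's error handling is attempted:

```python
import json

RS = b"\x1e"  # the separator octet


def parse_json_text_sequence(blob: bytes) -> list:
    """Split on RS octets and parse each non-empty piece as UTF-8 JSON."""
    results = []
    for chunk in blob.split(RS):
        if not chunk.strip():
            continue  # assumption: ignore empty pieces, e.g. before a leading RS
        text = chunk.decode("utf-8")  # raises UnicodeDecodeError if not UTF-8
        results.append(json.loads(text))  # raises ValueError if not valid JSON
    return results


# Two JSON texts, each introduced by RS; the second contains a non-ASCII
# character (U+00E4) encoded as UTF-8.
blob = b'\x1e{"a": 1}\n\x1e["\xc3\xa4"]\n'
parsed = parse_json_text_sequence(blob)
print(parsed)
```

A piece that is not valid UTF-8, or that decodes but is not a JSON text, makes this sketch raise rather than recover.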

> However, I grant that mentioning UTF-8 only in an ABNF comment is not
> really prominent enough.  Proposed wording change:
> 
> For:
> 
>   In prose: a series of octet strings, each containing any octet other
>   than a record separator (RS) (0x1E) [RFC0020], all octet strings
>   separated from each other by RS octets.  Each octet string in the
>   sequence is to be parsed as a JSON text.
> 
> read:
> 
>   In prose: a series of octet strings, each containing any octet other
>   than a record separator (RS) (0x1E) [RFC0020], all octet strings
>   separated from each other by RS octets.  Each octet string in the
>   sequence is to be parsed as a JSON text in UTF-8 encoding.
> 
> and add a suitable reference to UTF-8.

I would say that what you have said above is:

This specifies a series of UTF-8 encoded Unicode strings, each to be interpreted as JSON text. The strings are separated by the octet 0x1E (which is the UTF-8 encoding of the Unicode character U+001E, INFORMATION SEPARATOR TWO). Because of this, that character must be escaped, for example using the \u001E notation, if it occurs in an attribute value.
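
As a small illustration of the escaping, here is what Python's standard `json` module does with a string value containing U+001E. This only shows one encoder's behaviour, and the sample value is made up:

```python
import json

# A string value with a raw U+001E in the middle (made-up sample data).
value = {"note": "before\u001eafter"}

# json.dumps escapes control characters inside strings, so the serialized
# text carries the six characters \u001e rather than the raw octet 0x1E.
encoded = json.dumps(value).encode("utf-8")
has_raw_rs = b"\x1e" in encoded

# Parsing the text again restores the original character.
round_tripped = json.loads(encoded.decode("utf-8"))

print(encoded, has_raw_rs)
```

With the value escaped this way, no octet inside the JSON text can be mistaken for the separator.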

>> Ok, so what you say is that a string in an attribute value in the JSON
>> blob can still start with U+FEFF?
> 
> Just so.

Good.

   Patrik
