Re: BOMs

ht@xxxxxxxxxxxx (Henry S. Thompson) · Mon, 18 Nov 2013 12:59:18 +0000

Bjoern Hoehrmann writes:

> Perl's JSON module gives me
>
>   malformed JSON string, neither array, object, number, string
>   or atom, at character offset 0 (before "\x{ef}\x{bb}\x{bf}[]")
>
> Python's json module gives me
>
>   ValueError: No JSON object could be decoded
>
> Go's "encoding/json" module gives me
>
>   invalid character 'ï' looking for beginning of value

I'm curious to know what level you're invoking the parser at.  As
implied by my previous post about the Python 'requests' package, it
handles application/json resources by stripping any initial BOM it
finds -- you can try this with

>>> import requests
>>> r = requests.get("http://www.ltg.ed.ac.uk/ov-test/b16le.json")
>>> r.json()

Signatures are not part of the text of a document, as the Unicode spec
makes clear, so asking what happens when you pass a string beginning
with a BOM to a parser is not really the right question in this
context, is it?

As I tried to say in an earlier post, there's a distinction which
needs to be carefully insisted on between, on the one hand, languages
and their parsers, where I agree signatures/BOMs have no place, and,
on the other hand, (media-typed) resources/entities/payloads and _their_
processing, where a discussion of BOMs/signatures _is_ appropriate
and, often, necessary.
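
To make the two levels concrete, here is a minimal Python 3 sketch. It is
an illustration only, not how 'requests' or any other library does it. The
names decode_payload and parse_json_payload are hypothetical, and falling
back to UTF-8 when no signature is present is an assumption. The payload
layer looks for a signature, strips it and decodes; the language-level
parser (json.loads) only ever sees text.

```python
import codecs
import json

# UTF-32 is checked before UTF-16 because the UTF-32 LE signature
# begins with the same two bytes as the UTF-16 LE signature.
_SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_payload(raw, default="utf-8"):
    """Payload level: strip any signature, then decode bytes to text."""
    for bom, encoding in _SIGNATURES:
        if raw.startswith(bom):
            return raw[len(bom):].decode(encoding)
    # No signature found: fall back to the default (an assumption here).
    return raw.decode(default)


def parse_json_payload(raw):
    """Hand signature-free text to the language-level parser."""
    return json.loads(decode_payload(raw))
```

With this split, parse_json_payload(b"\xef\xbb\xbf[]") yields an empty
list, while calling json.loads directly on a string that still begins
with U+FEFF is rejected, much like the parser errors quoted above.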

BTW I agree that the status of the UTF-8 BOM as signature is slightly
hazy, but again the Unicode spec itself [1] says

  "this sequence can serve as signature for UTF-8 encoded text where
   the character set is unmarked"
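
As a small illustration of that signature reading (a sketch using only
Python 3's standard codecs; the sample bytes are made up): the plain
"utf-8" codec keeps a leading EF BB BF as the character U+FEFF, whereas
the "utf-8-sig" codec treats it as a signature and drops it.

```python
raw = b"\xef\xbb\xbf[]"  # made-up sample: UTF-8 signature followed by "[]"

# Plain UTF-8 decoding keeps U+FEFF as the first character of the text.
as_text = raw.decode("utf-8")

# "utf-8-sig" treats a leading EF BB BF as a signature and removes it.
as_signed = raw.decode("utf-8-sig")

print(repr(as_text))
print(repr(as_signed))
```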

ht

[1] http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf
-- 
       Henry S. Thompson, School of Informatics, University of Edinburgh
      10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
                Fax: (44) 131 650-4587, e-mail: ht@xxxxxxxxxxxx
                       URL: http://www.ltg.ed.ac.uk/~ht/
 [mail from me _always_ has a .sig like this -- mail without it is forged spam]