Okay, here are some more tests.
http://www.sw.it.aoyama.ac.jp/2013/pub/json_tests/test1_utf8_nobom.json
http://www.sw.it.aoyama.ac.jp/2013/pub/json_tests/test2_utf8_bom.json
They are self-describing JSON files served with application/json, the
first without a BOM, and the second with a BOM.
They contain some Japanese, and a tiny bit of Spanish.
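For anyone who wants to check what is actually at the start of such a
file, here is a minimal Python sketch that looks at the raw bytes. The
helper name sniff_signature is made up for illustration, and fetching
the bytes (for example with requests) is left out.

```python
import codecs

# Checked in this order because the UTF-32 LE BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16 LE BOM (FF FE).
_SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_signature(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is none.

    Hypothetical helper: it only inspects the first bytes of the payload
    and does not try to validate the rest of the content.
    """
    for bom, name in _SIGNATURES:
        if data.startswith(bom):
            return name
    return None
```

With the two files above, the first should give None and the second
"utf-8", assuming the server delivers the bytes unchanged.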
[see more below]
On 2013/11/18 21:59, Henry S. Thompson wrote:
Bjoern Hoehrmann writes:
Perl's JSON module gives me
    malformed JSON string, neither array, object, number, string
    or atom, at character offset 0 (before "\x{ef}\x{bb}\x{bf}[]")
Python's json module gives me
    ValueError: No JSON object could be decoded
Go's "encoding/json" module gives me
    invalid character 'ï' looking for beginning of value
I'm curious to know what level you're invoking the parser at. As
implied by my previous post about the Python 'requests' package, it
handles application/json resources by stripping any initial BOM it
finds -- you can try this with
    import requests
    r = requests.get("http://www.ltg.ed.ac.uk/ov-test/b16le.json")
    r.json()
I get a 404 on this example. I can put up UTF-16 examples, too.
Regards, Martin.
Signatures are not part of the text of a document, as the Unicode spec
makes clear, so asking what happens when you pass a string beginning
with a BOM to a parser is not really the right question in this
context, is it?
As I tried to say in an earlier post, there's a distinction which
needs to be carefully insisted on between, on the one hand, languages
and their parsers, where I agree signatures/BOMs have no place, and,
on the other hand, (media-typed) resources/entities/payloads and _their_
processing, where a discussion of BOMs/signatures _is_ appropriate
and, often, necessary.
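To make that distinction concrete, here is a rough Python sketch with
the two levels kept apart. The function names are hypothetical, and it
handles only the UTF-8 signature; it is not meant to show what requests
or any other library actually does internally.

```python
import codecs
import json

def parse_json_text(text: str):
    """Language level: the parser is handed JSON text and nothing else."""
    return json.loads(text)

def load_json_payload(payload: bytes):
    """Payload level: drop a leading UTF-8 signature, decode, then parse.

    Assumes the payload is UTF-8; other encodings are out of scope here.
    """
    if payload.startswith(codecs.BOM_UTF8):
        payload = payload[len(codecs.BOM_UTF8):]
    return parse_json_text(payload.decode("utf-8"))
```

Here load_json_payload(b"\xef\xbb\xbf[]") gives an empty list, while
handing the decoded string "\ufeff[]" straight to parse_json_text
raises a ValueError, much like the Python result quoted above.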
BTW I agree that the status of the UTF-8 BOM as signature is slightly
hazy, but again the Unicode spec itself [1] says
    "this sequence can serve as signature for UTF-8 encoded text where
    the character set is unmarked"
ht
[1] http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf