On 9/17/17 11:43 PM, Carsten Bormann wrote:
The reason for the BOM was so that existing tools will load the file correctly in the absence of character-encoding information. (AFAIR, the ability to make tools like Notepad “do the right thing” was an important step toward the decision to allow non-ASCII characters.) And yes, this is only relevant for plain text (as opposed to HTML) served from the file system.

Employing the Byte Order Mark (BOM), which is needed in UTF-16 but not in UTF-8, as a file “signature” (magic number) to identify plain-text files that use UTF-8 beyond ASCII is well known to have caused many of the problems in migrating to UTF-8. The problems come both from tools that would otherwise have no trouble upgrading from ASCII to UTF-8 but now malfunction because of those BOMs, and from tools that now suddenly *expect* all UTF-8 files beyond ASCII to carry that signature and no longer work when they don’t. The first set of problems is compounded by tools that silently insert BOMs at various stages of processing UTF-8 files (BOM pollution), and by other tools that make any BOM present in a plain-text file invisible to casual examination, so that problems caused by BOM pollution are hard to recognize.

The problems caused by BOM pollution were already well understood at the time the various standards around UTF-8 were written. Unicode itself recommends against it. RFC 3629 has a whole section denouncing it. RFC 5198 is careful to avoid BOM pollution in network Unicode. So the standards message is clear: no BOM pollution.

Yet, on the operational side, the IETF has failed for more than a decade to properly serve UTF-8 in its own systems. Now that we finally provide RFCs with UTF-8 beyond ASCII, we go ahead and embrace BOM pollution as if we didn’t know what we were doing. This sends the message that BOM pollution is actually OK, maybe even the right thing everybody else should be doing as well, and that the standards documents are for preaching on Sundays but to be ignored when it comes to actual practice. It’s as if we were running all our servers without security because that might be considered operationally more expedient.

Regards, Carsten

The RFC Format Design Team discussed this at length. I'm copying
(with permission) the reports on the test that had us agreeing to
use a BOM in these files. The end result was that too many apps
could not display a UTF-8 file correctly without a BOM. -Heather
-------- Forwarded Message --------
Reality check...
I just ran a test with two UTF-8 files, one with a BOM and one without. In case you want to try them yourself, they're at:
http://research.microsoft.com/~dthaler/Utf8NoBom.txt
http://research.microsoft.com/~dthaler/Utf8WithBom.txt
Each includes Latin, Greek, and Cyrillic text.
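(For anyone who wants to reconstruct similar files rather than fetch them, a minimal Python sketch follows; the sample string is a placeholder, not the contents of the linked files.)

    # Sketch: write the same UTF-8 text with and without the
    # three-byte BOM signature EF BB BF.
    import codecs

    SAMPLE = "Latin: naïve; Greek: αβγ; Cyrillic: абв\n"  # placeholder text

    with open("Utf8NoBom.txt", "wb") as f:
        f.write(SAMPLE.encode("utf-8"))

    with open("Utf8WithBom.txt", "wb") as f:
        f.write(codecs.BOM_UTF8)           # b'\xef\xbb\xbf'
        f.write(SAMPLE.encode("utf-8"))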
I tried opening them with a bunch of utilities and browsers (opening local files, not using HTTP), and used browsershots.org to get screenshots of HTTP access across many browsers and platforms. Note that the HTTP server provides no content-encoding headers, so it's up to the app to detect the encoding. I just copied the files to a generic web server, and we may expect others would do the same with their own I-Ds and RFC mirrors.
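(As a sketch, not part of the original message: with no charset declared by the server or file system, "detect" amounts to something like the following check for the UTF-8 signature, shown in Python for illustration.)

    # Sketch of the detection an app must do when no charset is declared:
    # look for the UTF-8 BOM, otherwise guess.
    import codecs

    def read_plain_text(path):
        with open(path, "rb") as f:
            data = f.read()
        if data.startswith(codecs.BOM_UTF8):
            return data[len(codecs.BOM_UTF8):].decode("utf-8")
        try:
            return data.decode("utf-8")    # optimistic: try UTF-8 anyway
        except UnicodeDecodeError:
            return data.decode("latin-1")  # legacy fallback many apps use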
Results:
1) Some apps worked fine with both files. These include things like Notepad, Outlook, Word, File Explorer, and Visual Studio 2012.
2) Some apps failed with both files (probably written to be ASCII-only). These include things like Windiff, stevie (a vi clone), TextPad, the Links browser (on Ubuntu), and the Konqueror browser (on Ubuntu).
3) Everything else, including almost all browsers, displayed the file correctly only with the BOM. This included:
   Windows apps: WordPad
   Windows, local files (no HTTP): IE, Firefox, Chrome
   Windows, HTTP: IE, Firefox, Chrome, Navigator
   Mac OS X: Safari, Camino
   Debian: Opera, Dillo
   Ubuntu: Luakit, Iceape
Conclusion: If we want people to use UTF-8 RFCs and I-Ds with existing tools and browsers today, any UTF-8 text format needs to include a BOM.
-Dave
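(A practical note on producing such files: Python's "utf-8-sig" codec, for example, adds the BOM on write and strips it, if present, on read. A minimal sketch; the filename is hypothetical.)

    # Sketch: "utf-8-sig" writes the BOM automatically on output...
    with open("draft-example.txt", "w", encoding="utf-8-sig") as f:
        f.write("UTF-8 beyond ASCII: Grüße\n")

    # ...and tolerates its presence or absence on input.
    with open("draft-example.txt", encoding="utf-8-sig") as f:
        text = f.read()                    # BOM, if any, is removed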