On Thu, May 13, 2010 at 01:47:45PM +0200, Eyvind Bernhardsen wrote:
> I just did a quick test with a plain text file; it was detected as
> text both with and without a utf8 BOM. Looking at the code,
> characters >= 128 are considered printable so the BOM shouldn't make
> any difference at all. Do you have an example utf8 text file that is
> misdetected as binary?

Though the UTF-8 BOM does not present any problem for the automatic text
detector, it is another piece from Microsoft that creates
interoperability issues when you work with non-ASCII text files. In
short:

1. Microsoft editors and tools like to add a UTF-8 BOM to files, and
   you cannot turn this behavior off.

2. Many tools (such as the Microsoft compiler) are incapable of
   recognizing UTF-8 files without a BOM, so they mangle all non-ASCII
   characters.

#1 is a problem because it creates changes consisting solely of adding
a UTF-8 BOM. Moreover, users of non-Windows platforms are not exactly
thrilled to find a UTF-8 BOM at the beginning of every text file.

The ability to automatically add a UTF-8 BOM on Windows to text files
that are marked as "unicode" could be helpful, but that is just one
part of the larger problem of how to deal with text files in "legacy"
encodings, which are still widely used on Windows.

Dmitry
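Both points can be illustrated with a small sketch. This is Python for brevity (git itself is C), and `looks_binary` is only a rough stand-in for the printable-character heuristic Eyvind describes, not git's actual code: the three BOM bytes EF BB BF are all >= 128, so a heuristic that treats high bytes as printable never flags them, while a clean-filter-style `strip_bom` shows how BOM-only diffs could be avoided.

```python
UTF8_BOM = b"\xef\xbb\xbf"

def looks_binary(data: bytes) -> bool:
    # Rough stand-in for the heuristic described above: bytes >= 128
    # count as printable; data is "binary" if it contains a NUL or
    # too many non-whitespace control bytes.
    if b"\x00" in data:
        return True
    nonprintable = sum(1 for b in data
                       if b < 32 and b not in (9, 10, 13))
    return nonprintable * 128 > len(data)

def strip_bom(data: bytes) -> bytes:
    # Drop a leading UTF-8 BOM (as a hypothetical clean filter might),
    # so files edited on Windows do not differ solely in their first
    # three bytes.
    if data.startswith(UTF8_BOM):
        return data[len(UTF8_BOM):]
    return data

sample = UTF8_BOM + "héllo\n".encode("utf-8")
assert not looks_binary(sample)   # BOM bytes are all >= 128: still "text"
assert strip_bom(sample) == "héllo\n".encode("utf-8")
```

The reverse direction (re-adding the BOM on checkout for tools that require it) would be the corresponding smudge-style operation.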