Supporting automated removal of the UTF-8 BOM

Hello team,

I am curious whether you would be open to an effort to extend git with
the ability to actively manage the presence of a UTF-8 BOM in the index
(and working tree), probably via the .gitattributes interface. I'm aware
that the existing encoding mechanism is already BOM-aware and can deduce
a file's charset from the BOM's presence, but unfortunately it doesn't
seem to be possible to have git strip the BOM from *UTF-8* content (or
from particular content) before storing it in the index without the use
of pre-commit hooks.

After giving it some thought and assuming that you would be open to the
idea in principle, I can see several different approaches or
possibilities.

One option would be to add a new charset named UTF-8-BOM (which iconv
does not itself recognize as a separate charset) for the express purpose
of allowing particular file types to always be stored BOM-free in the
index. This would be in the same vein as the support that was added for
UTF-16XX-BOM, where content is converted to (BOM-free) UTF-8 for storage
in the index and then converted back to UTF-16XX-BOM when checked out.
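
To make the intended round trip concrete, here is a rough sketch in
Python (used purely for illustration; this is not git code). The
function names are hypothetical, and Python's "utf-8-sig" codec merely
stands in for what a UTF-8-BOM charset might do:

```python
def worktree_to_index(data: bytes) -> bytes:
    # "utf-8-sig" drops one leading BOM if present, so BOM and BOM-free
    # input both end up as the same BOM-free UTF-8 bytes.
    return data.decode("utf-8-sig").encode("utf-8")


def index_to_worktree(data: bytes) -> bytes:
    # Decoding with "utf-8-sig" first guards against doubling a BOM that
    # somehow made it into the index; encoding always prepends one BOM.
    return data.decode("utf-8-sig").encode("utf-8-sig")
```

Note that this sketch raises UnicodeDecodeError on input that is not
valid UTF-8; what git should do in that case is an open question.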

While it's true that most editors that emit UTF-8 files with a BOM will
still work with them if the BOM is removed (obviating the *need* to
convert back to UTF-8-BOM on checkout, unlike editors expecting
UTF-16LE-BOM, which will fail if the file is checked out as UTF-8), this
would spare users on other platforms from having to deal with the BOM
when a Windows user checks the file in. It would also prevent the
needless churn of carelessly committed diffs that add or remove the BOM.
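
To illustrate the churn: because the BOM is part of the file's bytes,
adding or removing it changes the blob even when the text is otherwise
identical. A small Python sketch (illustration only; git_blob_id is a
hypothetical helper, and it assumes the default SHA-1 object format):

```python
import codecs
import hashlib


def git_blob_id(data: bytes) -> str:
    # A blob's id is the SHA-1 of a "blob <size>\0" header plus the content.
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


content = b"<Project />\n"
without_bom = git_blob_id(content)
with_bom = git_blob_id(codecs.BOM_UTF8 + content)

# Same text, different blobs: the change shows up as a modification.
print(without_bom != with_bom)
```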

An alternative that wouldn't involve adding a new UTF-8-BOM charset
would be to make core git aware of the BOM and able to treat its
presence or absence as either significant or ignorable, in the same
fashion as git's existing option to transparently convert line endings
between LF and CRLF at commit/checkout. However, I was under the
impression that the line-ending conversion is largely considered a dirty
hack to handle a very common problem, and not something that the team
would be eager to expand on by adding a similar option for the BOM.

One other option would be to add support for BOM removal only through
.gitattributes rather than via an explicit charset conversion, i.e. by
adding a "bom" attribute, like the existing "eol" attribute, that could
be specified for particular file types, for example:

    *.csproj text=auto eol=crlf bom=[add|strip]

With bom=add, the file would be stored in the index without a BOM, but
one would be added on checkout (as a parallel to how eol=crlf acts). With
bom=strip, the file would again be stored in the index without a BOM, and
git would not add a BOM on checkout.
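
The semantics I have in mind for the two values can be sketched as
follows (Python for illustration only; to_index and to_worktree are
hypothetical names, not existing git functions, and "bom" is the
proposed attribute rather than anything git supports today):

```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF


def to_index(data: bytes) -> bytes:
    # Both bom=add and bom=strip store the content without a BOM.
    return data[len(BOM):] if data.startswith(BOM) else data


def to_worktree(data: bytes, bom: str) -> bytes:
    # Normalise first so that a BOM is never written twice.
    body = to_index(data)
    if bom == "add":
        return BOM + body
    if bom == "strip":
        return body
    raise ValueError(f"unknown bom value: {bom!r}")
```

In both modes a file that already lacks a BOM passes through to_index
unchanged, so only the checkout side differs between the two values.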

Curious about your thoughts on the matter. I've gone to great lengths to
purge the BOM from my machine and have managed to hack most of
Microsoft's tools to refrain from adding it, but alas Visual Studio still
insists on inserting the BOM into machine-generated content (for the
rest, I use a VS extension to strip the BOM from regular code files
opened/saved through Visual Studio).

Thanks for the time,

Mahmoud


