Re: RFC: Native clean/smudge filter for UTF-16 files

Jeff King <peff@xxxxxxxx> · Fri, 24 Nov 2017 13:04:01 -0500

On Thu, Nov 23, 2017 at 04:18:59PM +0100, Lars Schneider wrote:

> Alternatively, I could add a native attribute to Git that translates UTF-16 
> to UTF-8 and back. A conversion function is already available in "mingw.h" [3]
> on Windows. Limiting this feature to Windows wouldn't be a problem from my
> point of view as UTF-16 is only relevant on Windows anyways. The attribute 
> could look like this:
> 
>     *.txt        text encoding=utf-16
>
> There was a previous discussion on the topic and Jonathan already suggested
> a "native" clean/smudge filter in 2010 [4]. Also the "encoding" attribute
> is already present but, as far as I can tell, is only used by the git gui
> for viewing [5].

I would not want to see a proliferation of built-in filters, but it
really seems like text-encoding conversion is a broad and practical one
that many people might benefit from. So just like line-ending
conversion, which _could_ be done by generic filters, it makes sense to
me to support it natively for speed and simplicity.

> Do you think a patch that converts UTF-16 files to UTF-8 via an attribute
> "encoding=utf-16" on Windows would have a chance to get accepted?

You haven't fully specified the semantics here, so let me sketch out
what I think it ought to look like:

 - declare utf8 the "canonical" in-repo representation, just as we have
   declared LF for line endings (alternatively this could be
   configurable, but if we can get away with declaring utf8 the one true
   encoding, that cuts out a lot of corner cases).

 - if core.convertEncoding is true, then for any file with an
   encoding=foo attribute, internally run iconv(foo, utf8) in
   convert_to_git(), and likewise iconv(utf8, foo) in
   convert_to_working_tree.

 - I'm not sure if core.convertEncoding should be enabled by default. If
   it's a noop as long as there's no encoding attribute, then it's
   probably fine. But I would not want accidental conversion or any
   slowdown for the common case that the user wants no conversion.

 - I doubt we'd want a "core.autoEncoding" similar to "core.autocrlf". I
   don't think people consistently have all utf-16 files (the way they
   might have all CRLF files) rather a few files that must be utf-16.

 - I have actually seen two types of utf-16 in git repos in the wild:
   files which really must be utf-16 (because some tool demands it) and
   files which happen to be utf-16, but could just as easily be utf-8
   (and the user simply does not notice and commits utf-16, but doesn't
   realize it until much later when their diffs are unreadable).

   For the first case, the "encoding" thing above would work fine. For
   the second case, in theory we could have an option that takes any
   file with a "text" attribute and no "encoding" attribute, and
   converts it to utf-8.

   I suspect that's opening a can of worms for false positives similar
   to core.autocrlf. And performance drops as we try to guess the
   encoding and convert all incoming data.

   So I mention it mostly as a direction I think we probably _don't_
   want to go. Anybody with the "this could have been utf-8 all along"
   type of file can remedy it by converting and committing the result.

Omitting all of the "we shouldn't do this" bullet points, it seems
pretty simple and sane to me.

There is one other approach, which is to really store utf-16 in the
repository and better teach the diff tools to handle it (which are
really the main thing in git that cares about looking into the blob
contents). You can do this already with a textconv filter, but:

  1. It's slow (though cacheable).

  2. It doesn't work unless each repo configures the filter (so not on
     sites like GitHub, unless we define a micro-format that diff=utf16
     should be textconv'd on display, and get all implementations to
     respect that).

  3. Textconv patches look good, but can't be applied. This occasionally
     makes things awkward, depending on your workflow.

  4. You have to actually mark each file with an attribute, which is
     slightly annoying and more thing to remember to do (I see this from
     the server side, since people often commit utf-16 without any
     attributes, and then get annoyed when they see the file marked as
     binary).

We've toyed with the idea at GitHub of auto-detecting UTF-16 BOMs and
doing an "auto-textconv" to utf-8 (for our human-readable diffs only, of
course). That solves (1), (2), and (4), but leaves (3). I actually
looked into using libicu to do it not just for UTF-16, but to detect any
encoding. It turned out to be really slow, though. :)

So anyway, that is an alternate strategy, but I think I like "canonical
in-repo text is utf-8" approach a lot more, since then git operations
work consistently. There are still a few rough edges (e.g., I'm not sure
if you could apply a utf-8 patch directly to a utf-16 working tree file.
Certainly not using "patch", but I'm not sure how well "git apply" would
handle that case either). But I think it would mostly Just Work as long
as people were willing to set their encoding attributes.

-Peff