> On 24 Nov 2017, at 19:04, Jeff King <peff@xxxxxxxx> wrote: > > On Thu, Nov 23, 2017 at 04:18:59PM +0100, Lars Schneider wrote: > >> Alternatively, I could add a native attribute to Git that translates UTF-16 >> to UTF-8 and back. A conversion function is already available in "mingw.h" [3] >> on Windows. Limiting this feature to Windows wouldn't be a problem from my >> point of view as UTF-16 is only relevant on Windows anyways. The attribute >> could look like this: >> >> *.txt text encoding=utf-16 >> >> There was a previous discussion on the topic and Jonathan already suggested >> a "native" clean/smudge filter in 2010 [4]. Also the "encoding" attribute >> is already present but, as far as I can tell, is only used by the git gui >> for viewing [5]. > > I would not want to see a proliferation of built-in filters, but it > really seems like text-encoding conversion is a broad and practical one > that many people might benefit from. So just like line-ending > conversion, which _could_ be done by generic filters, it makes sense to > me to support it natively for speed and simplicity. > >> Do you think a patch that converts UTF-16 files to UTF-8 via an attribute >> "encoding=utf-16" on Windows would have a chance to get accepted? > > You haven't fully specified the semantics here, so let me sketch out > what I think it ought to look like: Thank you :-) > - declare utf8 the "canonical" in-repo representation, just as we have > declared LF for line endings (alternatively this could be > configurable, but if we can get away with declaring utf8 the one true > encoding, that cuts out a lot of corner cases). 100% agree on UTF-8 as canonical representation and I wouldn't make that configurable as it would open a can of worms. > - if core.convertEncoding is true, then for any file with an > encoding=foo attribute, internally run iconv(foo, utf8) in > convert_to_git(), and likewise iconv(utf8, foo) in > convert_to_working_tree. > > - I'm not sure if core.convertEncoding should be enabled by default. If > it's a noop as long as there's no encoding attribute, then it's > probably fine. But I would not want accidental conversion or any > slowdown for the common case that the user wants no conversion. I think we should mimic the behavior of "eol=crlf/lf" attribute. AFAIK whenever I set "*.ext text eol=crlf", then I can be sure the file is checked out with CRLF independent of any of my local config settings. Isn't that correct? I would expect a similar behavior if "*.ext text encoding=utf16" is set. Wouldn't that mean that we do not need a "core.convertEncoding" config? I do 100% agree that it must be a noop if there is no encoding attribute present. > - I doubt we'd want a "core.autoEncoding" similar to "core.autocrlf". I > don't think people consistently have all utf-16 files (the way they > might have all CRLF files) rather a few files that must be utf-16. Agreed! > - I have actually seen two types of utf-16 in git repos in the wild: > files which really must be utf-16 (because some tool demands it) and > files which happen to be utf-16, but could just as easily be utf-8 > (and the user simply does not notice and commits utf-16, but doesn't > realize it until much later when their diffs are unreadable). > > For the first case, the "encoding" thing above would work fine. For > the second case, in theory we could have an option that takes any > file with a "text" attribute and no "encoding" attribute, and > converts it to utf-8. TBH I would label that a "non-goal". Because AFAIK we cannot reliability detect the encoding automatically. > I suspect that's opening a can of worms for false positives similar > to core.autocrlf. And performance drops as we try to guess the > encoding and convert all incoming data. > > So I mention it mostly as a direction I think we probably _don't_ > want to go. Anybody with the "this could have been utf-8 all along" > type of file can remedy it by converting and committing the result. 100 % agree. > Omitting all of the "we shouldn't do this" bullet points, it seems > pretty simple and sane to me. > > There is one other approach, which is to really store utf-16 in the > repository and better teach the diff tools to handle it (which are > really the main thing in git that cares about looking into the blob > contents). You can do this already with a textconv filter, but: > > 1. It's slow (though cacheable). > > 2. It doesn't work unless each repo configures the filter (so not on > sites like GitHub, unless we define a micro-format that diff=utf16 > should be textconv'd on display, and get all implementations to > respect that). Actually, rendering diffs on Git hosting sites such as GitHub is one of my goals. Therefore, storing content as UTF-16 wouldn't be a solution for me. > 3. Textconv patches look good, but can't be applied. This occasionally > makes things awkward, depending on your workflow. TBH I dont't understand what you mean here. What do you mean with "textconv patches"? > 4. You have to actually mark each file with an attribute, which is > slightly annoying and more thing to remember to do (I see this from > the server side, since people often commit utf-16 without any > attributes, and then get annoyed when they see the file marked as > binary). > > We've toyed with the idea at GitHub of auto-detecting UTF-16 BOMs and > doing an "auto-textconv" to utf-8 (for our human-readable diffs only, of > course). That solves (1), (2), and (4), but leaves (3). I actually > looked into using libicu to do it not just for UTF-16, but to detect any > encoding. It turned out to be really slow, though. :) > > So anyway, that is an alternate strategy, but I think I like "canonical > in-repo text is utf-8" approach a lot more, since then git operations > work consistently. There are still a few rough edges (e.g., I'm not sure > if you could apply a utf-8 patch directly to a utf-16 working tree file. > Certainly not using "patch", but I'm not sure how well "git apply" would > handle that case either). But I think it would mostly Just Work as long > as people were willing to set their encoding attributes. Agreed! Thanks for your thoughts! I'll try to come up with a v1 of this idea. - Lars