Re: [PATCH v7 0/7] convert: add support for different encodings

Lars Schneider <larsxschneider@xxxxxxxxx> · Fri, 16 Feb 2018 15:42:35 +0100

> On 15 Feb 2018, at 21:03, Junio C Hamano <gitster@xxxxxxxxx> wrote:
> 
> lars.schneider@xxxxxxxxxxxx writes:
> 
>> -- Git clients that do not support the `working-tree-encoding` attribute
>> -  will checkout the respective files UTF-8 encoded and not in the
>> -  expected encoding. Consequently, these files will appear different
>> -  which typically causes trouble. This is in particular the case for
>> -  older Git versions and alternative Git implementations such as JGit
>> -  or libgit2 (as of February 2018).
>> +- Third party Git implementations that do not support the
>> +  `working-tree-encoding` attribute will checkout the respective files
>> +  UTF-8 encoded and not in the expected encoding. Consequently, these
>> +  files will appear different which typically causes trouble. This is
>> +  in particular the case for older Git versions and alternative Git
>> +  implementations such as JGit or libgit2 (as of February 2018).
> 
> I know somebody found "clients" misleading in the original, but the
> ones that do not understand w-t-e do not have to be third party
> reimplementations and imitations.  All existing Git implementations,
> including ours, don't.

Agreed!

> One thing I find more problematic is that the above places *too*
> much stress on the UTF-8 centric worldview.  It is perfectly valid
> to store your text contents encoded in ShiftJIS and check them out
> as-is, with or without this patch.  It is grossly misleading to say
> that older versions of Git will check them out in UTF-8.  "will
> checkout these files as-is without encoding conversion" is a better
> way to say it, probably.

True. But that's not what I wanted to say in the "pitfalls" section.
If my Git client supports w-t-e and I add the ShiftJIS encoded
file "foo.bar" to my repository, then Git will store the file as
UTF-8 _internally_. That means if you clone my repository and your 
Git client does _not_ support w-t-e, then you will see "foo.bar" as 
UTF-8 encoded.

> Also notice that even in the world with w-t-e, such a project won't
> benefit from w-t-e at all.  After all, they have been happy using
> ShiftJIS in repository and using the same encoding on the working
> tree, and because w-t-e assumes that everybody should be using UTF-8
> internally, such a project cannot take advantage of the new
> mechanism.

Agreed. However, people using ShiftJIS are not my target audience.
My target audience are:

(1) People that have to encode their text files in UTF-16 (for 
    whatever reason - usually because of legacy processes or tools).

(2) People that want to see textual diffs of their UTF-16 encoded 
    files in their Git tools without special adjustments (on the 
    command line, on the web, etc).

That was my primary motivation. The fact that w-t-e supports any
other encoding too is just a nice side effect. I don't foresee people
using other w-t-encodings other than UTF-16 in my organization.

I have the suspicion that the feature could be useful for the Git
community at large. Consider this Stack Overflow question:
https://stackoverflow.com/questions/777949/can-i-make-git-recognize-a-utf-16-file-as-text

This question was viewed 42k times and there is no good solution.
I believe w-t-e could be a good solution.

With your comments in mind, I tried to improve the text like this:

    Git recognizes files encoded with ASCII or one of its supersets (e.g.
    UTF-8, ISO-8859-1, ...) as text files.  Files encoded with certain other
    encodings (e.g. UTF-16) are interpreted as binary and consequently
    built-in Git text processing tools (e.g. 'git diff') as well as most Git
    web front ends do not visualize the contents of these files by default.

    ...

    Please note that using the `working-tree-encoding` attribute may have a
    number of pitfalls:

    - Alternative Git implementations (e.g. JGit or libgit2) and older Git 
      versions (as of February 2018) do not support the `working-tree-encoding`
      attribute. If you decide to use the `working-tree-encoding` attribute
      in your repository, then it is strongly recommended to ensure that all
      clients working with the repository support it.

      If you declare `*.proj` files as UTF-16 and you add `foo.proj` with an
      `working-tree-encoding` enabled Git client, then `foo.proj` will be
      stored as UTF-8 internally. A client without `working-tree-encoding`
      support will checkout `foo.proj` as UTF-8 encoded file. This will
      typically cause trouble for the users of this file.

      If a Git client, that does not support the `working-tree-encoding`
      attribute, adds a new file `bar.proj`, then `bar.proj` will be
      stored "as-is" internally (in this example probably as UTF-16). 
      A client with `working-tree-encoding` support will interpret the 
      internal contents as UTF-8 and try to convert it to UTF-16 on checkout.
      That operation will fail and cause an error.

    ...

> And from that point of view, perhaps w-t-e attribute is somewhat
> misdesigned.
> 
> In general, an attribute is about the project's contents in the
> manner independent of platform or environment.  You define "this
> file is a C source" or "this file has JPEG image" there.  What exact
> program you use to present diffs between the two versions of such a
> file (external diff command) or what exact program you use to
> extract the textual representations (textconv filter) is environment
> and platform dependent and is left to the configuration mechanism
> for each repository.

> To be in line with the above design principle, I think the attribute
> ought to be "the in-tree contents of this path is encoded in ..."
> whose values could be things like UTF-8, ShiftJIS, etc.  What
> external encoding the paths should be checked out is not a
> project-wide matter, especially when talking about cross platform
> projects.  Perhaps a project in Japanese language wants to check
> out its contents in EUC-jp on Unices and in ShiftJIS on DOS derived
> systems.  The participants all need to know what in-repository
> encoding is used, which is a sensible use of attributes.  They also
> need to know what the recommended external encoding to be used in
> the working tree is for their platforms, but that is more like what
> Makefile variable to set for their platforms, etc., and is not a
> good match to the attributes system.

As mentioned above, this is not my intended usecase here. As Peff
mentioned elsewhere "always check this out in _this_ encoding"
is the goal here.

Thanks a lot for feedback,
Lars