Re: [PATCH v7 0/7] convert: add support for different encodings

Junio C Hamano <gitster@xxxxxxxxx> · Thu, 15 Feb 2018 12:03:06 -0800

lars.schneider@xxxxxxxxxxxx writes:

> -- Git clients that do not support the `working-tree-encoding` attribute
> -  will checkout the respective files UTF-8 encoded and not in the
> -  expected encoding. Consequently, these files will appear different
> -  which typically causes trouble. This is in particular the case for
> -  older Git versions and alternative Git implementations such as JGit
> -  or libgit2 (as of February 2018).
> +- Third party Git implementations that do not support the
> +  `working-tree-encoding` attribute will checkout the respective files
> +  UTF-8 encoded and not in the expected encoding. Consequently, these
> +  files will appear different which typically causes trouble. This is
> +  in particular the case for older Git versions and alternative Git
> +  implementations such as JGit or libgit2 (as of February 2018).

I know somebody found "clients" misleading in the original, but the
ones that do not understand w-t-e do not have to be third party
reimplementations and imitations.  All existing Git implementations,
including ours, don't.

One thing I find more problematic is that the above places *too*
much stress on the UTF-8 centric worldview.  It is perfectly valid
to store your text contents encoded in ShiftJIS and check them out
as-is, with or without this patch.  It is grossly misleading to say
that older versions of Git will check them out in UTF-8.  "will
check out these files as-is without encoding conversion" is probably
a better way to say it.
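
To illustrate the difference, here is a rough Python sketch.  It is
not Git code; the two helper names are made up for this example, and
the second one assumes, as w-t-e does, that the blob is UTF-8.

```python
# Illustration only: these helpers are hypothetical, not Git internals.

def checkout_as_is(blob: bytes) -> bytes:
    """A Git that does not know w-t-e: bytes go out unchanged."""
    return blob


def checkout_with_wte(blob: bytes, working_tree_encoding: str) -> bytes:
    """A Git that knows w-t-e: assumes the blob is UTF-8 and converts."""
    return blob.decode("utf-8").encode(working_tree_encoding)


text = "日本語"

# A project that stores ShiftJIS in the repository gets ShiftJIS back,
# with or without the patch.  Nothing here is UTF-8.
sjis_blob = text.encode("shift_jis")
sjis_worktree = checkout_as_is(sjis_blob)

# A project using w-t-e stores UTF-8.  An older Git hands back UTF-8
# only because that is what the blob holds, not because it converts.
utf8_blob = text.encode("utf-8")
old_git_worktree = checkout_as_is(utf8_blob)
new_git_worktree = checkout_with_wte(utf8_blob, "shift_jis")
```

In both cases the older Git performs no conversion at all; the
bytes in the working tree are simply the bytes in the blob.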

Also notice that even in the world with w-t-e, such a project won't
benefit from w-t-e at all.  After all, they have been happy using
ShiftJIS in repository and using the same encoding on the working
tree, and because w-t-e assumes that everybody should be using UTF-8
internally, such a project cannot take advantage of the new
mechanism.

And from that point of view, perhaps w-t-e attribute is somewhat
misdesigned.

In general, an attribute is about the project's contents in a
manner independent of platform or environment.  You define "this
file is a C source" or "this file is a JPEG image" there.  What exact
program you use to present diffs between the two versions of such a
file (external diff command) or what exact program you use to
extract the textual representations (textconv filter) is environment
and platform dependent and is left to the configuration mechanism
for each repository.

To be in line with the above design principle, I think the attribute
ought to be "the in-tree contents of this path is encoded in ..."
whose values could be things like UTF-8, ShiftJIS, etc.  What
external encoding the paths should be checked out in is not a
project-wide matter, especially when talking about cross platform
projects.  Perhaps a project in the Japanese language wants to check
out its contents in EUC-JP on Unices and in ShiftJIS on DOS-derived
systems.  The participants all need to know what in-repository
encoding is used, which is a sensible use of attributes.  They also
need to know what the recommended external encoding to be used in
the working tree is for their platforms, but that is more like what
Makefile variable to set for their platforms, etc., and is not a
good match to the attributes system.
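
A rough Python sketch of that split, to show the shape of the idea.
None of these names exist in Git; the constant standing in for the
attribute and the table standing in for per-platform configuration
are both assumptions made up for this example.

```python
# Hypothetical sketch of the proposed split; not an existing Git API.

# Project-wide fact, the kind of thing an attribute would record:
# "the in-tree contents of this path are encoded in ...".
IN_TREE_ENCODING = "shift_jis"

# Per-platform preference, the kind of thing that would live in
# local configuration rather than in the attributes file.
PLATFORM_ENCODING = {"unix": "euc_jp", "dos": "shift_jis"}


def checkout(blob: bytes, platform: str) -> bytes:
    """Convert in-tree bytes to the platform's working tree encoding."""
    target = PLATFORM_ENCODING[platform]
    if target == IN_TREE_ENCODING:
        return blob  # same encoding on both sides, check out as-is
    return blob.decode(IN_TREE_ENCODING).encode(target)


def checkin(data: bytes, platform: str) -> bytes:
    """Convert working tree bytes back to the in-tree encoding."""
    source = PLATFORM_ENCODING[platform]
    if source == IN_TREE_ENCODING:
        return data
    return data.decode(source).encode(IN_TREE_ENCODING)


blob = "日本語".encode("shift_jis")
on_unix = checkout(blob, "unix")  # EUC-JP bytes
on_dos = checkout(blob, "dos")    # unchanged ShiftJIS bytes
```

With this arrangement everybody agrees on what is in the repository,
and each participant picks the working tree encoding locally.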