Re: [PATCH v6 5/7] convert: add 'working-tree-encoding' attribute

Torsten Bögershausen <tboegi@xxxxxx> · Sat, 10 Feb 2018 10:48:52 +0100

On Fri, Feb 09, 2018 at 02:28:28PM +0100, lars.schneider@xxxxxxxxxxxx wrote:
> From: Lars Schneider <larsxschneider@xxxxxxxxx>
> 
> Git recognizes files encoded with ASCII or one of its supersets (e.g.
> UTF-8 or ISO-8859-1) as text files. All other encodings are usually
> interpreted as binary and consequently built-in Git text processing
> tools (e.g. 'git diff') as well as most Git web front ends do not
> visualize the content.
> 
> Add an attribute to tell Git what encoding the user has defined for a
> given file. If the content is added to the index, then Git converts the
> content to a canonical UTF-8 representation. On checkout Git will
> reverse the conversion.
> 
> Signed-off-by: Lars Schneider <larsxschneider@xxxxxxxxx>
> ---
>  Documentation/gitattributes.txt  |  66 ++++++++++++
>  convert.c                        | 157 ++++++++++++++++++++++++++++-
>  convert.h                        |   1 +
>  sha1_file.c                      |   2 +-
>  t/t0028-working-tree-encoding.sh | 210 +++++++++++++++++++++++++++++++++++++++
>  5 files changed, 434 insertions(+), 2 deletions(-)
>  create mode 100755 t/t0028-working-tree-encoding.sh
> 
> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
> index 30687de81a..4ecdcd4859 100644
> --- a/Documentation/gitattributes.txt
> +++ b/Documentation/gitattributes.txt
> @@ -272,6 +272,72 @@ few exceptions.  Even though...
>    catch potential problems early, safety triggers.
>  
>  
> +`working-tree-encoding`
> +^^^^^^^^^^^^^^^^^^^^^^^
> +
> +Git recognizes files encoded with ASCII or one of its supersets (e.g.
> +UTF-8 or ISO-8859-1) as text files.  All other encodings are usually
> +interpreted as binary and consequently built-in Git text processing
> +tools (e.g. 'git diff') as well as most Git web front ends do not
> +visualize the content.
> +
> +In these cases you can tell Git the encoding of a file in the working
> +directory with the `working-tree-encoding` attribute. If a file with this
> +attribute is added to Git, then Git reencodes the content from the
> +specified encoding to UTF-8. Finally, Git stores the UTF-8 encoded
> +content in its internal data structure (called "the index"). On checkout
> +the content is reencoded back to the specified encoding.
> +
> +Please note that using the `working-tree-encoding` attribute may have a
> +number of pitfalls:
> +
> +- Git clients that do not support the `working-tree-encoding` attribute

A client to Git ?
Or may be "third party Git implementations"

> +  will checkout the respective files UTF-8 encoded and not in the
> +  expected encoding. Consequently, these files will appear different
> +  which typically causes trouble. This is in particular the case for
> +  older Git versions and alternative Git implementations such as JGit
> +  or libgit2 (as of February 2018).
> +
> +- Reencoding content requires resources that might slow down certain
> +  Git operations (e.g 'git checkout' or 'git add').
> +
> +Use the `working-tree-encoding` attribute only if you cannot store a file
> +in UTF-8 encoding and if you want Git to be able to process the content
> +as text.
> +
> +As an example, use the following attributes if your '*.proj' files are
> +UTF-16 encoded with byte order mark (BOM) and you want Git to perform
> +automatic line ending conversion based on your platform.
> +
> +------------------------
> +*.proj		text working-tree-encoding=UTF-16
> +------------------------
> +
> +Use the following attributes if your '*.proj' files are UTF-16 little
> +endian encoded without BOM and you want Git to use Windows line endings
> +in the working directory. Please note, it is highly recommended to
> +explicitly define the line endings with `eol` if the `working-tree-encoding`
> +attribute is used to avoid ambiguity.
> +
> +------------------------
> +*.proj 		working-tree-encoding=UTF-16LE text eol=CRLF
> +------------------------
> +
> +You can get a list of all available encodings on your platform with the
> +following command:

One question:
 +*.proj		text working-tree-encoding=UTF-16
vs
*.proj 		working-tree-encoding=UTF-16LE text eol=CRLF

Technically the order of attributes doesn't matter, but that is not what we
want to demonstrate here and now.
I would probably move the "text" attribute to the end of the line.
So that readers don't start to wonder if the order is important.

> +
> +------------------------
> +iconv --list
> +------------------------
> +
> +If you do not know the encoding of a file, then you can use the `file`
> +command to guess the encoding:
> +
> +------------------------
> +file foo.proj
> +------------------------
> +
> +
>  `ident`
>  ^^^^^^^
>  
> diff --git a/convert.c b/convert.c
> index b976eb968c..dc9e2db6b5 100644
> --- a/convert.c
> +++ b/convert.c
> @@ -7,6 +7,7 @@
>  #include "sigchain.h"
>  #include "pkt-line.h"
>  #include "sub-process.h"
> +#include "utf8.h"
>  
>  /*
>   * convert.c - convert a file when checking it out and checking it in.
> @@ -265,6 +266,110 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
>  
>  }
>  
> +static struct encoding {
> +	const char *name;
> +	struct encoding *next;
> +} *encoding, **encoding_tail;
> +static const char *default_encoding = "UTF-8";
> +
> +static int encode_to_git(const char *path, const char *src, size_t src_len,
> +			 struct strbuf *buf, struct encoding *enc, int conv_flags)
> +{
> +	char *dst;
> +	int dst_len;
> +
> +	/*
> +	 * No encoding is specified or there is nothing to encode.
> +	 * Tell the caller that the content was not modified.
> +	 */
> +	if (!enc || (src && !src_len))
> +		return 0;
> +
> +	/*
> +	 * Looks like we got called from "would_convert_to_git()".
> +	 * This means Git wants to know if it would encode (= modify!)
> +	 * the content. Let's answer with "yes", since an encoding was
> +	 * specified.
> +	 */
> +	if (!buf && !src)
> +		return 1;
> +
> +	if (has_prohibited_utf_bom(enc->name, src, src_len)) {
> +		const char *error_msg = _(
> +			"BOM is prohibited for '%s' if encoded as %s");
> +		const char *advise_msg = _(
> +			"You told Git to treat '%s' as %s. A byte order mark "
> +			"(BOM) is prohibited with this encoding. Either use "
> +			"%.6s as working tree encoding or remove the BOM from the "
> +			"file.");

"You told Git" is probly right from Gits point of view, and advises are really helpfull.
But what should the user do about it ?
Could we give a better advise ?

"A byte order mark (BOM) is prohibited with %s.
Please remove the BOM from the file %s 
or use "%s as working-tree-encoding"

I would probably suspect that a tool wrote the BOM, and that is
good and can or should not be changed by a user.

So a simply message like this could be the preferred (and only)
solution for a user:
"A byte order mark (BOM) is prohibited with %s.
Please use "%s as working-tree-encoding"

(And why %.6s and not simply %s ?)

No more comments for now, I didn't review the test cases.