Re: [PATCH v2 4/5] convert: add support for 'checkout-encoding' attribute

Torsten Bögershausen <tboegi@xxxxxx> · Fri, 29 Dec 2017 18:28:58 +0100

I probably need to look at convert.c more closer, some other comments inline.

On Fri, Dec 29, 2017 at 04:22:21PM +0100, lars.schneider@xxxxxxxxxxxx wrote:
> From: Lars Schneider <larsxschneider@xxxxxxxxx>
> 
> Git and its tools (e.g. git diff) expect all text files in UTF-8
> encoding. Git will happily accept content in all other encodings, too,
> but it might not be able to process the text (e.g. viewing diffs or
> changing line endings).

UTF-8 is too strict, the text from below is more correct:
 +Git recognizes files encoded with ASCII or one of its supersets (e.g.
 +UTF-8 or ISO-8859-1) as text files.  All other encodings are usually
 +interpreted as binary and consequently built-in Git text processing
 +tools (e.g. 'git diff') as well as most Git web front ends do not
 +visualize the content.

> 
> Add an attribute to tell Git what encoding the user has defined for a
> given file. If the content is added to the index, then Git converts the
> content to a canonical UTF-8 representation. On checkout Git will

Minor question about "canonical":
Would this mean the same ?
 ...then Git converts the content into  UTF-8.

> reverse the conversion.
> 
> Signed-off-by: Lars Schneider <larsxschneider@xxxxxxxxx>
> ---
>  Documentation/gitattributes.txt |  59 ++++++++++++
>  apply.c                         |   2 +-
>  blame.c                         |   2 +-
>  combine-diff.c                  |   2 +-
>  convert.c                       | 196 ++++++++++++++++++++++++++++++++++++++-
>  convert.h                       |   8 +-
>  diff.c                          |   2 +-
>  sha1_file.c                     |   5 +-
>  t/t0028-checkout-encoding.sh    | 197 ++++++++++++++++++++++++++++++++++++++++
>  9 files changed, 460 insertions(+), 13 deletions(-)
>  create mode 100755 t/t0028-checkout-encoding.sh
> 
> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
> index 30687de81a..0039bd38c3 100644
> --- a/Documentation/gitattributes.txt
> +++ b/Documentation/gitattributes.txt
> @@ -272,6 +272,65 @@ few exceptions.  Even though...
>    catch potential problems early, safety triggers.
>  
>  
> +`checkout-encoding`
> +^^^^^^^^^^^^^^^^^^^
> +
> +Git recognizes files encoded with ASCII or one of its supersets (e.g.
> +UTF-8 or ISO-8859-1) as text files.  All other encodings are usually
> +interpreted as binary and consequently built-in Git text processing
> +tools (e.g. 'git diff') as well as most Git web front ends do not
> +visualize the content.
> +

> +In these cases you can teach Git the encoding of a file in the working
                          teach ? tell ? 
> +directory with the `checkout-encoding` attribute. If a file with this
> +attributes is added to Git, then Git reencodes the content from the
> +specified encoding to UTF-8 and stores the result in its internal data
> +structure.

Minor Q:
> its internal data structure.
Should we simply write "stores the result in the index" ?

> On checkout the content is encoded back to the specified
> +encoding.

> +
> +Please note that using the `checkout-encoding` attribute has a number
> +of drawbacks:
> +
> +- Reencoding content to non-UTF encodings (e.g. SHIFT-JIS) can cause
> +  errors as the conversion might not be round trip safe.
> +
> +- Reencoding content requires resources that might slow down certain
> +  Git operations (e.g 'git checkout' or 'git add').
> +
> +- Git clients that do not support the `checkout-encoding` attribute or
> +  the used encoding will checkout the respective files as UTF-8 encoded.
> +  That means the content appears to be different which could cause
> +  trouble. Affected clients are older Git versions and alternative Git
> +  implementations such as JGit or libgit2 (as of January 2018).
> +
> +Use the `checkout-encoding` attribute only if you cannot store a file in
> +UTF-8 encoding and if you want Git to be able to process the content as
> +text.
> +

I would maybe rephrase a little bit (first things first):

Please note that using the `checkout-encoding` attribute may have a number
of pitfalls:

- Git clients that do not support the `checkout-encoding` attribute
  will checkout the respective files as UTF-8 encoded,  which typically
  causes trouble.
  Escpecialy when older Git versions are used or alternative Git
  implementations such as JGit or libgit2 (as of January 2018).

- Reencoding content to non-UTF encodings (e.g. SHIFT-JIS) can cause
  errors as the conversion might not be round trip safe.

- Reencoding content requires resources that might slow down certain
  Git operations (e.g 'git checkout' or 'git add').

Use the `checkout-encoding` attribute only if you cannot store a file in
UTF-8 encoding and if you want Git to be able to process the content as
text.

-----
Side question: What happens if "the used encoding" is not supported by
the underlying iconv lib ?
Will Git fail, delete the file from the working tree ?
That may be worth to mention. (And I need to do the code-review)

> +Use the following attributes if your '*.txt' files are UTF-16 encoded
> +with byte order mark (BOM) and you want Git to perform automatic line
> +ending conversion based on your platform.
> +
> +------------------------
> +*.txt		text checkout-encoding=UTF-16
> +------------------------
> +
> +Use the following attributes if your '*.txt' files are UTF-16 little
> +endian encoded without BOM and you want Git to use Windows line endings
> +in the working directory.
> +
> +------------------------
> +*.txt 		checkout-encoding=UTF-16LE text eol=CRLF
> +------------------------
> +
> +You can get a list of all available encodings on your platform with the
> +following command:
> +
> +------------------------
> +iconv --list
> +------------------------
> +
> +
>  `ident`
>  ^^^^^^^
>  
> diff --git a/apply.c b/apply.c
> index 321a9fa68d..c4bd5cf1f2 100644
> --- a/apply.c
> +++ b/apply.c
> @@ -2281,7 +2281,7 @@ static int read_old_data(struct stat *st, struct patch *patch,
>  		 * should never look at the index when explicit crlf option
>  		 * is given.
>  		 */
> -		convert_to_git(NULL, path, buf->buf, buf->len, buf, safe_crlf);
> +		convert_to_git(NULL, path, buf->buf, buf->len, buf, safe_crlf, 0);

Hm. Do we really need another parameter here?
The only caller that needs it (and sets it to 1) is sha1_file.c.
When an invalid encoding/content is used, it should be safe to die() always?
Just loud thinking..

If really needed, it may need less changes in the code base, if a new function
convert_to_git_or_die() is defined, which is called by convert_to_git()

(and the other alternative would be to convert safe_crlf into a bitmap,
 see https://public-inbox.org/git/20171229132828.17594-1-tboegi@xxxxxx/T/#u
 what do you think ?)

>  		return 0;
>  	default:
>  		return -1;
> diff --git a/blame.c b/blame.c
> index 2893f3c103..388b66897b 100644
> --- a/blame.c
> +++ b/blame.c
> @@ -229,7 +229,7 @@ static struct commit *fake_working_tree_commit(struct diff_options *opt,
>  		if (strbuf_read(&buf, 0, 0) < 0)
>  			die_errno("failed to read from stdin");
>  	}
> -	convert_to_git(&the_index, path, buf.buf, buf.len, &buf, 0);
> +	convert_to_git(&the_index, path, buf.buf, buf.len, &buf, 0, 0);
>  	origin->file.ptr = buf.buf;
>  	origin->file.size = buf.len;
>  	pretend_sha1_file(buf.buf, buf.len, OBJ_BLOB, origin->blob_oid.hash);
> diff --git a/combine-diff.c b/combine-diff.c
> index 2505de119a..4555e49b5f 100644
> --- a/combine-diff.c
> +++ b/combine-diff.c
> @@ -1053,7 +1053,7 @@ static void show_patch_diff(struct combine_diff_path *elem, int num_parent,
>  			if (is_file) {
>  				struct strbuf buf = STRBUF_INIT;
>  
> -				if (convert_to_git(&the_index, elem->path, result, len, &buf, safe_crlf)) {
> +				if (convert_to_git(&the_index, elem->path, result, len, &buf, safe_crlf, 0)) {
>  					free(result);
>  					result = strbuf_detach(&buf, &len);
>  					result_size = len;
> diff --git a/convert.c b/convert.c
> index 20d7ab67bd..fc8c96b670 100644
> --- a/convert.c
> +++ b/convert.c
> @@ -7,6 +7,7 @@
>  #include "sigchain.h"
>  #include "pkt-line.h"
>  #include "sub-process.h"
> +#include "utf8.h"
>  
>  /*
>   * convert.c - convert a file when checking it out and checking it in.
> @@ -256,6 +257,147 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
>  
>  }
>  
> +static struct encoding {
> +	const char *name;
> +	struct encoding *next;
> +} *encoding, **encoding_tail;
> +static const char *default_encoding = "UTF-8";
> +
> +static int encode_to_git(const char *path, const char *src, size_t src_len,
> +			 struct strbuf *buf, struct encoding *enc, int write_obj)
> +{
> +	char *dst;
> +	int dst_len;
> +
> +	/*
> +	 * No encoding is specified or there is nothing to encode.
> +	 * Tell the caller that the content was not modified.
> +	 */
> +	if (!enc || (src && !src_len))
> +		return 0;
> +
> +	/*
> +	 * Looks like we got called from "would_convert_to_git()".
> +	 * This means Git wants to know if it would encode (= modify!)
> +	 * the content. Let's answer with "yes", since an encoding was
> +	 * specified.
> +	 */
> +	if (!buf && !src)
> +		return 1;
> +
> +	if (has_prohibited_utf_bom(enc->name, src, src_len)) {
> +		const char *error_msg = _(
> +			"BOM is prohibited for '%s' if encoded as %s");
> +		const char *advise_msg = _(
> +			"You told Git to treat '%s' as %s. A byte order mark "
> +			"(BOM) is prohibited with this encoding. Either use "
> +			"%.6s as checkout encoding or remove the BOM from the "
> +			"file.");
> +
> +		advise(advise_msg, path, enc->name, enc->name, enc->name);
> +		if (write_obj)
> +			die(error_msg, path, enc->name);
> +		else
> +			error(error_msg, path, enc->name);

As said before, I would just die(). Or do I miss something ?
Being strict with BOMs seams like a good idea.

> +
> +
> +	} else if (has_missing_utf_bom(enc->name, src, src_len)) {
> +		const char *error_msg = _(
> +			"BOM is required for '%s' if encoded as %s");
> +		const char *advise_msg = _(
> +			"You told Git to treat '%s' as %s. A byte order mark "
> +			"(BOM) is required with this encoding. Either use "
> +			"%sBE/%sLE as checkout encoding or add a BOM to the "
> +			"file.");
> +		advise(advise_msg, path, enc->name, enc->name, enc->name);
> +		if (write_obj)
> +			die(error_msg, path, enc->name);
> +		else
> +			error(error_msg, path, enc->name);
> +	}
> +
> +	dst = reencode_string_len(src, src_len, default_encoding, enc->name,
> +				  &dst_len);
> +	if (!dst) {
> +		/*
> +		 * We could add the blob "as-is" to Git. However, on checkout
> +		 * we would try to reencode to the original encoding. This
> +		 * would fail and we would leave the user with a messed-up
> +		 * working tree. Let's try to avoid this by screaming loud.
> +		 */
> +		const char* msg = _("failed to encode '%s' from %s to %s");
> +		if (write_obj)
> +			die(msg, path, enc->name, default_encoding);
> +		else
> +			error(msg, path, enc->name, default_encoding);
> +	}
> +
> +	/*
> +	 * UTF supports lossless round tripping [1]. UTF to other encoding are
> +	 * mostly round trip safe as Unicode aims to be a superset of all other
> +	 * character encodings. However, the SHIFT-JIS (Japanese character set)
> +	 * is an exception as some codes are not round trip safe [2].
> +	 *
> +	 * Reverse the transformation of 'dst' and check the result with 'src'
> +	 * if content is written to Git. This ensures no information is lost
> +	 * during conversion to/from UTF-8.
> +	 *
> +	 * Please note, the code below is not tested because I was not able to
> +	 * generate a faulty round trip without iconv error.
> +	 *
> +	 * [1] http://unicode.org/faq/utf_bom.html#gen2
> +	 * [2] https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode
> +	 */
> +	if (write_obj && !strcmp(enc->name, "SHIFT-JIS")) {
> +		char *re_src;
> +		int re_src_len;
> +
> +		re_src = reencode_string_len(dst, dst_len,
> +					     enc->name, default_encoding,
> +					     &re_src_len);
> +
> +		if (!re_src || src_len != re_src_len ||
> +		    memcmp(src, re_src, src_len)) {
> +			const char* msg = _("encoding '%s' from %s to %s and "
> +					    "back is not the same");
> +			if (write_obj)
> +				die(msg, path, enc->name, default_encoding);
> +			else
> +				error(msg, path, enc->name, default_encoding);
> +		}
> +
> +		free(re_src);
> +	}
> +
> +	strbuf_attach(buf, dst, dst_len, dst_len + 1);
> +	return 1;
> +}
> +
> +static int encode_to_worktree(const char *path, const char *src, size_t src_len,
> +			      struct strbuf *buf, struct encoding *enc)
> +{
> +	char *dst;
> +	int dst_len;
> +
> +	/*
> +	 * No encoding is specified or there is nothing to encode.
> +	 * Tell the caller that the content was not modified.
> +	 */
> +	if (!enc || (src && !src_len))
> +		return 0;
> +
> +	dst = reencode_string_len(src, src_len, enc->name, default_encoding,
> +				  &dst_len);
> +	if (!dst) {
> +		error("failed to encode '%s' from %s to %s",
> +			path, enc->name, default_encoding);
> +		return 0;
> +	}
> +
> +	strbuf_attach(buf, dst, dst_len, dst_len + 1);
> +	return 1;
> +}
> +
>  static int crlf_to_git(const struct index_state *istate,
>  		       const char *path, const char *src, size_t len,
>  		       struct strbuf *buf,
> @@ -969,6 +1111,31 @@ static int ident_to_worktree(const char *path, const char *src, size_t len,
>  	return 1;
>  }
>  
> +static struct encoding *git_path_check_encoding(struct attr_check_item *check)
> +{
> +	const char *value = check->value;
> +	struct encoding *enc;
> +
> +	if (ATTR_TRUE(value) || ATTR_FALSE(value) || ATTR_UNSET(value) ||
> +	    !strlen(value))
> +		return NULL;
> +
> +	for (enc = encoding; enc; enc = enc->next)
> +		if (!strcasecmp(value, enc->name))
> +			return enc;
> +
> +	/* Don't encode to the default encoding */
> +	if (!strcasecmp(value, default_encoding))
> +		return NULL;
> +
> +	enc = xcalloc(1, sizeof(struct convert_driver));
> +	enc->name = xstrdup_toupper(value);  /* aways use upper case names! */
> +	*encoding_tail = enc;
> +	encoding_tail = &(enc->next);
> +
> +	return enc;
> +}
> +
>  static enum crlf_action git_path_check_crlf(struct attr_check_item *check)
>  {
>  	const char *value = check->value;
> @@ -1024,6 +1191,7 @@ struct conv_attrs {
>  	enum crlf_action attr_action; /* What attr says */
>  	enum crlf_action crlf_action; /* When no attr is set, use core.autocrlf */
>  	int ident;
> +	struct encoding *checkout_encoding; /* Supported encoding or default encoding if NULL */
>  };
>  
>  static void convert_attrs(struct conv_attrs *ca, const char *path)
> @@ -1032,8 +1200,10 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
>  
>  	if (!check) {
>  		check = attr_check_initl("crlf", "ident", "filter",
> -					 "eol", "text", NULL);
> +					 "eol", "text", "checkout-encoding",
> +					 NULL);
>  		user_convert_tail = &user_convert;
> +		encoding_tail = &encoding;
>  		git_config(read_convert_config, NULL);
>  	}
>  
> @@ -1055,6 +1225,7 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
>  			else if (eol_attr == EOL_CRLF)
>  				ca->crlf_action = CRLF_TEXT_CRLF;
>  		}
> +		ca->checkout_encoding = git_path_check_encoding(ccheck + 5);
>  	} else {
>  		ca->drv = NULL;
>  		ca->crlf_action = CRLF_UNDEFINED;
> @@ -1120,7 +1291,7 @@ const char *get_convert_attr_ascii(const char *path)
>  
>  int convert_to_git(const struct index_state *istate,
>  		   const char *path, const char *src, size_t len,
> -                   struct strbuf *dst, enum safe_crlf checksafe)
> +		   struct strbuf *dst, enum safe_crlf checksafe, int write_obj)
>  {
>  	int ret = 0;
>  	struct conv_attrs ca;
> @@ -1135,6 +1306,13 @@ int convert_to_git(const struct index_state *istate,
>  		src = dst->buf;
>  		len = dst->len;
>  	}
> +
> +	ret |= encode_to_git(path, src, len, dst, ca.checkout_encoding, write_obj);
> +	if (ret && dst) {
> +		src = dst->buf;
> +		len = dst->len;
> +	}
> +
>  	if (checksafe != SAFE_CRLF_KEEP_CRLF) {
>  		ret |= crlf_to_git(istate, path, src, len, dst, ca.crlf_action, checksafe);
>  		if (ret && dst) {
> @@ -1147,7 +1325,7 @@ int convert_to_git(const struct index_state *istate,
>  
>  void convert_to_git_filter_fd(const struct index_state *istate,
>  			      const char *path, int fd, struct strbuf *dst,
> -			      enum safe_crlf checksafe)
> +			      enum safe_crlf checksafe, int write_obj)
>  {
>  	struct conv_attrs ca;
>  	convert_attrs(&ca, path);
> @@ -1158,6 +1336,7 @@ void convert_to_git_filter_fd(const struct index_state *istate,
>  	if (!apply_filter(path, NULL, 0, fd, dst, ca.drv, CAP_CLEAN, NULL))
>  		die("%s: clean filter '%s' failed", path, ca.drv->name);
>  
> +	encode_to_git(path, dst->buf, dst->len, dst, ca.checkout_encoding, write_obj);
>  	crlf_to_git(istate, path, dst->buf, dst->len, dst, ca.crlf_action, checksafe);
>  	ident_to_git(path, dst->buf, dst->len, dst, ca.ident);
>  }
> @@ -1189,6 +1368,12 @@ static int convert_to_working_tree_internal(const char *path, const char *src,
>  		}
>  	}
>  
> +	ret |= encode_to_worktree(path, src, len, dst, ca.checkout_encoding);
> +	if (ret) {
> +		src = dst->buf;
> +		len = dst->len;
> +	}
> +
>  	ret_filter = apply_filter(
>  		path, src, len, -1, dst, ca.drv, CAP_SMUDGE, dco);
>  	if (!ret_filter && ca.drv && ca.drv->required)
> @@ -1217,7 +1402,7 @@ int renormalize_buffer(const struct index_state *istate, const char *path,
>  		src = dst->buf;
>  		len = dst->len;
>  	}
> -	return ret | convert_to_git(istate, path, src, len, dst, SAFE_CRLF_RENORMALIZE);
> +	return ret | convert_to_git(istate, path, src, len, dst, SAFE_CRLF_RENORMALIZE, 0);
>  }
>  
>  /*****************************************************************
> @@ -1655,6 +1840,9 @@ struct stream_filter *get_stream_filter(const char *path, const unsigned char *s
>  	if (ca.drv && (ca.drv->process || ca.drv->smudge || ca.drv->clean))
>  		return NULL;
>  
> +	if (ca.checkout_encoding)
> +		return NULL;
> +
>  	if (ca.crlf_action == CRLF_AUTO || ca.crlf_action == CRLF_AUTO_CRLF)
>  		return NULL;
>  
> diff --git a/convert.h b/convert.h
> index 4f2da225a8..9e4e884ec1 100644
> --- a/convert.h
> +++ b/convert.h
> @@ -66,7 +66,8 @@ extern const char *get_convert_attr_ascii(const char *path);
>  /* returns 1 if *dst was used */
>  extern int convert_to_git(const struct index_state *istate,
>  			  const char *path, const char *src, size_t len,
> -			  struct strbuf *dst, enum safe_crlf checksafe);
> +			  struct strbuf *dst, enum safe_crlf checksafe,
> +			  int write_obj);
>  extern int convert_to_working_tree(const char *path, const char *src,
>  				   size_t len, struct strbuf *dst);
>  extern int async_convert_to_working_tree(const char *path, const char *src,
> @@ -79,13 +80,14 @@ extern int renormalize_buffer(const struct index_state *istate,
>  static inline int would_convert_to_git(const struct index_state *istate,
>  				       const char *path)
>  {
> -	return convert_to_git(istate, path, NULL, 0, NULL, 0);
> +	return convert_to_git(istate, path, NULL, 0, NULL, 0, 0);
>  }
>  /* Precondition: would_convert_to_git_filter_fd(path) == true */
>  extern void convert_to_git_filter_fd(const struct index_state *istate,
>  				     const char *path, int fd,
>  				     struct strbuf *dst,
> -				     enum safe_crlf checksafe);
> +				     enum safe_crlf checksafe,
> +				     int write_obj);
>  extern int would_convert_to_git_filter_fd(const char *path);
>  
>  /*****************************************************************
> diff --git a/diff.c b/diff.c
> index 2ebe2227b4..16ca0bf0df 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -3599,7 +3599,7 @@ int diff_populate_filespec(struct diff_filespec *s, unsigned int flags)
>  		/*
>  		 * Convert from working tree format to canonical git format
>  		 */
> -		if (convert_to_git(&the_index, s->path, s->data, s->size, &buf, crlf_warn)) {
> +		if (convert_to_git(&the_index, s->path, s->data, s->size, &buf, crlf_warn, 0)) {
>  			size_t size = 0;
>  			munmap(s->data, s->size);
>  			s->should_munmap = 0;
> diff --git a/sha1_file.c b/sha1_file.c
> index afe4b90f6e..75800248d2 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -1694,7 +1694,8 @@ static int index_mem(struct object_id *oid, void *buf, size_t size,
>  	if ((type == OBJ_BLOB) && path) {
>  		struct strbuf nbuf = STRBUF_INIT;
>  		if (convert_to_git(&the_index, path, buf, size, &nbuf,
> -				   get_safe_crlf(flags))) {
> +				   get_safe_crlf(flags),
> +				   write_object)) {
>  			buf = strbuf_detach(&nbuf, &size);
>  			re_allocated = 1;
>  		}
> @@ -1728,7 +1729,7 @@ static int index_stream_convert_blob(struct object_id *oid, int fd,
>  	assert(would_convert_to_git_filter_fd(path));
>  
>  	convert_to_git_filter_fd(&the_index, path, fd, &sbuf,
> -				 get_safe_crlf(flags));
> +				 get_safe_crlf(flags), write_object);
>  
>  	if (write_object)
>  		ret = write_sha1_file(sbuf.buf, sbuf.len, typename(OBJ_BLOB),
> diff --git a/t/t0028-checkout-encoding.sh b/t/t0028-checkout-encoding.sh
> new file mode 100755
> index 0000000000..1a329ab933
> --- /dev/null
> +++ b/t/t0028-checkout-encoding.sh
> @@ -0,0 +1,197 @@
> +#!/bin/sh
> +
> +test_description='checkout-encoding conversion via gitattributes'
> +
> +. ./test-lib.sh
> +
> +test_expect_success 'setup test repo' '
> +
> +	text="hallo there!\ncan you read me?" &&

Is this portable ? (the "\n")

> +
> +	echo "*.utf16 text checkout-encoding=utf-16" >.gitattributes &&
> +
> +	printf "$text" >test.utf8.raw &&
> +	printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
> +	cp test.utf16.raw test.utf16 &&
> +
> +	git add .gitattributes test.utf16 &&
> +	git commit -m initial
> +'
> +
> +test_expect_success 'ensure UTF-8 is stored in Git' '
> +	git cat-file -p :test.utf16 >test.utf16.git &&
> +	test_cmp_bin test.utf8.raw test.utf16.git &&
> +	rm test.utf8.raw test.utf16.git
> +'
> +
> +test_expect_success 're-encode to UTF-16 on checkout' '
> +	rm test.utf16 &&
> +	git checkout test.utf16 &&
> +	test_cmp_bin test.utf16.raw test.utf16 &&
> +
> +	# cleanup
> +	rm test.utf16.raw
> +'
> +
> +test_expect_success 'check prohibited UTF BOM' '
> +	printf "\0a\0b\0c"                         >nobom.utf16be.raw &&
> +	printf "a\0b\0c\0"                         >nobom.utf16le.raw &&
> +	printf "\376\777\0a\0b\0c"                 >bebom.utf16be.raw &&
> +	printf "\777\376a\0b\0c\0"                 >lebom.utf16le.raw &&
> +
> +	printf "\0\0\0a\0\0\0b\0\0\0c"             >nobom.utf32be.raw &&
> +	printf "a\0\0\0b\0\0\0c\0\0\0"             >nobom.utf32le.raw &&
> +	printf "\0\0\376\777\0\0\0a\0\0\0b\0\0\0c" >bebom.utf32be.raw &&
> +	printf "\777\376\0\0a\0\0\0b\0\0\0c\0\0\0" >lebom.utf32le.raw &&
> +
> +	echo "*.utf16be text checkout-encoding=utf-16be" >>.gitattributes &&
> +	echo "*.utf16le text checkout-encoding=utf-16le" >>.gitattributes &&
> +	echo "*.utf32be text checkout-encoding=utf-32be" >>.gitattributes &&
> +	echo "*.utf32le text checkout-encoding=utf-32le" >>.gitattributes &&
> +
> +	# Here we add a UTF-16 files with BOM (big-endian and little-endian)
> +	# but we tell Git to treat it as UTF-16BE/UTF-16LE. In these cases
> +	# the BOM is prohibited.
> +	cp bebom.utf16be.raw bebom.utf16be &&
> +	test_must_fail git add bebom.utf16be 2>err.out &&
> +	test_i18ngrep "fatal: BOM is prohibited .* UTF-16BE" err.out &&
> +
> +	cp lebom.utf16le.raw lebom.utf16be &&
> +	test_must_fail git add lebom.utf16be 2>err.out &&
> +	test_i18ngrep "fatal: BOM is prohibited .* UTF-16BE" err.out &&
> +
> +	cp bebom.utf16be.raw bebom.utf16le &&
> +	test_must_fail git add bebom.utf16le 2>err.out &&
> +	test_i18ngrep "fatal: BOM is prohibited .* UTF-16LE" err.out &&
> +
> +	cp lebom.utf16le.raw lebom.utf16le &&
> +	test_must_fail git add lebom.utf16le 2>err.out &&
> +	test_i18ngrep "fatal: BOM is prohibited .* UTF-16LE" err.out &&
> +
> +	# ... and the same for UTF-32
> +	cp bebom.utf32be.raw bebom.utf32be &&
> +	test_must_fail git add bebom.utf32be 2>err.out &&
> +	test_i18ngrep "fatal: BOM is prohibited .* UTF-32BE" err.out &&
> +
> +	cp lebom.utf32le.raw lebom.utf32be &&
> +	test_must_fail git add lebom.utf32be 2>err.out &&
> +	test_i18ngrep "fatal: BOM is prohibited .* UTF-32BE" err.out &&
> +
> +	cp bebom.utf32be.raw bebom.utf32le &&
> +	test_must_fail git add bebom.utf32le 2>err.out &&
> +	test_i18ngrep "fatal: BOM is prohibited .* UTF-32LE" err.out &&
> +
> +	cp lebom.utf32le.raw lebom.utf32le &&
> +	test_must_fail git add lebom.utf32le 2>err.out &&
> +	test_i18ngrep "fatal: BOM is prohibited .* UTF-32LE" err.out &&
> +
> +	# cleanup
> +	git reset --hard HEAD
> +'
> +
> +test_expect_success 'check required UTF BOM' '
> +	echo "*.utf32 text checkout-encoding=utf-32" >>.gitattributes &&
> +
> +	cp nobom.utf16be.raw nobom.utf16 &&
> +	test_must_fail git add nobom.utf16 2>err.out &&
> +	test_i18ngrep "fatal: BOM is required .* UTF-16" err.out &&
> +
> +	cp nobom.utf16le.raw nobom.utf16 &&
> +	test_must_fail git add nobom.utf16 2>err.out &&
> +	test_i18ngrep "fatal: BOM is required .* UTF-16" err.out &&
> +
> +	cp nobom.utf32be.raw nobom.utf32 &&
> +	test_must_fail git add nobom.utf32 2>err.out &&
> +	test_i18ngrep "fatal: BOM is required .* UTF-32" err.out &&
> +
> +	cp nobom.utf32le.raw nobom.utf32 &&
> +	test_must_fail git add nobom.utf32 2>err.out &&
> +	test_i18ngrep "fatal: BOM is required .* UTF-32" err.out &&
> +
> +	# cleanup
> +	rm nobom.utf16 nobom.utf32 &&
> +	git reset --hard HEAD
> +'
> +
> +test_expect_success 'eol conversion for UTF-16 encoded files on checkout' '
> +	printf "one\ntwo\nthree\n" >lf.utf8.raw &&
> +	printf "one\r\ntwo\r\nthree\r\n" >crlf.utf8.raw &&
> +
> +	cat lf.utf8.raw | iconv -f UTF-8 -t UTF-16 >lf.utf16.raw &&
> +	cat crlf.utf8.raw | iconv -f UTF-8 -t UTF-16 >crlf.utf16.raw &&
> +	cp crlf.utf16.raw eol.utf16 &&
> +
> +	git add eol.utf16 &&
> +	git commit -m eol &&
> +
> +	# UTF-16 with CRLF (Windows line endings)
> +	rm eol.utf16 &&
> +	git -c core.eol=crlf checkout eol.utf16 &&
> +	test_cmp_bin crlf.utf16.raw eol.utf16 &&
> +
> +	# UTF-16 with LF (Unix line endings)
> +	rm eol.utf16 &&
> +	git -c core.eol=lf checkout eol.utf16 &&
> +	test_cmp_bin lf.utf16.raw eol.utf16 &&
> +
> +	rm crlf.utf16.raw crlf.utf8.raw lf.utf16.raw lf.utf8.raw &&
> +
> +	# cleanup
> +	git reset --hard HEAD^
> +'
> +
> +test_expect_success 'check unsupported encodings' '
> +
> +	echo "*.nothing text checkout-encoding=" >>.gitattributes &&
> +	printf "nothing" >t.nothing &&
> +	git add t.nothing &&
> +
> +	echo "*.garbage text checkout-encoding=garbage" >>.gitattributes &&
> +	printf "garbage" >t.garbage &&
> +	test_must_fail git add t.garbage 2>err.out &&
> +	test_i18ngrep "fatal: failed to encode" err.out &&
> +
> +	# cleanup
> +	rm err.out &&
> +	git reset --hard HEAD
> +'
> +
> +test_expect_success 'error if encoding round trip is not the same during refresh' '
> +	BEFORE_STATE=$(git rev-parse HEAD) &&
> +
> +	# Skip the UTF-16 filter for the added file
> +	# This simulates a Git version that has no checkoutEncoding support
> +	echo "hallo" >nonsense.utf16 &&
> +	TEST_HASH=$(git hash-object --no-filters -w nonsense.utf16) &&
> +	git update-index --add --cacheinfo 100644 $TEST_HASH nonsense.utf16 &&
> +	COMMIT=$(git commit-tree -p $(git rev-parse HEAD) -m "plain commit" $(git write-tree)) &&
> +	git update-ref refs/heads/master $COMMIT &&
> +
> +	test_must_fail git checkout HEAD^ 2>err.out &&
> +	test_i18ngrep "error: .* overwritten by checkout:" err.out &&
> +
> +	# cleanup
> +	rm err.out &&
> +	git reset --hard $BEFORE_STATE
> +'
> +
> +test_expect_success 'error if encoding garbage is already in Git' '
> +	BEFORE_STATE=$(git rev-parse HEAD) &&
> +
> +	# Skip the UTF-16 filter for the added file
> +	# This simulates a Git version that has no checkoutEncoding support
> +	cp nobom.utf16be.raw nonsense.utf16 &&
> +	TEST_HASH=$(git hash-object --no-filters -w nonsense.utf16) &&
> +	git update-index --add --cacheinfo 100644 $TEST_HASH nonsense.utf16 &&
> +	COMMIT=$(git commit-tree -p $(git rev-parse HEAD) -m "plain commit" $(git write-tree)) &&
> +	git update-ref refs/heads/master $COMMIT &&
> +
> +	git diff 2>err.out &&
> +	test_i18ngrep "error: BOM is required" err.out &&
> +
> +	# cleanup
> +	rm err.out &&
> +	git reset --hard $BEFORE_STATE
> +'
> +
> +test_done
> -- 
> 2.15.1
>