[PATCH v7 0/7] convert: add support for different encodings

From: Lars Schneider <larsxschneider@xxxxxxxxx>

Hi,

Patches 1-4 and 6 are preparation and helper functions.
Patches 5 and 7 are the actual change.

This series depends on Torsten's 8462ff43e4 (convert_to_git():
safe_crlf/checksafe becomes int conv_flags, 2018-01-13) which is already
in master.

Changes since v6:

* use consistent casing for core.checkRoundtripEncoding (Junio)
* fix gibberish in commit message (Junio)
* improve documentation (Torsten)
* improve advice messages (Torsten)


Thanks,
Lars

  RFC: https://public-inbox.org/git/BDB9B884-6D17-4BE3-A83C-F67E2AFA2B46@xxxxxxxxx/
   v1: https://public-inbox.org/git/20171211155023.1405-1-lars.schneider@xxxxxxxxxxxx/
   v2: https://public-inbox.org/git/20171229152222.39680-1-lars.schneider@xxxxxxxxxxxx/
   v3: https://public-inbox.org/git/20180106004808.77513-1-lars.schneider@xxxxxxxxxxxx/
   v4: https://public-inbox.org/git/20180120152418.52859-1-lars.schneider@xxxxxxxxxxxx/
   v5: https://public-inbox.org/git/20180129201855.9182-1-tboegi@xxxxxx/
   v6: https://public-inbox.org/git/20180209132830.55385-1-lars.schneider@xxxxxxxxxxxx/


Base Ref:
Web-Diff: https://github.com/larsxschneider/git/commit/2b94bec353
Checkout: git fetch https://github.com/larsxschneider/git encoding-v7 && git checkout 2b94bec353


### Interdiff (v6..v7):

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index ea5a9509c6..10cb37795d 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -291,19 +291,20 @@ the content is reencoded back to the specified encoding.
 Please note that using the `working-tree-encoding` attribute may have a
 number of pitfalls:

-- Git clients that do not support the `working-tree-encoding` attribute
-  will checkout the respective files UTF-8 encoded and not in the
-  expected encoding. Consequently, these files will appear different
-  which typically causes trouble. This is in particular the case for
-  older Git versions and alternative Git implementations such as JGit
-  or libgit2 (as of February 2018).
+- Third party Git implementations that do not support the
+  `working-tree-encoding` attribute will checkout the respective files
+  UTF-8 encoded and not in the expected encoding. Consequently, these
+  files will appear different which typically causes trouble. This is
+  in particular the case for older Git versions and alternative Git
+  implementations such as JGit or libgit2 (as of February 2018).

 - Reencoding content to non-UTF encodings can cause errors as the
   conversion might not be UTF-8 round trip safe. If you suspect your
-  encoding to not be round trip safe, then add it to `core.checkRoundtripEncoding`
-  to make Git check the round trip encoding (see linkgit:git-config[1]).
-  SHIFT-JIS (Japanese character set) is known to have round trip issues
-  with UTF-8 and is checked by default.
+  encoding to not be round trip safe, then add it to
+  `core.checkRoundtripEncoding` to make Git check the round trip
+  encoding (see linkgit:git-config[1]). SHIFT-JIS (Japanese character
+  set) is known to have round trip issues with UTF-8 and is checked by
+  default.

 - Reencoding content requires resources that might slow down certain
   Git operations (e.g 'git checkout' or 'git add').
@@ -327,7 +328,7 @@ explicitly define the line endings with `eol` if the `working-tree-encoding`
 attribute is used to avoid ambiguity.

 ------------------------
-*.proj 		working-tree-encoding=UTF-16LE text eol=CRLF
+*.proj 		text working-tree-encoding=UTF-16LE eol=CRLF
 ------------------------

 You can get a list of all available encodings on your platform with the
diff --git a/convert.c b/convert.c
index 71dffc7167..398cd9cf7b 100644
--- a/convert.c
+++ b/convert.c
@@ -352,29 +352,29 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,

 	if (has_prohibited_utf_bom(enc->name, src, src_len)) {
 		const char *error_msg = _(
-			"BOM is prohibited for '%s' if encoded as %s");
+			"BOM is prohibited in '%s' if encoded as %s");
+		/*
+		 * This advise is shown for UTF-??BE and UTF-??LE encodings.
+		 * We truncate the encoding name to 6 chars with %.6s to cut
+		 * off the last two "byte order" characters.
+		 */
 		const char *advise_msg = _(
-			"You told Git to treat '%s' as %s. A byte order mark "
-			"(BOM) is prohibited with this encoding. Either use "
-			"%.6s as working tree encoding or remove the BOM from the "
-			"file.");
-
-		advise(advise_msg, path, enc->name, enc->name, enc->name);
+			"The file '%s' contains a byte order mark (BOM). "
+			"Please use %.6s as working-tree-encoding.");
+		advise(advise_msg, path, enc->name);
 		if (conv_flags & CONV_WRITE_OBJECT)
 			die(error_msg, path, enc->name);
 		else
 			error(error_msg, path, enc->name);

-
 	} else if (is_missing_required_utf_bom(enc->name, src, src_len)) {
 		const char *error_msg = _(
-			"BOM is required for '%s' if encoded as %s");
+			"BOM is required in '%s' if encoded as %s");
 		const char *advise_msg = _(
-			"You told Git to treat '%s' as %s. A byte order mark "
-			"(BOM) is required with this encoding. Either use "
-			"%sBE/%sLE as working tree encoding or add a BOM to the "
-			"file.");
-		advise(advise_msg, path, enc->name, enc->name, enc->name);
+			"The file '%s' is missing a byte order mark (BOM). "
+			"Please use %sBE or %sLE (depending on the byte order) "
+			"as working-tree-encoding.");
+		advise(advise_msg, path, enc->name, enc->name);
 		if (conv_flags & CONV_WRITE_OBJECT)
 			die(error_msg, path, enc->name);
 		else
@@ -405,7 +405,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 	 * Unicode aims to be a superset of all other character encodings.
 	 * However, certain encodings (e.g. SHIFT-JIS) are known to have round
 	 * trip issues [2]. Check the round trip conversion for all encodings
-	 * listed in core.checkRoundTripEncoding.
+	 * listed in core.checkRoundtripEncoding.
 	 *
 	 * The round trip check is only performed if content is written to Git.
 	 * This ensures that no information is lost during conversion to/from
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 5dcdd5f899..e4717402a5 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -221,10 +221,10 @@ test_expect_success 'check roundtrip encoding' '
 	git reset &&

 	# ... unless we overwrite the Git config!
-	test_config core.checkRoundTripEncoding "garbage" &&
+	test_config core.checkRoundtripEncoding "garbage" &&
 	! GIT_TRACE=1 git add .gitattributes roundtrip.shift 2>&1 >/dev/null |
 		grep "Checking roundtrip encoding for SHIFT-JIS" &&
-	test_unconfig core.checkRoundTripEncoding &&
+	test_unconfig core.checkRoundtripEncoding &&
 	git reset &&

 	# UTF-16 encoded files should not be round-trip checked by default...
@@ -233,14 +233,14 @@ test_expect_success 'check roundtrip encoding' '
 	git reset &&

 	# ... unless we tell Git to check it!
-	test_config_global core.checkRoundTripEncoding "UTF-16, UTF-32" &&
+	test_config_global core.checkRoundtripEncoding "UTF-16, UTF-32" &&
 	GIT_TRACE=1 git add roundtrip.utf16 2>&1 >/dev/null |
 		grep "Checking roundtrip encoding for UTF-16" &&
 	git reset &&

 	# ... unless we tell Git to check it!
 	# (here we also check that the casing of the encoding is irrelevant)
-	test_config_global core.checkRoundTripEncoding "UTF-32, utf-16" &&
+	test_config_global core.checkRoundtripEncoding "UTF-32, utf-16" &&
 	GIT_TRACE=1 git add roundtrip.utf16 2>&1 >/dev/null |
 		grep "Checking roundtrip encoding for UTF-16" &&
 	git reset &&
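
The convert.c hunk above relies on the `%.6s` precision to turn an encoding
name such as "UTF-16LE" into "UTF-16" inside the advice text. A minimal
standalone sketch of that formatting trick follows. It is not part of the
series: `format_bom_advice` is a hypothetical helper name, and it writes to a
caller-supplied buffer with snprintf() instead of calling Git's advise().

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper, for illustration only (not Git's API).
 *
 * Formats the "BOM is prohibited" advice text from the interdiff into `out`.
 * "%.6s" prints at most six characters of enc_name, which drops the two
 * trailing byte order characters of names like "UTF-16LE" or "UTF-32BE".
 * Returns the snprintf() result (length the full message would need).
 */
int format_bom_advice(char *out, size_t out_size,
		      const char *path, const char *enc_name)
{
	assert(out && out_size > 0 && path && enc_name);
	return snprintf(out, out_size,
			"The file '%s' contains a byte order mark (BOM). "
			"Please use %.6s as working-tree-encoding.",
			path, enc_name);
}
```

Calling `format_bom_advice(buf, sizeof(buf), "a.txt", "UTF-16LE")` yields a
message that suggests "UTF-16". The truncation only gives a sensible name
because this advice is shown for UTF-??BE and UTF-??LE encodings, as the
comment in the hunk points out.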


### Patches

Lars Schneider (7):
  strbuf: remove unnecessary NUL assignment in xstrdup_tolower()
  strbuf: add xstrdup_toupper()
  utf8: add function to detect prohibited UTF-16/32 BOM
  utf8: add function to detect a missing UTF-16/32 BOM
  convert: add 'working-tree-encoding' attribute
  convert: add tracing for 'working-tree-encoding' attribute
  convert: add round trip check based on 'core.checkRoundtripEncoding'

 Documentation/config.txt         |   6 +
 Documentation/gitattributes.txt  |  74 +++++++++++
 config.c                         |   5 +
 convert.c                        | 256 ++++++++++++++++++++++++++++++++++++++-
 convert.h                        |   2 +
 environment.c                    |   1 +
 sha1_file.c                      |   2 +-
 strbuf.c                         |  13 +-
 strbuf.h                         |   1 +
 t/t0028-working-tree-encoding.sh | 253 ++++++++++++++++++++++++++++++++++++++
 utf8.c                           |  37 ++++++
 utf8.h                           |  25 ++++
 12 files changed, 672 insertions(+), 3 deletions(-)
 create mode 100755 t/t0028-working-tree-encoding.sh
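
For readers who want a feel for the BOM rules behind patches 3 and 4, here is
a rough standalone sketch. It assumes the rules implied by the error messages
in the interdiff: UTF-16BE/LE and UTF-32BE/LE must not start with a BOM, while
plain UTF-16 and UTF-32 must. The `sketch_*` function names and the helper
layout are made up for illustration and are not the code in utf8.c.

```c
#include <stddef.h>
#include <string.h>
#include <strings.h> /* strcasecmp() (POSIX) */

static const unsigned char bom16_be[] = { 0xFE, 0xFF };
static const unsigned char bom16_le[] = { 0xFF, 0xFE };
static const unsigned char bom32_be[] = { 0x00, 0x00, 0xFE, 0xFF };
static const unsigned char bom32_le[] = { 0xFF, 0xFE, 0x00, 0x00 };

/* Return 1 if the first bom_len bytes of data equal bom, else 0. */
static int starts_with_bytes(const char *data, size_t len,
			     const unsigned char *bom, size_t bom_len)
{
	return data && len >= bom_len && !memcmp(data, bom, bom_len);
}

static int has_utf16_bom(const char *data, size_t len)
{
	return starts_with_bytes(data, len, bom16_be, sizeof(bom16_be)) ||
	       starts_with_bytes(data, len, bom16_le, sizeof(bom16_le));
}

static int has_utf32_bom(const char *data, size_t len)
{
	return starts_with_bytes(data, len, bom32_be, sizeof(bom32_be)) ||
	       starts_with_bytes(data, len, bom32_le, sizeof(bom32_le));
}

/* Hypothetical: 1 if enc names a fixed byte order but data has a BOM. */
int sketch_has_prohibited_bom(const char *enc, const char *data, size_t len)
{
	if (!strcasecmp(enc, "UTF-16BE") || !strcasecmp(enc, "UTF-16LE"))
		return has_utf16_bom(data, len);
	if (!strcasecmp(enc, "UTF-32BE") || !strcasecmp(enc, "UTF-32LE"))
		return has_utf32_bom(data, len);
	return 0;
}

/* Hypothetical: 1 if enc leaves the byte order open but data has no BOM. */
int sketch_is_missing_required_bom(const char *enc, const char *data,
				   size_t len)
{
	if (!strcasecmp(enc, "UTF-16"))
		return !has_utf16_bom(data, len);
	if (!strcasecmp(enc, "UTF-32"))
		return !has_utf32_bom(data, len);
	return 0;
}
```

With this sketch, content beginning with the bytes FF FE counts as a
prohibited BOM when the attribute says UTF-16LE, and content without any BOM
counts as missing one when the attribute says UTF-16. Encoding names are
compared case-insensitively here, which is an assumption of the sketch.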


base-commit: 8a2f0888555ce46ac87452b194dec5cb66fb1417
--
2.16.1



