From: Lars Schneider <larsxschneider@xxxxxxxxx>

Hi,

Patches 1-4, 6 are preparation and helper functions.
Patch 5,7 are the actual change.

This series depends on Torsten's 8462ff43e4 (convert_to_git():
safe_crlf/checksafe becomes int conv_flags, 2018-01-13) which is
already in master.

Changes since v6:

* use consistent casing for core.checkRoundtripEncoding (Junio)
* fix gibberish in commit message (Junio)
* improve documentation (Torsten)
* improve advise messages (Torsten)

Thanks,
Lars

  RFC: https://public-inbox.org/git/BDB9B884-6D17-4BE3-A83C-F67E2AFA2B46@xxxxxxxxx/
   v1: https://public-inbox.org/git/20171211155023.1405-1-lars.schneider@xxxxxxxxxxxx/
   v2: https://public-inbox.org/git/20171229152222.39680-1-lars.schneider@xxxxxxxxxxxx/
   v3: https://public-inbox.org/git/20180106004808.77513-1-lars.schneider@xxxxxxxxxxxx/
   v4: https://public-inbox.org/git/20180120152418.52859-1-lars.schneider@xxxxxxxxxxxx/
   v5: https://public-inbox.org/git/20180129201855.9182-1-tboegi@xxxxxx/
   v6: https://public-inbox.org/git/20180209132830.55385-1-lars.schneider@xxxxxxxxxxxx/

Base Ref:
Web-Diff: https://github.com/larsxschneider/git/commit/2b94bec353
Checkout: git fetch https://github.com/larsxschneider/git encoding-v7 && git checkout 2b94bec353

### Interdiff (v6..v7):

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index ea5a9509c6..10cb37795d 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -291,19 +291,20 @@ the content is reencoded back to the specified encoding.
 Please note that using the `working-tree-encoding` attribute may have a
 number of pitfalls:

-- Git clients that do not support the `working-tree-encoding` attribute
-  will checkout the respective files UTF-8 encoded and not in the
-  expected encoding. Consequently, these files will appear different
-  which typically causes trouble. This is in particular the case for
-  older Git versions and alternative Git implementations such as JGit
-  or libgit2 (as of February 2018).
+- Third party Git implementations that do not support the
+  `working-tree-encoding` attribute will checkout the respective files
+  UTF-8 encoded and not in the expected encoding. Consequently, these
+  files will appear different which typically causes trouble. This is
+  in particular the case for older Git versions and alternative Git
+  implementations such as JGit or libgit2 (as of February 2018).

 - Reencoding content to non-UTF encodings can cause errors as the
   conversion might not be UTF-8 round trip safe. If you suspect your
-  encoding to not be round trip safe, then add it to `core.checkRoundtripEncoding`
-  to make Git check the round trip encoding (see linkgit:git-config[1]).
-  SHIFT-JIS (Japanese character set) is known to have round trip issues
-  with UTF-8 and is checked by default.
+  encoding to not be round trip safe, then add it to
+  `core.checkRoundtripEncoding` to make Git check the round trip
+  encoding (see linkgit:git-config[1]). SHIFT-JIS (Japanese character
+  set) is known to have round trip issues with UTF-8 and is checked by
+  default.

 - Reencoding content requires resources that might slow down certain
   Git operations (e.g 'git checkout' or 'git add').
@@ -327,7 +328,7 @@ explicitly define the line endings with `eol` if the `working-tree-encoding`
 attribute is used to avoid ambiguity.

 ------------------------
-*.proj	working-tree-encoding=UTF-16LE text eol=CRLF
+*.proj	text working-tree-encoding=UTF-16LE eol=CRLF
 ------------------------

 You can get a list of all available encodings on your platform with the
diff --git a/convert.c b/convert.c
index 71dffc7167..398cd9cf7b 100644
--- a/convert.c
+++ b/convert.c
@@ -352,29 +352,29 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 	if (has_prohibited_utf_bom(enc->name, src, src_len)) {
 		const char *error_msg = _(
-			"BOM is prohibited for '%s' if encoded as %s");
+			"BOM is prohibited in '%s' if encoded as %s");
+		/*
+		 * This advise is shown for UTF-??BE and UTF-??LE encodings.
+		 * We truncate the encoding name to 6 chars with %.6s to cut
+		 * off the last two "byte order" characters.
+		 */
 		const char *advise_msg = _(
-			"You told Git to treat '%s' as %s. A byte order mark "
-			"(BOM) is prohibited with this encoding. Either use "
-			"%.6s as working tree encoding or remove the BOM from the "
-			"file.");
-
-		advise(advise_msg, path, enc->name, enc->name, enc->name);
+			"The file '%s' contains a byte order mark (BOM). "
+			"Please use %.6s as working-tree-encoding.");
+		advise(advise_msg, path, enc->name);
 		if (conv_flags & CONV_WRITE_OBJECT)
 			die(error_msg, path, enc->name);
 		else
 			error(error_msg, path, enc->name);
-
 	} else if (is_missing_required_utf_bom(enc->name, src, src_len)) {
 		const char *error_msg = _(
-			"BOM is required for '%s' if encoded as %s");
+			"BOM is required in '%s' if encoded as %s");
 		const char *advise_msg = _(
-			"You told Git to treat '%s' as %s. A byte order mark "
-			"(BOM) is required with this encoding. Either use "
-			"%sBE/%sLE as working tree encoding or add a BOM to the "
-			"file.");
-		advise(advise_msg, path, enc->name, enc->name, enc->name);
+			"The file '%s' is missing a byte order mark (BOM). "
+			"Please use %sBE or %sLE (depending on the byte order) "
+			"as working-tree-encoding.");
+		advise(advise_msg, path, enc->name, enc->name);
 		if (conv_flags & CONV_WRITE_OBJECT)
 			die(error_msg, path, enc->name);
 		else
@@ -405,7 +405,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len,
 	 * Unicode aims to be a superset of all other character encodings.
 	 * However, certain encodings (e.g. SHIFT-JIS) are known to have round
 	 * trip issues [2]. Check the round trip conversion for all encodings
-	 * listed in core.checkRoundTripEncoding.
+	 * listed in core.checkRoundtripEncoding.
 	 *
 	 * The round trip check is only performed if content is written to Git.
 	 * This ensures that no information is lost during conversion to/from
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 5dcdd5f899..e4717402a5 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -221,10 +221,10 @@ test_expect_success 'check roundtrip encoding' '
 	git reset &&

 	# ... unless we overwrite the Git config!
-	test_config core.checkRoundTripEncoding "garbage" &&
+	test_config core.checkRoundtripEncoding "garbage" &&
 	! GIT_TRACE=1 git add .gitattributes roundtrip.shift 2>&1 >/dev/null |
 		grep "Checking roundtrip encoding for SHIFT-JIS" &&
-	test_unconfig core.checkRoundTripEncoding &&
+	test_unconfig core.checkRoundtripEncoding &&
 	git reset &&

 	# UTF-16 encoded files should not be round-trip checked by default...
@@ -233,14 +233,14 @@ test_expect_success 'check roundtrip encoding' '
 	git reset &&

 	# ... unless we tell Git to check it!
-	test_config_global core.checkRoundTripEncoding "UTF-16, UTF-32" &&
+	test_config_global core.checkRoundtripEncoding "UTF-16, UTF-32" &&
 	GIT_TRACE=1 git add roundtrip.utf16 2>&1 >/dev/null |
 		grep "Checking roundtrip encoding for UTF-16" &&
 	git reset &&

 	# ... unless we tell Git to check it!
 	# (here we also check that the casing of the encoding is irrelevant)
-	test_config_global core.checkRoundTripEncoding "UTF-32, utf-16" &&
+	test_config_global core.checkRoundtripEncoding "UTF-32, utf-16" &&
 	GIT_TRACE=1 git add roundtrip.utf16 2>&1 >/dev/null |
 		grep "Checking roundtrip encoding for UTF-16" &&
 	git reset &&

### Patches

Lars Schneider (7):
  strbuf: remove unnecessary NUL assignment in xstrdup_tolower()
  strbuf: add xstrdup_toupper()
  utf8: add function to detect prohibited UTF-16/32 BOM
  utf8: add function to detect a missing UTF-16/32 BOM
  convert: add 'working-tree-encoding' attribute
  convert: add tracing for 'working-tree-encoding' attribute
  convert: add round trip check based on 'core.checkRoundtripEncoding'

 Documentation/config.txt         |   6 +
 Documentation/gitattributes.txt  |  74 +++++++++++
 config.c                         |   5 +
 convert.c                        | 256 ++++++++++++++++++++++++++++++++++++++-
 convert.h                        |   2 +
 environment.c                    |   1 +
 sha1_file.c                      |   2 +-
 strbuf.c                         |  13 +-
 strbuf.h                         |   1 +
 t/t0028-working-tree-encoding.sh | 253 ++++++++++++++++++++++++++++++++++++++
 utf8.c                           |  37 ++++++
 utf8.h                           |  25 ++++
 12 files changed, 672 insertions(+), 3 deletions(-)
 create mode 100755 t/t0028-working-tree-encoding.sh

base-commit: 8a2f0888555ce46ac87452b194dec5cb66fb1417
-- 
2.16.1
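
For reviewers who want to see what the new advise strings in the convert.c
part of the interdiff expand to, here is a minimal standalone sketch. It is
not code from the series: the two helper functions and the buffer handling
are hypothetical, and it uses plain snprintf() instead of Git's advise()
and _(). Only the format strings are taken from the interdiff. The point is
that a precision of 6 in "%.6s" prints at most six bytes of the argument,
so "UTF-16LE" or "UTF-32BE" is shown as "UTF-16" or "UTF-32".

```c
#include <stdio.h>

/*
 * Hypothetical helper, not part of Git: formats the advice for a file
 * that contains a BOM although its encoding (e.g. UTF-16LE) prohibits
 * one. "%.6s" cuts the encoding name after six bytes, which drops the
 * trailing "LE"/"BE".
 */
static int format_prohibited_bom_advice(char *buf, size_t len,
					const char *path, const char *enc)
{
	return snprintf(buf, len,
			"The file '%s' contains a byte order mark (BOM). "
			"Please use %.6s as working-tree-encoding.",
			path, enc);
}

/*
 * Hypothetical helper, not part of Git: formats the advice for a file
 * that lacks a BOM although its encoding (e.g. UTF-16) requires one.
 * The encoding name is passed twice, once per "%s".
 */
static int format_missing_bom_advice(char *buf, size_t len,
				     const char *path, const char *enc)
{
	return snprintf(buf, len,
			"The file '%s' is missing a byte order mark (BOM). "
			"Please use %sBE or %sLE (depending on the byte order) "
			"as working-tree-encoding.",
			path, enc, enc);
}
```

Calling format_prohibited_bom_advice() with the encoding "UTF-16LE" yields
a message that suggests "UTF-16", and format_missing_bom_advice() with
"UTF-16" suggests "UTF-16BE or UTF-16LE". The truncation relies on the
encoding name having exactly the UTF-??BE/UTF-??LE shape, which is what
the comment added in the interdiff states.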