From: Lars Schneider <larsxschneider@xxxxxxxxx>

Hi,

Patches 1-4, 6 are preparation and helper functions.
Patch 5,7 are the actual change.

This series depends on Torsten's 8462ff43e4 (convert_to_git():
safe_crlf/checksafe becomes int conv_flags, 2018-01-13) which is
already in master.

Changes since v7:

* make it clearer in the documentation that Git stores content "as-is"
  by default. Content is only stored in UTF-8 if w-t-e is used (Junio)
* add test case for $GIT_DIR/info/attributes support (Junio)

Thanks,
Lars

  RFC: https://public-inbox.org/git/BDB9B884-6D17-4BE3-A83C-F67E2AFA2B46@xxxxxxxxx/
   v1: https://public-inbox.org/git/20171211155023.1405-1-lars.schneider@xxxxxxxxxxxx/
   v2: https://public-inbox.org/git/20171229152222.39680-1-lars.schneider@xxxxxxxxxxxx/
   v3: https://public-inbox.org/git/20180106004808.77513-1-lars.schneider@xxxxxxxxxxxx/
   v4: https://public-inbox.org/git/20180120152418.52859-1-lars.schneider@xxxxxxxxxxxx/
   v5: https://public-inbox.org/git/20180129201855.9182-1-tboegi@xxxxxx/
   v6: https://public-inbox.org/git/20180209132830.55385-1-lars.schneider@xxxxxxxxxxxx/
   v7: https://public-inbox.org/git/20180215152711.158-1-lars.schneider@xxxxxxxxxxxx/

Base Ref:
Web-Diff: https://github.com/larsxschneider/git/commit/2758a2da29
Checkout: git fetch https://github.com/larsxschneider/git encoding-v8 && git checkout 2758a2da29

### Interdiff (v7..v8):

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 10cb37795d..11315054f4 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -275,11 +275,11 @@ few exceptions. Even though...
 `working-tree-encoding`
 ^^^^^^^^^^^^^^^^^^^^^^^

-Git recognizes files encoded with ASCII or one of its supersets (e.g.
-UTF-8 or ISO-8859-1) as text files. All other encodings are usually
-interpreted as binary and consequently built-in Git text processing
-tools (e.g. 'git diff') as well as most Git web front ends do not
-visualize the content.
+Git recognizes files encoded in ASCII or one of its supersets (e.g.
+UTF-8, ISO-8859-1, ...) as text files. Files encoded in certain other
+encodings (e.g. UTF-16) are interpreted as binary and consequently
+built-in Git text processing tools (e.g. 'git diff') as well as most Git
+web front ends do not visualize the contents of these files by default.

 In these cases you can tell Git the encoding of a file in the working
 directory with the `working-tree-encoding` attribute. If a file with this
@@ -291,12 +291,24 @@ the content is reencoded back to the specified encoding.
 Please note that using the `working-tree-encoding` attribute may have a
 number of pitfalls:

-- Third party Git implementations that do not support the
-  `working-tree-encoding` attribute will checkout the respective files
-  UTF-8 encoded and not in the expected encoding. Consequently, these
-  files will appear different which typically causes trouble. This is
-  in particular the case for older Git versions and alternative Git
-  implementations such as JGit or libgit2 (as of February 2018).
+- Alternative Git implementations (e.g. JGit or libgit2) and older Git
+  versions (as of March 2018) do not support the `working-tree-encoding`
+  attribute. If you decide to use the `working-tree-encoding` attribute
+  in your repository, then it is strongly recommended to ensure that all
+  clients working with the repository support it.
+
+  If you declare `*.proj` files as UTF-16 and you add `foo.proj` with an
+  `working-tree-encoding` enabled Git client, then `foo.proj` will be
+  stored as UTF-8 internally. A client without `working-tree-encoding`
+  support will checkout `foo.proj` as UTF-8 encoded file. This will
+  typically cause trouble for the users of this file.
+
+  If a Git client, that does not support the `working-tree-encoding`
+  attribute, adds a new file `bar.proj`, then `bar.proj` will be
+  stored "as-is" internally (in this example probably as UTF-16).
+  A client with `working-tree-encoding` support will interpret the
+  internal contents as UTF-8 and try to convert it to UTF-16 on checkout.
+  That operation will fail and cause an error.

 - Reencoding content to non-UTF encodings can cause errors as the
   conversion might not be UTF-8 round trip safe. If you suspect your
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index e4717402a5..e34c21eb29 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -13,8 +13,11 @@ test_expect_success 'setup test repo' '
 	echo "*.utf16 text working-tree-encoding=utf-16" >.gitattributes &&
 	printf "$text" >test.utf8.raw &&
 	printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
+	printf "$text" | iconv -f UTF-8 -t UTF-32 >test.utf32.raw &&
 	cp test.utf16.raw test.utf16 &&
+	cp test.utf32.raw test.utf32 &&

+	# Add only UTF-16 file, we will add the UTF-32 file later
 	git add .gitattributes test.utf16 &&
 	git commit -m initial
 '
@@ -24,7 +27,7 @@ test_expect_success 'ensure UTF-8 is stored in Git' '
 	test_cmp_bin test.utf8.raw test.utf16.git &&

 	# cleanup
-	rm test.utf8.raw test.utf16.git
+	rm test.utf16.git
 '

 test_expect_success 're-encode to UTF-16 on checkout' '
@@ -36,6 +39,19 @@ test_expect_success 're-encode to UTF-16 on checkout' '
 	rm test.utf16.raw
 '

+test_expect_success 'check $GIT_DIR/info/attributes support' '
+	echo "*.utf32 text working-tree-encoding=utf-32" >.git/info/attributes &&
+
+	git add test.utf32 &&
+
+	git cat-file -p :test.utf32 >test.utf32.git &&
+	test_cmp_bin test.utf8.raw test.utf32.git &&
+
+	# cleanup
+	git reset --hard HEAD &&
+	rm test.utf8.raw test.utf32.raw test.utf32.git
+'
+
 test_expect_success 'check prohibited UTF BOM' '
 	printf "\0a\0b\0c" >nobom.utf16be.raw &&
 	printf "a\0b\0c\0" >nobom.utf16le.raw &&

### Patches

Lars Schneider (7):
  strbuf: remove unnecessary NUL assignment in xstrdup_tolower()
  strbuf: add xstrdup_toupper()
  utf8: add function to detect prohibited UTF-16/32 BOM
  utf8: add function to detect a missing UTF-16/32 BOM
  convert: add 'working-tree-encoding' attribute
  convert: add tracing for 'working-tree-encoding' attribute
  convert: add round trip check based on 'core.checkRoundtripEncoding'

 Documentation/config.txt         |   6 +
 Documentation/gitattributes.txt  |  86 +++++++++++++
 config.c                         |   5 +
 convert.c                        | 256 ++++++++++++++++++++++++++++++++++++-
 convert.h                        |   2 +
 environment.c                    |   1 +
 sha1_file.c                      |   2 +-
 strbuf.c                         |  13 +-
 strbuf.h                         |   1 +
 t/t0028-working-tree-encoding.sh | 269 +++++++++++++++++++++++++++++++++++++++
 utf8.c                           |  37 ++++++
 utf8.h                           |  25 ++++
 12 files changed, 700 insertions(+), 3 deletions(-)
 create mode 100755 t/t0028-working-tree-encoding.sh


base-commit: 8a2f0888555ce46ac87452b194dec5cb66fb1417
-- 
2.16.1
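
For illustration only (not part of the series): the `bar.proj` pitfall
described in the gitattributes interdiff can be approximated outside of Git
with plain iconv. This is a rough sketch under assumptions: iconv merely
stands in for Git's re-encoding step, the helper name `stored_as_is` and the
sample text are made up, and the exact error Git reports is not shown here.

```shell
#!/bin/sh
# Sketch only: iconv stands in for Git's re-encoding step; this is not
# how Git implements working-tree-encoding internally.

# Hypothetical helper: emits the bytes a client WITHOUT
# working-tree-encoding support would store "as-is" for a UTF-16 file.
# The octal escapes \303\251 are the UTF-8 encoding of U+00E9.
stored_as_is() {
	printf 'h\303\251llo\n' | iconv -f UTF-8 -t UTF-16
}

# A client WITH working-tree-encoding=UTF-16 reads the stored bytes as
# UTF-8 and converts them to UTF-16 on checkout. The stored bytes are
# not valid UTF-8, so the conversion is expected to fail.
if stored_as_is | iconv -f UTF-8 -t UTF-16 >/dev/null 2>&1
then
	result="conversion succeeded"
else
	result="conversion failed"
fi
echo "$result"
```

The non-ASCII character in the sample matters: pure ASCII text encoded as
UTF-16 without a BOM can still look like valid UTF-8 (NUL bytes interleaved
with ASCII), in which case the mismatch would go unnoticed instead of
failing.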