On Sat, Feb 24, 2018 at 11:28 AM, <lars.schneider@xxxxxxxxxxxx> wrote: > UTF supports lossless conversion round tripping and conversions between > UTF and other encodings are mostly round trip safe as Unicode aims to be > a superset of all other character encodings. However, certain encodings > (e.g. SHIFT-JIS) are known to have round trip issues [1]. > > Add 'core.checkRoundtripEncoding', which contains a comma separated > list of encodings, to define for what encodings Git should check the > conversion round trip if they are used in the 'working-tree-encoding' > attribute. > > Set SHIFT-JIS as default value for 'core.checkRoundtripEncoding'. > > [1] https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode > > Signed-off-by: Lars Schneider <larsxschneider@xxxxxxxxx> > --- > diff --git a/convert.c b/convert.c > @@ -289,6 +289,39 @@ static void trace_encoding(const char *context, const char *path, > +static int check_roundtrip(const char* enc_name) > +{ > + /* > + * check_roundtrip_encoding contains a string of space and/or > + * comma separated encodings (eg. "UTF-16, ASCII, CP1125"). > + * Search for the given encoding in that string. > + */ > + const char *found = strcasestr(check_roundtrip_encoding, enc_name); > + const char *next = found + strlen(enc_name); Is this pointer arithmetic undefined behavior (according to the C standard) if NULL is returned by strcasestr()? If so, how comfortable are we with this? Perhaps if you add an 'if' into the mix, you can avoid it: if (found) { const char *next = found + strlen(enc_name); return ...long complicated expression...; } return false; > + int len = strlen(check_roundtrip_encoding); > + return (found && ( > + /* > + * check that the found encoding is at the > + * beginning of check_roundtrip_encoding or > + * that it is prefixed with a space or comma > + */ > + found == check_roundtrip_encoding || ( > + found > check_roundtrip_encoding && > + (*(found-1) == ' ' || *(found-1) == ',') Although the 'found > check_roundtrip_encoding' expression is effectively dead code in this case, it does document that the programmer didn't just blindly use '*(found-1)' without taking boundary conditions into consideration. (It's dead code because, at this point, we know that strcasestr() didn't return NULL and we know that 'found' doesn't equal 'check_roundtrip_encoding', therefore it _must_ be greater than 'check_roundtrip_encoding'.) Just an observation; not a request to remove the dead code since it has some documentation value. > + ) > + ) && ( > + /* > + * check that the found encoding is at the > + * end of check_roundtrip_encoding or > + * that it is suffixed with a space or comma > + */ > + next == check_roundtrip_encoding + len || ( > + next < check_roundtrip_encoding + len && > + (*next == ' ' || *next == ',') > + ) > + )); > +} > @@ -366,6 +399,47 @@ static int encode_to_git(const char *path, const char *src, size_t src_len, > + if ((conv_flags & CONV_WRITE_OBJECT) && check_roundtrip(enc->name)) { > + re_src = reencode_string_len(dst, dst_len, > + enc->name, default_encoding, > + &re_src_len); > + > + trace_printf("Checking roundtrip encoding for %s...\n", enc->name); > + trace_encoding("reencoded source", path, enc->name, > + re_src, re_src_len); > + > + if (!re_src || src_len != re_src_len || > + memcmp(src, re_src, src_len)) { > + const char* msg = _("encoding '%s' from %s to %s and " > + "back is not the same"); > + die(msg, path, enc->name, default_encoding); Last two arguments need to be swapped. > + } > + > + free(re_src); > + } > + > strbuf_attach(buf, dst, dst_len, dst_len + 1); > return 1; > } > diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh > @@ -225,4 +225,45 @@ test_expect_success 'error if encoding garbage is already in Git' ' > +test_expect_success 'check roundtrip encoding' ' > + text="hallo there!\nroundtrip test here!" && > + printf "$text" | iconv -f UTF-8 -t SHIFT-JIS >roundtrip.shift && > + printf "$text" | iconv -f UTF-8 -t UTF-16 >roundtrip.utf16 && > + echo "*.shift text working-tree-encoding=SHIFT-JIS" >>.gitattributes && > + > + # SHIFT-JIS encoded files are round-trip checked by default... > + GIT_TRACE=1 git add .gitattributes roundtrip.shift 2>&1 >/dev/null | > + grep "Checking roundtrip encoding for SHIFT-JIS" && Why redirect to /dev/null? If something does go wrong somewhere, the more output available for debugging the problem, the better, so throwing it away unnecessarily seems contraindicated. > + git reset && > + > + # ... unless we overwrite the Git config! > + test_config core.checkRoundtripEncoding "garbage" && > + ! GIT_TRACE=1 git add .gitattributes roundtrip.shift 2>&1 >/dev/null | > + grep "Checking roundtrip encoding for SHIFT-JIS" && > + test_unconfig core.checkRoundtripEncoding && The "unconfig" won't take place if the test fails. Instead of test_config/test_unconfig, you could use '-c' to set the config transiently for the git-add operation: ! GIT_TRACE=1 git -c core.checkRoundtripEncoding=garbage add ... > + git reset && > + > + # UTF-16 encoded files should not be round-trip checked by default... > + ! GIT_TRACE=1 git add roundtrip.utf16 2>&1 >/dev/null | > + grep "Checking roundtrip encoding for UTF-16" && > + git reset && > + > + # ... unless we tell Git to check it! > + test_config_global core.checkRoundtripEncoding "UTF-16, UTF-32" && > + GIT_TRACE=1 git add roundtrip.utf16 2>&1 >/dev/null | > + grep "Checking roundtrip encoding for UTF-16" && > + git reset && > + > + # ... unless we tell Git to check it! > + # (here we also check that the casing of the encoding is irrelevant) > + test_config_global core.checkRoundtripEncoding "UTF-32, utf-16" && > + GIT_TRACE=1 git add roundtrip.utf16 2>&1 >/dev/null | > + grep "Checking roundtrip encoding for UTF-16" && > + git reset && > + > + # cleanup > + rm roundtrip.shift roundtrip.utf16 && > + git reset --hard HEAD > +' Same comment as for patch 5/7: This cleanup won't happen if the test fails. Instead, use test_when_finished() earlier in the test: test_expect_success 'check roundtrip encoding' ' test_when_finished "rm -f roundtrip.shift roundtrip.utf16; git reset --hard HEAD" && ...