On Wed, Apr 21, 2021 at 10:05:04PM -0700, Tzadik Vanderhoof wrote: Thanks for V3, please see some comments inline. > When git-p4 reads the output from a p4 command, it assumes it will be > 100% UTF-8. If even one character in the output of one p4 command is > not UTF-8, git-p4 crashes with: > > File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList > value = value.decode() UnicodeDecodeError: 'utf-8' codec can't > decode byte Ox93 in position 42: invalid start byte > > This is especially a problem on the "git p4 clone ... @all" command, > where git-p4 needs to read thousands of changeset descriptions, one of > which may have a stray smart quote, causing the whole clone operation > to fail. > > This pull request adds a new config setting, allowing git-p4 to try > another encoding (for example, "cp1252") and/or use the Unicode replacement > character, to prevent the whole program from crashing on such a minor problem. "This pull request" is somewhat superflous wording. How about: Add a new config setting, allowing git-p4 to try a fallback encoding (for example, "cp1252") and/or use the Unicode replacement character, to prevent the whole program from crashing on such a minor problem. Documentation is good (and needed, and neccessary). Probably this is then not needed: > See the documentation included in the patch for more details of how > the new config setting works. > > Signed-off-by: Tzadik Vanderhoof <tzadik.vanderhoof@xxxxxxxxx> > --- > Documentation/git-p4.txt | 10 ++++ > git-p4.py | 11 +++- > t/t9835-git-p4-config-fallback-encoding.sh | 65 ++++++++++++++++++++++ > 3 files changed, 85 insertions(+), 1 deletion(-) > create mode 100755 t/t9835-git-p4-config-fallback-encoding.sh > > diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt > index f89e68b..e0131a9 100644 > --- a/Documentation/git-p4.txt > +++ b/Documentation/git-p4.txt > @@ -638,6 +638,16 @@ git-p4.pathEncoding:: > to transcode the paths to UTF-8. As an example, Perforce on Windows > often uses "cp1252" to encode path names. > > +git-p4.fallbackEncoding:: > + Perforce changeset descriptions can be in a mixture of encodings. > + Git-p4 first tries to interpret each description as UTF-8. If that > + fails, this config allows another encoding to be tried. You > + can specify, for example, "cp1252". That looks OK according to https://docs.python.org/3/library/codecs.html#standard-encodings > + If instead of an encoding, > + you specify "replace", UTF-8 will be used, with invalid UTF-8 > + characters replaced by the Unicode replacement character. If you > + specify "none" (the default), there is no fallback, and any non > + UTF-8 character will cause git-p4 to immediately fail. > + May be, that is a matter of taste: > + If git-p4.fallbackEncoding is "replace" ", UTF-8 will be used, with invalid UTF-8 > + characters replaced by the Unicode replacement character. > + The default is "none": there is no fallback, and any non > + UTF-8 character will cause git-p4 to immediately fail. > + > git-p4.largeFileSystem:: > Specify the system that is used for large (binary) files. Please note > that large file systems do not support the 'git p4 submit' command. > diff --git a/git-p4.py b/git-p4.py > index 09c9e93..3364287 100755 > --- a/git-p4.py > +++ b/git-p4.py > @@ -771,7 +771,16 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False, > for key, value in entry.items(): > key = key.decode() > if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')): > - value = value.decode() > + try: > + value = value.decode() > + except UnicodeDecodeError as ex: > + fallbackEncoding = gitConfig("git-p4.fallbackEncoding").lower() or 'none' > + if fallbackEncoding == 'none': > + raise Exception("UTF8 decoding failed. Consider using git config git-p4.fallbackEncoding") > + elif fallbackEncoding == 'replace': > + value = value.decode(errors='replace') > + else: > + value = value.decode(encoding=fallbackEncoding) > decoded_entry[key] = value > # Parse out data if it's an error response > if decoded_entry.get('code') == 'error' and 'data' in decoded_entry: > diff --git a/t/t9835-git-p4-config-fallback-encoding.sh b/t/t9835-git-p4-config-fallback-encoding.sh > new file mode 100755 > index 0000000..56a245e > --- /dev/null > +++ b/t/t9835-git-p4-config-fallback-encoding.sh > @@ -0,0 +1,65 @@ > +#!/bin/sh > + > +test_description='test git-p4.fallbackEncoding config' > + > +GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main > +export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME > + > +. ./lib-git-p4.sh > + > +if test_have_prereq !MINGW,!CYGWIN; then > + skip_all='This system is not subject to encoding failures in "git p4 clone"' > + test_done > +fi Out of curiosity: Why are Windows versions (MINGW, CYGWIN) excluded ? > + > +test_expect_success 'start p4d' ' > + start_p4d > +' > + > +test_expect_success 'add cp1252 description' ' > + cd "$cli" && > + echo file1 >file1 && > + p4 add file1 && > + p4 submit -d documentación > +' > + > +test_expect_success 'clone fails with git-p4.fallbackEncoding unset' ' > + test_might_fail git config --global --unset git-p4.fallbackEncoding && > + test_when_finished cleanup_git && > + ( > + test_must_fail git p4 clone --dest="$git" //depot@all 2>> actual && Should this be >actual ? And please no ' ' between the '>' and the filename. > + grep "UTF8 decoding failed. Consider using git config git-p4.fallbackEncoding" actual > + ) > +' > +test_expect_success 'clone fails with git-p4.fallbackEncoding set to "none"' ' > + git config --global git-p4.fallbackEncoding none && > + test_when_finished cleanup_git && > + ( > + test_must_fail git p4 clone --dest="$git" //depot@all 2>> actual && Same here 2 >actual > + grep "UTF8 decoding failed. Consider using git config git-p4.fallbackEncoding" actual > + ) > +' > + > +test_expect_success 'clone succeeds with git-p4.fallbackEncoding set to "cp1252"' ' > + git config --global git-p4.fallbackEncoding cp1252 && > + test_when_finished cleanup_git && > + ( > + git p4 clone --dest="$git" //depot@all && > + cd "$git" && > + git log --oneline >log && > + desc=$(head -1 log | awk '\''{print $2}'\'') && [ "$desc" = "documentación" ] Style nit: See Documentation/CodingGuidelines: - We prefer "test" over "[ ... ]". desc=$(head -1 log | awk '\''{print $2}'\'') && test "$desc" = "documentación" > + ) > +' > + > +test_expect_success 'clone succeeds with git-p4.fallbackEncoding set to "replace"' ' > + git config --global git-p4.fallbackEncoding replace && > + test_when_finished cleanup_git && > + ( > + git p4 clone --dest="$git" //depot@all && > + cd "$git" && > + git log --oneline >log && > + desc=$(head -1 log | awk '\''{print $2}'\'') && [ "$desc" = "documentaci�n" ] > + ) > +' > + > +test_done > -- > 2.31.1.windows.1 > Are there any more comments from the p4 experts ?