On 29/04/2021 07:39, Tzadik Vanderhoof wrote:
Add git-p4.fallbackEncoding config variable, to prevent git-p4 from
crashing on non UTF-8 changeset descriptions.
When git-p4 reads the output from a p4 command, it assumes it will
be 100% UTF-8. If even one character in the output of one p4 command is
not UTF-8, git-p4 crashes with:
File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList
value = value.decode() UnicodeDecodeError: 'utf-8' codec can't
decode byte Ox93 in position 42: invalid start byte
This is especially a problem for the "git p4 clone ... @all" command,
where git-p4 needs to read thousands of changeset descriptions, one of
which may have a stray smart quote, causing the whole clone operation
to fail.
Add a new config setting, allowing git-p4 to try a fallback encoding
(for example, "cp1252") and/or use the Unicode replacement character,
to prevent the whole program from crashing on such a minor problem.
I think Andrew Oakley pointed this out earlier - but in the days of
Python2 this was (I think) never a problem. Python2 just took in the
binary data, in whatever encoding, and passed it untouched on to git,
which in turn just stored it.
https://lore.kernel.org/git/20210412085251.51475-1-andrew@xxxxxxxxxxxxx/
It was only whatever was trying to render the bytestream that needed to
worry about the encoding.
Now we're making the decision in git-p4 when we ingest it - did we
consider just passing it along untouched?
The problem at hand is that git-p4 is trying to store it internally as a
`string' which now is unicode-aware, when perhaps it should not be.
It's going to get very confusing if anyone ingests something from
Perforce having set the encoding to one thing, and it turns out to be a
different encoding, or worse, multiple encodings for the same repo.
I also worry that if someone has connected to a Unicode-aware Perforce
server, and then unwittingly set P4CHARSET, then are we going to end up
silently scrambling everything?
I'm not sure, encodings make my head hurt.
Luke
Signed-off-by: Tzadik Vanderhoof <tzadik.vanderhoof@xxxxxxxxx>
---
Documentation/git-p4.txt | 9 ++
git-p4.py | 11 ++-
t/t9836-git-p4-config-fallback-encoding.sh | 98 ++++++++++++++++++++++
3 files changed, 117 insertions(+), 1 deletion(-)
create mode 100755 t/t9836-git-p4-config-fallback-encoding.sh
diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt
index f89e68b424..86d3ffa644 100644
--- a/Documentation/git-p4.txt
+++ b/Documentation/git-p4.txt
@@ -638,6 +638,15 @@ git-p4.pathEncoding::
to transcode the paths to UTF-8. As an example, Perforce on Windows
often uses "cp1252" to encode path names.
+git-p4.fallbackEncoding::
+ Perforce changeset descriptions can be stored in any encoding.
+ Git-p4 first tries to interpret each description as UTF-8. If that
+ fails, this config allows another encoding to be tried. You can specify,
+ for example, "cp1252". If git-p4.fallbackEncoding is "replace", UTF-8 will
+ be used, with invalid UTF-8 characters replaced by the Unicode replacement
+ character. The default is "none": there is no fallback, and any non UTF-8
+ character will cause git-p4 to immediately fail.
+
git-p4.largeFileSystem::
Specify the system that is used for large (binary) files. Please note
that large file systems do not support the 'git p4 submit' command.
diff --git a/git-p4.py b/git-p4.py
index 09c9e93ac4..202fb01bdf 100755
--- a/git-p4.py
+++ b/git-p4.py
@@ -771,7 +771,16 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False,
for key, value in entry.items():
key = key.decode()
if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')):
- value = value.decode()
+ try:
+ value = value.decode()
+ except UnicodeDecodeError:
+ fallbackEncoding = gitConfig("git-p4.fallbackEncoding").lower() or 'none'
+ if fallbackEncoding == 'none':
+ raise Exception("UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding")
+ elif fallbackEncoding == 'replace':
+ value = value.decode(errors='replace')
+ else:
+ value = value.decode(encoding=fallbackEncoding)
decoded_entry[key] = value
# Parse out data if it's an error response
if decoded_entry.get('code') == 'error' and 'data' in decoded_entry:
diff --git a/t/t9836-git-p4-config-fallback-encoding.sh b/t/t9836-git-p4-config-fallback-encoding.sh
new file mode 100755
index 0000000000..901bb3759d
--- /dev/null
+++ b/t/t9836-git-p4-config-fallback-encoding.sh
@@ -0,0 +1,98 @@
+#!/bin/sh
+
+test_description='test git-p4.fallbackEncoding config'
+
+GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
+export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
+
+. ./lib-git-p4.sh
+
+test_expect_success 'start p4d' '
+ start_p4d
+'
+
+test_expect_success 'add Unicode description' '
+ cd "$cli" &&
+ echo file1 >file1 &&
+ p4 add file1 &&
+ p4 submit -d documentación
+'
+
+# Unicode descriptions cause "git p4 clone" to crash with a UnicodeDecodeError in some
+# environments. This test determines if that is the case in our environment. If so,
+# we create a file called "clone_fails". In subsequent tests, we check whether that
+# file exists to determine what behavior to expect.
+
+clone_fails="$TRASH_DIRECTORY/clone_fails"
+
+# If clone fails with git-p4.fallbackEncoding set to "none", create the "clone_fails" file,
+# and make sure the error message is correct
+
+test_expect_success 'clone with git-p4.fallbackEncoding set to "none"' '
+ git config --global git-p4.fallbackEncoding none &&
+ test_when_finished cleanup_git && {
+ git p4 clone --dest="$git" //depot@all 2>error || (
+ >"$clone_fails" &&
+ grep "UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding" error
+ )
+ }
+'
+
+# If clone fails with git-p4.fallbackEncoding set to "none", it should also fail when it's unset,
+# also with the correct error message. Otherwise the clone should succeed.
+
+test_expect_success 'clone with git-p4.fallbackEncoding unset' '
+ git config --global --unset git-p4.fallbackEncoding &&
+ test_when_finished cleanup_git && {
+ (
+ test -f "$clone_fails" &&
+ test_must_fail git p4 clone --dest="$git" //depot@all 2>error &&
+ grep "UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding" error
+ ) ||
+ (
+ ! test -f "$clone_fails" &&
+ git p4 clone --dest="$git" //depot@all 2>error
+ )
+ }
+'
+
+# Whether or not "clone_fails" exists, setting git-p4.fallbackEncoding
+# to "cp1252" should cause clone to succeed and get the right description
+
+test_expect_success 'clone with git-p4.fallbackEncoding set to "cp1252"' '
+ git config --global git-p4.fallbackEncoding cp1252 &&
+ test_when_finished cleanup_git &&
+ (
+ git p4 clone --dest="$git" //depot@all &&
+ cd "$git" &&
+ git log --oneline >log &&
+ desc=$(head -1 log | cut -d" " -f2) &&
+ test "$desc" = "documentación"
+ )
+'
+
+# Setting git-p4.fallbackEncoding to "replace" should always cause clone to succeed.
+# If "clone_fails" exists, the description should contain the Unicode replacement
+# character, otherwise the description should be correct (since we're on a system that
+# doesn't have the Unicode issue)
+
+test_expect_success 'clone with git-p4.fallbackEncoding set to "replace"' '
+ git config --global git-p4.fallbackEncoding replace &&
+ test_when_finished cleanup_git &&
+ (
+ git p4 clone --dest="$git" //depot@all &&
+ cd "$git" &&
+ git log --oneline >log &&
+ desc=$(head -1 log | cut -d" " -f2) &&
+ {
+ (test -f "$clone_fails" &&
+ test "$desc" = "documentaci�n"
+ ) ||
+ (! test -f "$clone_fails" &&
+ test "$desc" = "documentación"
+ )
+ }
+ )
+'
+
+test_done