Re: git-p4 crashes on non UTF-8 output from p4

Tzadik Vanderhoof <tzadik.vanderhoof@xxxxxxxxx> · Sun, 11 Apr 2021 00:16:25 -0700

Here is the pull request:

From 8d234af842223dceae76ce0affd3bbb3f17bb6d9 Mon Sep 17 00:00:00 2001
From: Tzadik Vanderhoof <tzadik.vanderhoof@xxxxxxxxx>
Date: Sat, 10 Apr 2021 22:41:39 -0700
Subject: [PATCH] add git-p4.fallbackEncoding config variable, to prevent
 git-p4 from crashing on non UTF-8 changeset descriptions

---
 Documentation/git-p4.txt | 10 ++++++++++
 git-p4.py                | 11 ++++++++++-
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt
index f89e68b..71f3487 100644
--- a/Documentation/git-p4.txt
+++ b/Documentation/git-p4.txt
@@ -638,6 +638,16 @@ git-p4.pathEncoding::
  to transcode the paths to UTF-8. As an example, Perforce on Windows
  often uses "cp1252" to encode path names.

+git-p4.fallbackEncoding::
+    Perforce changeset descriptions can be in a mixture of encodings. Git-p4
+    first tries to interpret each description as UTF-8. If that fails, this
+    config allows another encoding to be tried.  The default is "cp1252".  You
+    can set it to another encoding, for example, "iso-8859-5". If instead of
+    an encoding, you specify "replace", UTF-8 will be used, with invalid UTF-8
+    characters replaced by the Unicode replacement character. If you specify
+    "none", there is no fallback, and any non UTF-8 character will cause
+    git-p4 to immediately fail.
+
 git-p4.largeFileSystem::
  Specify the system that is used for large (binary) files. Please note
  that large file systems do not support the 'git p4 submit' command.
diff --git a/git-p4.py b/git-p4.py
index 09c9e93..18d02b4 100755
--- a/git-p4.py
+++ b/git-p4.py
@@ -771,7 +771,16 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False,
                 for key, value in entry.items():
                     key = key.decode()
                     if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')):
-                        value = value.decode()
+                        try:
+                            value = value.decode()
+                        except:
+                            fallbackEncoding = gitConfig("git-p4.fallbackEncoding").lower() or 'cp1252'
+                            if fallbackEncoding == 'none':
+                                raise
+                            elif fallbackEncoding == 'replace':
+                                value = value.decode(errors='replace')
+                            else:
+                                value = value.decode(encoding=fallbackEncoding)
                     decoded_entry[key] = value
                 # Parse out data if it's an error response
                 if decoded_entry.get('code') == 'error' and 'data' in decoded_entry:
-- 
2.31.1.windows.1
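
For anyone who wants to try the fallback behavior outside of git-p4, here is a
minimal standalone sketch of the same decision logic. The function name
decode_with_fallback and its fallback_encoding parameter are hypothetical (they
stand in for the git-p4.fallbackEncoding lookup and are not part of the patch),
and unlike the patch this sketch catches only UnicodeDecodeError rather than
using a bare except.

```python
def decode_with_fallback(value, fallback_encoding="cp1252"):
    """Decode bytes as UTF-8, applying a fallback policy if that fails.

    fallback_encoding mirrors the values described for the proposed
    git-p4.fallbackEncoding setting (hypothetical stand-in for gitConfig):
      "none"    -> re-raise the original UnicodeDecodeError
      "replace" -> decode as UTF-8, substituting U+FFFD for bad bytes
      other     -> treated as a codec name, e.g. "cp1252" or "iso-8859-5"
    """
    try:
        return value.decode("utf-8")
    except UnicodeDecodeError:
        # An empty setting falls back to the default, as in the patch.
        policy = (fallback_encoding or "cp1252").lower()
        if policy == "none":
            raise
        if policy == "replace":
            return value.decode("utf-8", errors="replace")
        return value.decode(policy)


if __name__ == "__main__":
    # A description containing cp1252 "smart quotes" (bytes 0x93 and 0x94).
    raw = b"fix the \x93smart quote\x94 bug"
    # ascii() keeps the output printable on any console encoding.
    print(ascii(decode_with_fallback(raw)))
    print(ascii(decode_with_fallback(raw, "replace")))
```

Note that if the fallback codec itself cannot decode the bytes (or the codec
name is unknown), the second decode still raises, so this only narrows the
failure cases rather than eliminating them.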

On Fri, Apr 9, 2021 at 8:38 AM Torsten Bögershausen <tboegi@xxxxxx> wrote:
>
> On Thu, Apr 08, 2021 at 12:28:25PM -0700, Tzadik Vanderhoof wrote:
> > When git-p4 reads the output from a p4 command, it assumes it will be
> > 100% UTF-8. If even one character in the output of one p4 command is
> > not UTF-8, git-p4 crashes with:
> >
> > File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList
> >     value = value.decode()
> > UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 42: invalid start byte
> >
> > I'd like to make a pull request to have it try another encoding (e.g.
> > cp1252) and/or use the Unicode replacement character, to prevent the
> > whole program from crashing on such a minor problem.
> >
> > This is especially a problem on the "git p4 clone" command with @all,
> > where git-p4 needs to read thousands of changeset descriptions, one of
> > which may have a stray smart quote, causing the whole clone operation
> > to fail.
> >
> > Sound ok?
>
> Welcome to the Git community.
> To start with: I am not a git-p4 expert as such, but seeing that a program is crashing
> is never a good thing.
> All efforts to prevent the crash are a step forward.
>
> As you mention cp1252 (which is used more under Windows), there are probably lots of
> systems out there which use ISO-8859-15 (or ISO-8859-1), so we may have a first wish:
>
> Make the encoding/fallback configurable.
> Let people choose whether they want a crash (if things are broken) or a
> fallback to cp1252 or one of the ISO-8859-x encodings.
>
> In that sense: we look forward to a pull-request.
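
To illustrate why the fallback should be configurable rather than hard-coded,
here is a small sketch showing that the same single byte decodes to different
characters under the encodings mentioned in this thread. The helper name
code_points is hypothetical and exists only for this illustration.

```python
def code_points(raw, codecs=("cp1252", "iso-8859-1", "iso-8859-15")):
    """Map each codec name to the code points it produces for raw bytes."""
    return {
        codec: " ".join("U+%04X" % ord(ch) for ch in raw.decode(codec))
        for codec in codecs
    }


if __name__ == "__main__":
    # 0x93 is the byte from the traceback; 0xA4 differs between the ISO sets.
    for raw in (b"\x93", b"\xa4"):
        for codec, points in code_points(raw).items():
            print("%r as %-11s -> %s" % (raw, codec, points))
```

Byte 0x93 is a left double quotation mark in cp1252 but a C1 control character
in the ISO-8859 sets, and 0xA4 is the euro sign only in ISO-8859-15, so the
right fallback depends on where the descriptions were written.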



-- 
Tzadik