Re: git-p4 crashes on non UTF-8 output from p4

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hej Tzadik,

Let's start with a side note:
This mailing list doesn't use top-posting, everything new is at the end,
so I moved everything there.

And removed some of the old stuff.

> On Sun, Apr 11, 2021 at 2:37 AM Torsten Bögershausen <tboegi@xxxxxx> wrote:
> >
> > On Sun, Apr 11, 2021 at 12:16:25AM -0700, Tzadik Vanderhoof wrote:
> > > Here is the pull request:
> >
> > Thanks for the work. Some comments inline.
> >
> > >
> > > From 8d234af842223dceae76ce0affd3bbb3f17bb6d9 Mon Sep 17 00:00:00 2001
> > > From: Tzadik Vanderhoof <tzadik.vanderhoof@xxxxxxxxx>
> > > Date: Sat, 10 Apr 2021 22:41:39 -0700
> >
> >
> > The subject should be one short line, highlighting what this is all about,
> > followed by a blank line and a longer description about the problem and
> > the solution. The original description was good, see below.
> >
> > > Subject: [PATCH] add git-p4.fallbackEncoding config variable, to prevent
> > >  git-p4 from crashing on non UTF-8 changeset descriptions
> >
> > In that sense I make a first trial here, subject for improvements:
> >
> >
> > Subject: [PATCH] Add git-p4.fallbackEncoding config variable
> >
> > When git-p4 reads the output from a p4 command, it assumes it will be
> > 100% UTF-8. If even one character in the output of one p4 command is
> > not UTF-8, git-p4 crashes e.g. with:
> >
> > File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList
> >     value = value.decode() UnicodeDecodeError: 'utf-8' codec can't
> > decode byte Ox93 in position 42: invalid start byte
> >
> > Allow to try another encoding (eg cp1252) and/or use the
> > Unicode replacement character  to prevent the whole program from crashing
> > on such a "minor" problem.
> >
> > This is especially a problem on the "git p4 clone" command with @all,
> > where git-p4 needs to read thousands of changeset descriptions, one of
> > which may have a stray smart quote, causing the whole clone operation
> > to fail.
> >
> > Introduce "git-p4.fallbackEncoding" to handle non UTF-8 encodings, if needed.
> >
> > >
> > > ---
> > >  Documentation/git-p4.txt | 10 ++++++++++
> > >  git-p4.py                | 11 ++++++++++-
> > >  2 files changed, 20 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt
> > > index f89e68b..71f3487 100644
> > > --- a/Documentation/git-p4.txt
> > > +++ b/Documentation/git-p4.txt
> > > @@ -638,6 +638,16 @@ git-p4.pathEncoding::
> > >   to transcode the paths to UTF-8. As an example, Perforce on Windows
> > >   often uses "cp1252" to encode path names.
> > >
> > > +git-p4.fallbackEncoding::
> > > +    Perforce changeset descriptions can be in a mixture of encodings. Git-p4
> > > +    first tries to interpret each description as UTF-8. If that fails, this
> > > +    config allows another encoding to be tried.  The default is "cp1252".  You
> >
> > I know that cp1252 is attractive to be used, especially for Windows installations that
> > use Latin-based "characters".
> > But: If we introduce a new config-variable into Git, the default tends to be
> > "if not set to anything, behave as the old Git".
> >
> > > +    can set it to another encoding, for example, "iso-8859-5". If instead of
> > ISO-8859-5 may be more portable on the different i18 implementations
> > than the lower-case spelling.
> >
> > > +    an encoding, you specify "replace", UTF-8 will be used, with invalid UTF-8
> > > +    characters replaced by the Unicode replacement character. If you specify
> > > +    "none", there is no fallback, and any non UTF-8 character will cause
> > > +    git-p4 to immediately fail.
> >
> > As said, before, many people may expect Git to fail, so that the default should be
> > none to avoid surprises.
> > When a "non-UTF-8-clean" repo is handled, they want to know it.
> >
> > > +
> > >  git-p4.largeFileSystem::
> > >   Specify the system that is used for large (binary) files. Please note
> > >   that large file systems do not support the 'git p4 submit' command.
> > > diff --git a/git-p4.py b/git-p4.py
> > > index 09c9e93..18d02b4 100755
> > > --- a/git-p4.py
> > > +++ b/git-p4.py
> > > @@ -771,7 +771,16 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b',
> > > cb=None, skip_info=False,
> > >                  for key, value in entry.items():
> > >                      key = key.decode()
> > >                      if isinstance(value, bytes) and not (key in
> > > ('data', 'path', 'clientFile') or key.startswith('depotFile')):
> > > -                        value = value.decode()
> > > +                        try:
> > > +                            value = value.decode()
> > > +                        except:
> > > +                            fallbackEncoding =
> > > gitConfig("git-p4.fallbackEncoding").lower() or 'cp1252'
> > > +                            if fallbackEncoding == 'none':
> > > +                                raise
> >
> > Would it make sense to tell the user about the new config value here?
> >  raise Exception("Non UTF-8 detected. See git-p4.fallbackEncoding"
> > Or somewhat in that style ?
> >
> > > +                            elif fallbackEncoding == 'replace':
> > > +                                value = value.decode(errors='replace')
> > > +                            else:
> > > +                                value = value.decode(encoding=fallbackEncoding)
> > >                      decoded_entry[key] = value
> > >                  # Parse out data if it's an error response
> > >                  if decoded_entry.get('code') == 'error' and 'data' in
> > > decoded_entry:
> >
> >
> > Did I miss the Signed-off-by here?
> >
> > Please have a look here:
> > https://git-scm.com/docs/SubmittingPatches
> >
> > (or look at Documentation/SubmittingPatches in your git source code)
> >
> > And finally: Thanks for the contribution.
> > Is there any chance to add test-cases, to make sure that this feature
> > is well-tested now and in the future ?
> >
[snip]


On Sun, Apr 11, 2021 at 01:21:47PM -0700, Tzadik Vanderhoof wrote:
> Thank you for your excellent and friendly feedback!
>
> I understand everything you said, but I have a question about the unit
> test you requested.  The git-p4.py script currently does not have
> tests and is not written in a way that would be testable.  (The Python
> function I modified calls into the shell and requires a valid Perforce
> installation.)

I am not really sure about that (there are no test cases).
A valid Perforce installation is needed, yes. Otherwise the p4 tests are skipped.
There are a lot of p4 tests under t/t98..p4....sh

git show a8b05162e894b88aeb7d5064dab

Tells me e.g. that both git-p4.py and, in this very commit,
t/t9822-git-p4-path-encoding.sh have been changed.

All the test are t98xx-git-p4-something.sh (fiund under t),
and you new testcase may be named something like

t9835-git-p4-fallbackEncoding.sh

>
> Would you prefer I  a) refactor the code to be testable and then write
> tests  or b) skip the unit testing (not sure if there are any further
> options)?
>
> (For option a) I would break out the part of the function I modified
> into another function and then call my new function for my testing.
> (I guess it would be better to break the refactoring and my changes
> into 2 separate commits.)
>

Probably not. Typically we can construct a test case first,
and see that ot fails (in you local test-running).
After updating git-p4.py with your improvments the new test should pass.
And of course all the other p4 tests.

There is a whole bunch of "CI tests", run on Linux, MacOs, Windows.
One example of a test run is here, and the "regular (linux-clang,...)
is running the p4 tests:

https://github.com/git/git/runs/2309995044?check_suite_focus=true




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux