Re: git-p4 crashes on non UTF-8 output from p4

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Thank you for your excellent and friendly feedback!

I understand everything you said, but I have a question about the unit
test you requested.  The git-p4.py script currently does not have
tests and is not written in a way that would be testable.  (The Python
function I modified calls into the shell and requires a valid Perforce
installation.)

Would you prefer I  a) refactor the code to be testable and then write
tests  or b) skip the unit testing (not sure if there are any further
options)?

(For option a) I would break out the part of the function I modified
into another function and then call my new function for my testing.
(I guess it would be better to break the refactoring and my changes
into 2 separate commits.)

On Sun, Apr 11, 2021 at 2:37 AM Torsten Bögershausen <tboegi@xxxxxx> wrote:
>
> On Sun, Apr 11, 2021 at 12:16:25AM -0700, Tzadik Vanderhoof wrote:
> > Here is the pull request:
>
> Thanks for the work. Some comments inline.
>
> >
> > From 8d234af842223dceae76ce0affd3bbb3f17bb6d9 Mon Sep 17 00:00:00 2001
> > From: Tzadik Vanderhoof <tzadik.vanderhoof@xxxxxxxxx>
> > Date: Sat, 10 Apr 2021 22:41:39 -0700
>
>
> The subject should be one short line, highlighting what this is all about,
> followed by a blank line and a longer description about the problem and
> the solution. The original description was good, see below.
>
> > Subject: [PATCH] add git-p4.fallbackEncoding config variable, to prevent
> >  git-p4 from crashing on non UTF-8 changeset descriptions
>
> In that sense I make a first trial here, subject for improvements:
>
>
> Subject: [PATCH] Add git-p4.fallbackEncoding config variable
>
> When git-p4 reads the output from a p4 command, it assumes it will be
> 100% UTF-8. If even one character in the output of one p4 command is
> not UTF-8, git-p4 crashes e.g. with:
>
> File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList
>     value = value.decode() UnicodeDecodeError: 'utf-8' codec can't
> decode byte Ox93 in position 42: invalid start byte
>
> Allow to try another encoding (eg cp1252) and/or use the
> Unicode replacement character  to prevent the whole program from crashing
> on such a "minor" problem.
>
> This is especially a problem on the "git p4 clone" command with @all,
> where git-p4 needs to read thousands of changeset descriptions, one of
> which may have a stray smart quote, causing the whole clone operation
> to fail.
>
> Introduce "git-p4.fallbackEncoding" to handle non UTF-8 encodings, if needed.
>
> >
> > ---
> >  Documentation/git-p4.txt | 10 ++++++++++
> >  git-p4.py                | 11 ++++++++++-
> >  2 files changed, 20 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt
> > index f89e68b..71f3487 100644
> > --- a/Documentation/git-p4.txt
> > +++ b/Documentation/git-p4.txt
> > @@ -638,6 +638,16 @@ git-p4.pathEncoding::
> >   to transcode the paths to UTF-8. As an example, Perforce on Windows
> >   often uses "cp1252" to encode path names.
> >
> > +git-p4.fallbackEncoding::
> > +    Perforce changeset descriptions can be in a mixture of encodings. Git-p4
> > +    first tries to interpret each description as UTF-8. If that fails, this
> > +    config allows another encoding to be tried.  The default is "cp1252".  You
>
> I know that cp1252 is attractive to be used, especially for Windows installations that
> use Latin-based "characters".
> But: If we introduce a new config-variable into Git, the default tends to be
> "if not set to anything, behave as the old Git".
>
> > +    can set it to another encoding, for example, "iso-8859-5". If instead of
> ISO-8859-5 may be more portable on the different i18 implementations
> than the lower-case spelling.
>
> > +    an encoding, you specify "replace", UTF-8 will be used, with invalid UTF-8
> > +    characters replaced by the Unicode replacement character. If you specify
> > +    "none", there is no fallback, and any non UTF-8 character will cause
> > +    git-p4 to immediately fail.
>
> As said, before, many people may expect Git to fail, so that the default should be
> none to avoid surprises.
> When a "non-UTF-8-clean" repo is handled, they want to know it.
>
> > +
> >  git-p4.largeFileSystem::
> >   Specify the system that is used for large (binary) files. Please note
> >   that large file systems do not support the 'git p4 submit' command.
> > diff --git a/git-p4.py b/git-p4.py
> > index 09c9e93..18d02b4 100755
> > --- a/git-p4.py
> > +++ b/git-p4.py
> > @@ -771,7 +771,16 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b',
> > cb=None, skip_info=False,
> >                  for key, value in entry.items():
> >                      key = key.decode()
> >                      if isinstance(value, bytes) and not (key in
> > ('data', 'path', 'clientFile') or key.startswith('depotFile')):
> > -                        value = value.decode()
> > +                        try:
> > +                            value = value.decode()
> > +                        except:
> > +                            fallbackEncoding =
> > gitConfig("git-p4.fallbackEncoding").lower() or 'cp1252'
> > +                            if fallbackEncoding == 'none':
> > +                                raise
>
> Would it make sense to tell the user about the new config value here?
>  raise Exception("Non UTF-8 detected. See git-p4.fallbackEncoding"
> Or somewhat in that style ?
>
> > +                            elif fallbackEncoding == 'replace':
> > +                                value = value.decode(errors='replace')
> > +                            else:
> > +                                value = value.decode(encoding=fallbackEncoding)
> >                      decoded_entry[key] = value
> >                  # Parse out data if it's an error response
> >                  if decoded_entry.get('code') == 'error' and 'data' in
> > decoded_entry:
>
>
> Did I miss the Signed-off-by here?
>
> Please have a look here:
> https://git-scm.com/docs/SubmittingPatches
>
> (or look at Documentation/SubmittingPatches in your git source code)
>
> And finally: Thanks for the contribution.
> Is there any chance to add test-cases, to make sure that this feature
> is well-tested now and in the future ?
>
>
> > --
> > 2.31.1.windows.1
> >
> > On Fri, Apr 9, 2021 at 8:38 AM Torsten Bögershausen <tboegi@xxxxxx> wrote:
> > >
> > > On Thu, Apr 08, 2021 at 12:28:25PM -0700, Tzadik Vanderhoof wrote:
> > > > When git-p4 reads the output from a p4 command, it assumes it will be
> > > > 100% UTF-8. If even one character in the output of one p4 command is
> > > > not UTF-8, git-p4 crashes with:
> > > >
> > > > File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList
> > > >     value = value.decode() UnicodeDecodeError: 'utf-8' codec can't
> > > > decode byte Ox93 in position 42: invalid start byte
> > > >
> > > > I'd like to make a pull request to have it try another encoding (eg
> > > > cp1252) and/or use the Unicode replacement character, to prevent the
> > > > whole program from crashing on such a minor problem.
> > > >
> > > > This is especially a problem on the "git p4 clone" command with @all,
> > > > where git-p4 needs to read thousands of changeset descriptions, one of
> > > > which may have a stray smart quote, causing the whole clone operation
> > > > to fail.
> > > >
> > > > Sound ok?
> > >
> > > Welcome to the Git community.
> > > To start with: I am not a git-p4 expert as such, but seeing that a program is crashing
> > > is never a good thing.
> > > All efforts to prevent the crash are a step forward.
> > >
> > > As you mention cp1252 (which is more used under Windows), there are probably lots of
> > > system out there which use ISO-8859-15 (or ISO-8859-1) we may have the first whish:
> > >
> > > Make the encoding/fallback configurable.
> > > Let people choose if they want a crash (if things are broken),
> > > fallback to cp1252 or one of the other ISO-ISO-8859-x encodings.
> > >
> > > In that sense: we look forward to a pull-request.
> >
> >
> >
> > --
> > Tzadik



-- 
Tzadik




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux