Re: [PATCH 0/6] Transition git-p4.py to support Python 3 only

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Fri, 10 Dec 2021 12:30:01 +0100

On Fri, Dec 10 2021, Joel Holdsworth wrote:

>> The commit messages could just really use some extra hand-holding and
>> explanation, and a clear split-out of things related to the version bump v.s.
>> things not needed for that, or unrelated refactorings.
>
> Yes, I am getting this message loud and clear. I will resubmit with more detailed commit messages today.

Thanks...

> To explain the story here: I started using git-p4 as part of my
> work-flow, and I expect to need it for several years to come. As I
> began to use it, I found that a series of bugs - mostly related to
> character encoding. In fixing these, I found that some of the troubles
> were specific to Python 3 - or rather Python 2's less strict approach
> to distinguishing between byte sequences and textual strings allowed
> the script to proceed Python 2 even though what it was doing was in
> fact invalid, and was potentially corrupting data.
>
> A common problem that users are encountering is that the script
> attempts to decode incoming textual byte-streams into UTF-8
> strings. On Python 3 this fails with an exception if the data contains
> invalid UTF-8 codes. For text files created in Windows, CP1252 Smart
> Quote characters: 0x93 and 0x94 are seen fairly frequently. These
> codes are invalid in UTF-8, so if the script encounters any file or
> file name containing them, it will fail with an exception.
>
> Tzadik Vanderhoof submitted a patch attempting to fix some of these issues in April 2021:
> https://lore.kernel.org/git/20210429073905.837-1-tzadik.vanderhoof@xxxxxxxxx/
>
> My two comments about this patch are that 1. It doesn't fix my issue, and 2. Even with the proposed fallbackEncoding option it still leaves git-p4 broken by default.
>
> A fallbackEncoding option may still be necessary, but I found that most of the issues I encountered could be side-stepped by simply avoiding decoding incoming data into UTF-8 in the first place.
>
> Keeping a clean separation between encoded and decoded text is much
> easier to do in Python 3. If Python 2 support must be maintained, this
> will require careful testing of separate code-paths both platforms
> which I don't regard as reasonable given that Python 2 is thoroughly
> deprecated. Therefore, this first patch-set focusses primarily on
> removing Python 2 support.

This all makes perfect sense to me (having never used git-p4). Having
this sort of explanation be part of the relevant commit message would be
great :)

I haven't worked extensively with Python myself, but I've understood
that its Unicode support was a big pain before v3 as you describe, which
is just the sort of thing that would justify a version prereq bump, even
if it's a bit painful to some users on older systems (if even that,
maybe everyone's upgraded already...)

> Furthermore, because I expect to be using git-p4 in my daily work-flow
> for some time to come, I am interested in investing some effort into
> improving it. There are bugs, unreliable behaviour, user-hostile
> behaviour, as well as code that would benefit from clean-up and
> modernisation. In submitting these patches, I am trying to get a read
> on to what extent such efforts would be accepted by the Git
> maintainers.

I don't think there's any reason we wouldn't accept these sorts of
changes.

The comment from me (and others I see) is purely on the topic of making
them easier to review, i.e. splitting out "this is for a version
upgrade" v.s. "this is just better Python style" or whatever.

> Is it preferable that patch-sets have a tight focus on a single topic?
> I am already dividing up my full patch collection. I can divide it
> further if requested. I am happy to do this, I was just worried that
> it just might make longer to get all my patches through review.

Yeah this project really prefers to do it that way. A good example is
this recent 19-part series:
https://lore.kernel.org/git/20211210095757.gu7w4n2rqulx2dvg@fs/T/#m5d9e8180551907578d56cdd6cd6244b9df6b59d5

This would probably be 1-3 patches, or even 1 patch in some other
projects, but especially with repetitive formatting changes I think it's
fair to say that we prefer for them to be split up closer to that,
i.e. one commit with some %s -> {} formatting change explaining why
(probably just style, preferenc) etc.

There's also the option of splitting things into different patch
series. I'd say if you e.g. have one series of "we're dropping python 2
support" and another "here's nice formatting changes" it would be nice
to split those into two different patch serieses.

But that's always a matter of taste & how easy it is. If they
extensively textually conflict it's often worth it to just stack them
together, or if they changes are all small/easy enough to review some
"while we're at it..." changes are generally fine.

Finally, for a re-submission it's also nice to find people who've worked
on the relevant code (with some fuzzing for "is this person likely to
still be active in the project?") and CC them on the series, or in this
case people who've made recent changes to git-p4.py.

>> Some of these changes also just seem to be entirely unrelated refactorings,
>> e.g. 6/6 where you're changing a multi-line commented regexp into
>> something that's a dense one-liner. Does Python 3 not support the
>> equivalent of Perl's /x, or is something else going on here?
>
> I will improve the commit message to explain the changes being made here.
>
> The regexp is matching RCS keywords:
> https://www.perforce.com/manuals/p4guide/Content/P4Guide/filetypes.rcs.html
> - $File$, $Author$, $Author$ etc., a very simple match. We could keep
> it multi-line, though this seems overkill to me.

Sure, my preference in Perl would be to split it, but I'm never going to
be maintaining git-p4.py, so... :)

I.e. I think it's perfectly fair to roll it into some general "this
improves readability" changes, just as long as they're clearly labeled
as such.

> The main significance of this change that previously git-p4 would
> compile one of these two regexes for every single file processed. This
> patch just pre-compiles the two regexes (now binary regexes, not utf-8
> regexes) at the start of the script.

Makes sens, and another thing that would be perfect for pretty much
copy/pasting as-is to a re-rolled commit's message :)