Re: [PATCH v4 00/11] git-p4.py: Cast byte strings to unicode strings in python3

Ben Keene <seraphire@xxxxxxxxx> · Thu, 5 Dec 2019 11:16:27 -0500

On 12/5/2019 4:54 AM, Luke Diamand wrote:
On Wed, 4 Dec 2019 at 22:29, Ben Keene via GitGitGadget
<gitgitgadget@xxxxxxxxx> wrote:
Issue: The current git-p4.py script does not work with python3.

I have attempted to use the P4 integration built into GIT and I was unable
to get the program to run because I have Python 3.8 installed on my
computer. I was able to get the program to run when I downgraded my python
to version 2.7. However, python 2 is reaching its end of life.

Submission: I am submitting a patch for the git-p4.py script that partially
supports python 3.8. This code was able to pass the basic tests (t9800) when
run against Python3. This provides basic functionality.

In an attempt to pass the t9822 P4 path-encoding test, a new parameter for
git P4 Clone was introduced.

--encoding Format-identifier

This will create the GIT repository following the current functionality;
however, before importing the files from P4, it will set the
git-p4.pathEncoding option so any files or paths that are encoded with
non-ASCII/non-UTF-8 formats will import correctly.

Technical details: The script was updated by futurize (
https://python-future.org/futurize.html) to support Py2/Py3 syntax. The few
references to classes in future were reworked so that future would not be
required. The existing code test for Unicode support was extended to
normalize the classes “unicode” and “bytes” to across platforms:

  * ‘unicode’ is an alias for ‘str’ in Py3 and is the unicode class in Py2.
  * ‘bytes’ is bytes in Py3 and an alias for ‘str’ in Py2.

New coercion methods were written for both Python2 and Python3:

  * as_string(text) – In Python3, this encodes a bytes object as a UTF-8
    encoded Unicode string.
  * as_bytes(text) – In Python3, this decodes a Unicode string to an array of
    bytes.

In Python2, these functions do not change the data since a ‘str’ object
function in both roles as strings and byte arrays. This reduces the
potential impact on backward compatibility with Python 2.

  * to_unicode(text) – ensures that the supplied data is encoded as a UTF-8
    string. This function will encode data in both Python2 and Python3. *
       path_as_string(path) – This function is an extension function that
       honors the option “git-p4.pathEncoding” to convert a set of bytes or
       characters to UTF-8. If the str/bytes cannot decode as ASCII, it will
       use the encodeWithUTF8() method to convert the custom encoded bytes to
       Unicode in UTF-8.

Generally speaking, information in the script is converted to Unicode as
early as possible and converted back to a byte array just before passing to
external programs or files. The exception to this rule is P4 Repository file
paths.

Paths are not converted but left as “bytes” so the original file path
encoding can be preserved. This formatting is required for commands that
interact with the P4 file path. When the file path is used by GIT, it is
converted with encodeWithUTF8().

Almost all the tests pass now - nice!

(There's one test that fails for me, t9830-git-p4-symlink-dir.sh).

Which version of Python are running the failing test against?  I run it 
against Python 2.7 and it passes the test. I don't expect all Python 3.x 
tests to pass yet, just t9800.

Nitpicking:

- There are some bits of trailing whitespace around - can you strip
those out? You can use "git diff --check".

Is there a way that I can find out which branches I need to remove white 
space from now that they have been committed?

- Also I think the convention for git commits is that they be limited
to 72 (?) characters.

I'm going through all my commits and fixing them.

- In 10dc commit message, s/behvior/behavior
- Maybe submit 4fc4 as a separate patch series? It doesn't seem
directly related to your python3 changes.

I moved the enhancements to https://github.com/git/git/pull/675

- s/howerver/however/

The comment at line 3261 (showing the fast-import syntax) has wonky
indentation, and needs a space after the '#'.

This code looked like we're duplicating stuff:

+    if isinstance(path, unicode):
+        path = path.replace("%", "%25") \
+                   .replace("*", "%2A") \
+                   .replace("#", "%23") \
+                   .replace("@", "%40")
+    else:
+        path = path.replace(b"%", b"%25") \
+                   .replace(b"*", b"%2A") \
+                   .replace(b"#", b"%23") \
+                   .replace(b"@", b"%40")

I wonder if we can have a helper to do this?

I was just looking at this code block, and at this time, I'm not sure if 
the text coming in will be Unicode or bytes, so I'm hesitant to change 
it until more of the code is converted, but I understand about the 
duplication.

In patchRCSKeywords() you've added code to cleanup outFile. But I
wonder if we could just use a 'finally' block, or a contextexpr ("with
blah as outFile:")

I don't know if it's worth doing now that you've got it going, but at
one point I tried simplifying code like this:

    path_as_string(file['depotFile'])
and
    marshalled[b'data']

by using a dictionary with overloaded operators which would do the
bytes/string conversion automatically. However, your approach isn't
actually _that_ invasive, so maybe this is not necessary.

Looks good though, thanks!
Luke

I toyed with making a class object that would hold the path data and 
have methods to cast to bytes and encodeWithUTF8() and Unicode versions, 
but it quickly got out of hand.