Re: [PATCH] git-remote-testpy: fix patch hashing on Python 3

Michael Haggerty <mhagger@xxxxxxxxxxxx> · Mon, 28 Jan 2013 11:44:34 +0100

On 01/27/2013 03:50 PM, John Keeping wrote:
> When this change was originally made (0846b0c - git-remote-testpy: hash
> bytes explicitly), I didn't realise that the "hex" encoding we chose is
> a "bytes to bytes" encoding so it just fails with an error on Python 3
> in the same way as the original code.
> 
> It is not possible to provide a single code path that works on Python 2
> and Python 3 since Python 2.x will attempt to decode the string before
> encoding it, which fails for strings that are not valid in the default
> encoding.  Python 3.1 introduced the "surrogateescape" error handler
> which handles this correctly and permits a bytes -> unicode -> bytes
> round-trip to be lossless.
> 
> At this point Python 3.0 is unsupported so we don't go out of our way to
> try to support it.
> 
> Helped-by: Michael Haggerty <mhagger@xxxxxxxxxxxx>
> Signed-off-by: John Keeping <john@xxxxxxxxxxxxx>
> ---
> On Sun, Jan 27, 2013 at 02:13:29PM +0000, John Keeping wrote:
>> On Sun, Jan 27, 2013 at 05:44:37AM +0100, Michael Haggerty wrote:
>>> So to handle all of the cases across Python versions as closely as
>>> possible to the old 2.x code, it might be necessary to make the code
>>> explicitly depend on the Python version number, like:
>>>
>>>     hasher = _digest()
>>>     if sys.hexversion < 0x03000000:
>>>         pathbytes = repo.path
>>>     elif sys.hexversion < 0x03010000:
>>>         # If support for Python 3.0.x is desired (note: result can
>>>         # be different in this case than under 2.x or 3.1+):
>>>         pathbytes = repo.path.encode(sys.getfilesystemencoding(),
>>>                                      'backslashreplace')
>>>     else:
>>>         pathbytes = repo.path.encode(sys.getfilesystemencoding(),
>>>                                      'surrogateescape')
>>>     hasher.update(pathbytes)
>>>     repo.hash = hasher.hexdigest()
> 
> How about this?
> 
>  git-remote-testpy.py | 18 +++++++++++++++++-
>  1 file changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/git-remote-testpy.py b/git-remote-testpy.py
> index c7a04ec..16b0c52 100644
> --- a/git-remote-testpy.py
> +++ b/git-remote-testpy.py
> @@ -36,6 +36,22 @@ if sys.hexversion < 0x02000000:
>      sys.stderr.write("git-remote-testgit: requires Python 2.0 or later.\n")
>      sys.exit(1)
>  
> +
> +def _encode_filepath(path):
> +    """Encodes a Unicode file path to a byte string.
> +
> +    On Python 2 this is a no-op; on Python 3 we encode the string as
> +    suggested by [1] which allows an exact round-trip from the command line
> +    to the filesystem.
> +
> +    [1] http://docs.python.org/3/c-api/unicode.html#file-system-encoding
> +
> +    """
> +    if sys.hexversion < 0x03000000:
> +        return path
> +    return path.encode('utf-8', 'surrogateescape')
> +
> +
>  def get_repo(alias, url):
>      """Returns a git repository object initialized for usage.
>      """
> @@ -45,7 +61,7 @@ def get_repo(alias, url):
>      repo.get_head()
>  
>      hasher = _digest()
> -    hasher.update(repo.path.encode('hex'))
> +    hasher.update(_encode_filepath(repo.path))
>      repo.hash = hasher.hexdigest()
>  
>      repo.get_base_path = lambda base: os.path.join(
> 

NAK.  It is still not right.  If the locale is not utf-8 based, then it
is incorrect to re-encode the string using utf-8.  I think you really
have to use sys.getfilesystemencoding() as I suggested.
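
Roughly what I have in mind is sketched below.  This is only an
illustration, not a tested replacement for the patch: the name
encode_filepath_fse is made up for this example, and the Python 3 branch
assumes 3.1 or later, where the 'surrogateescape' handler exists.

```python
import sys


def encode_filepath_fse(path):
    """Encode a file path to bytes using the filesystem encoding.

    Hypothetical variant of _encode_filepath() from the patch; the only
    change is the codec passed to encode().
    """
    if sys.hexversion < 0x03000000:
        # Python 2: the path is already a byte string.
        return path
    # Python 3.1+: undo the decoding that Python applied to the command
    # line, using the same codec and error handler.
    return path.encode(sys.getfilesystemencoding(), 'surrogateescape')
```

On Python 3.2 and later, os.fsencode() does essentially the same thing on
POSIX systems, so it may be an option if 3.1 does not need to be supported.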

The attached program demonstrates the problem: the output of re-encoding
using UTF-8 depends on the locale, whereas that of re-encoding using the
filesystemencoding is independent of locale (as we want).  The output,
using Python 3.2.3:

# This is 0xc3 0xb6:
$ ARG="ö"
$ LANG='C' /usr/bin/python3 chaos3.py "$ARG"
LANG = 'C'
fse = 'ascii'
sys.argv[1] = u"U+DCC3 U+DCB6"
re-encoded using UTF-8: b"C3 B6"
re-encoded using fse: b"C3 B6"

$ LANG='C.UTF-8' /usr/bin/python3 chaos3.py "$ARG"
LANG = 'C.UTF-8'
fse = 'utf-8'
sys.argv[1] = u"U+00F6"
re-encoded using UTF-8: b"C3 B6"
re-encoded using fse: b"C3 B6"

$ LANG='en_US.iso88591' /usr/bin/python3 chaos3.py "$ARG"
LANG = 'en_US.iso88591'
fse = 'iso8859-1'
sys.argv[1] = u"U+00C3 U+00B6"
re-encoded using UTF-8: b"C3 83 C2 B6"
re-encoded using fse: b"C3 B6"

Even though the Unicode intermediate representation is different for
UTF-8 and ASCII, re-encoding using the correct encoding gives back the
original bytes (which is what we want).  But when using the iso8859-1
locale, the original bytes look like a valid latin1 string so they are
not surrogated going in, giving the incorrect Unicode string u"U+00C3
U+00B6".  When this is re-encoded using UTF-8, the code points U+00C3
and U+00B6 are each encoded as two bytes.
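
The same effect can be reproduced without switching locales by decoding
the raw bytes by hand.  This is a minimal sketch: round_trip is a
hypothetical helper (it is not part of chaos3.py or of the patch), and the
codec names simply stand in for what sys.getfilesystemencoding() reported
under each of the three LANG settings.

```python
# The two bytes that the shell passes for "ö" in the transcript.
ORIGINAL = b"\xc3\xb6"


def round_trip(raw, locale_encoding, reencode_with):
    """Decode raw as Python 3 would under locale_encoding, then re-encode."""
    text = raw.decode(locale_encoding, "surrogateescape")
    return text.encode(reencode_with, "surrogateescape")


for enc in ("ascii", "utf-8", "iso8859-1"):
    via_utf8 = round_trip(ORIGINAL, enc, "utf-8")  # what the patch does
    via_fse = round_trip(ORIGINAL, enc, enc)       # what I am suggesting
    print("%-10s utf-8 ok: %-5s fse ok: %s"
          % (enc, via_utf8 == ORIGINAL, via_fse == ORIGINAL))
```

Only the iso8859-1 case differs: re-encoding with the same codec restores
the original two bytes, while re-encoding with UTF-8 yields four.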

Michael

-- 
Michael Haggerty
mhagger@xxxxxxxxxxxx
http://softwareswirl.blogspot.com/
#! /usr/bin/python3

import sys
import os

def explicit(s):
    """Convert a string or bytestring into an unambiguous human-readable string."""

    if isinstance(s, str):
        return 'u"%s"' % (' '.join('U+%04X' % (ord(c),) for c in s))
    else:
        return 'b"%s"' % (' '.join('%02X' % (c,) for c in s))

fse = sys.getfilesystemencoding()

print('LANG = %r' % (os.getenv('LANG'),))
print('fse = %r' % (fse,))
print('sys.argv[1] = %s' % explicit(sys.argv[1]))
print('re-encoded using UTF-8: %s' % explicit(sys.argv[1].encode('utf-8', 'surrogateescape')))
print('re-encoded using fse: %s' % explicit(sys.argv[1].encode(fse, 'surrogateescape')))
print()