On 01/27/2013 03:50 PM, John Keeping wrote:
> When this change was originally made (0846b0c - git-remote-testpy: hash
> bytes explicitly), I didn't realise that the "hex" encoding we chose is
> a "bytes to bytes" encoding so it just fails with an error on Python 3
> in the same way as the original code.
>
> It is not possible to provide a single code path that works on Python 2
> and Python 3 since Python 2.x will attempt to decode the string before
> encoding it, which fails for strings that are not valid in the default
> encoding.  Python 3.1 introduced the "surrogateescape" error handler
> which handles this correctly and permits a bytes -> unicode -> bytes
> round-trip to be lossless.
>
> At this point Python 3.0 is unsupported so we don't go out of our way to
> try to support it.
>
> Helped-by: Michael Haggerty <mhagger@xxxxxxxxxxxx>
> Signed-off-by: John Keeping <john@xxxxxxxxxxxxx>
> ---
> On Sun, Jan 27, 2013 at 02:13:29PM +0000, John Keeping wrote:
>> On Sun, Jan 27, 2013 at 05:44:37AM +0100, Michael Haggerty wrote:
>>> So to handle all of the cases across Python versions as closely as
>>> possible to the old 2.x code, it might be necessary to make the code
>>> explicitly depend on the Python version number, like:
>>>
>>>     hasher = _digest()
>>>     if sys.hexversion < 0x03000000:
>>>         pathbytes = repo.path
>>>     elif sys.hexversion < 0x03010000:
>>>         # If support for Python 3.0.x is desired (note: result can
>>>         # be different in this case than under 2.x or 3.1+):
>>>         pathbytes = repo.path.encode(sys.getfilesystemencoding(),
>>>                                      'backslashreplace')
>>>     else:
>>>         pathbytes = repo.path.encode(sys.getfilesystemencoding(),
>>>                                      'surrogateescape')
>>>     hasher.update(pathbytes)
>>>     repo.hash = hasher.hexdigest()
>
> How about this?
>
>  git-remote-testpy.py | 18 +++++++++++++++++-
>  1 file changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/git-remote-testpy.py b/git-remote-testpy.py
> index c7a04ec..16b0c52 100644
> --- a/git-remote-testpy.py
> +++ b/git-remote-testpy.py
> @@ -36,6 +36,22 @@ if sys.hexversion < 0x02000000:
>      sys.stderr.write("git-remote-testgit: requires Python 2.0 or later.\n")
>      sys.exit(1)
>
> +
> +def _encode_filepath(path):
> +    """Encodes a Unicode file path to a byte string.
> +
> +    On Python 2 this is a no-op; on Python 3 we encode the string as
> +    suggested by [1] which allows an exact round-trip from the command line
> +    to the filesystem.
> +
> +    [1] http://docs.python.org/3/c-api/unicode.html#file-system-encoding
> +
> +    """
> +    if sys.hexversion < 0x03000000:
> +        return path
> +    return path.encode('utf-8', 'surrogateescape')
> +
> +
>  def get_repo(alias, url):
>      """Returns a git repository object initialized for usage.
>      """
> @@ -45,7 +61,7 @@ def get_repo(alias, url):
>      repo.get_head()
>
>      hasher = _digest()
> -    hasher.update(repo.path.encode('hex'))
> +    hasher.update(_encode_filepath(repo.path))
>      repo.hash = hasher.hexdigest()
>
>      repo.get_base_path = lambda base: os.path.join(
>

NAK.  It is still not right.  If the locale is not utf-8 based, then it
is incorrect to re-encode the string using utf-8.  I think you really
have to use sys.getfilesystemencoding() as I suggested.

The attached program demonstrates the problem: the output of re-encoding
using UTF-8 depends on the locale, whereas that of re-encoding using the
filesystemencoding is independent of locale (as we want).
The output, using Python 3.2.3:

    # This is 0xc3 0xb6:
    $ ARG="ö"

    $ LANG='C' /usr/bin/python3 chaos3.py "$ARG"
    LANG = 'C'
    fse = 'ascii'
    sys.argv[1] = u"U+DCC3 U+DCB6"
    re-encoded using UTF-8: b"C3 B6"
    re-encoded using fse: b"C3 B6"

    $ LANG='C.UTF-8' /usr/bin/python3 chaos3.py "$ARG"
    LANG = 'C.UTF-8'
    fse = 'utf-8'
    sys.argv[1] = u"U+00F6"
    re-encoded using UTF-8: b"C3 B6"
    re-encoded using fse: b"C3 B6"

    $ LANG='en_US.iso88591' /usr/bin/python3 chaos3.py "$ARG"
    LANG = 'en_US.iso88591'
    fse = 'iso8859-1'
    sys.argv[1] = u"U+00C3 U+00B6"
    re-encoded using UTF-8: b"C3 83 C2 B6"
    re-encoded using fse: b"C3 B6"

Even though the Unicode intermediate representation is different for
UTF-8 and ASCII, re-encoding using the correct encoding gives back the
original bytes (which is what we want).

But when using the iso8859-1 locale, the original bytes look like a
valid latin1 string so they are not surrogated going in, giving the
incorrect Unicode string u"U+00C3 U+00B6".  When this is re-encoded
using UTF-8, the code points U+00C3 and U+00B6 are each encoded as two
bytes, so the result is C3 83 C2 B6 rather than the original C3 B6.

Michael

--
Michael Haggerty
mhagger@xxxxxxxxxxxx
http://softwareswirl.blogspot.com/
#! /usr/bin/python3

import sys
import os


def explicit(s):
    """Convert a string or bytestring into an unambiguous human-readable string."""

    if isinstance(s, str):
        return 'u"%s"' % (' '.join('U+%04X' % (ord(c),) for c in s))
    else:
        return 'b"%s"' % (' '.join('%02X' % (c,) for c in s))


fse = sys.getfilesystemencoding()

print('LANG = %r' % (os.getenv('LANG'),))
print('fse = %r' % (fse,))
print('sys.argv[1] = %s' % explicit(sys.argv[1]))
print('re-encoded using UTF-8: %s' % explicit(sys.argv[1].encode('utf-8', 'surrogateescape')))
print('re-encoded using fse: %s' % explicit(sys.argv[1].encode(fse, 'surrogateescape')))
print()
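
The same effect can be reproduced without switching locales.  The
following is only a sketch for Python 3.1 or later: it assumes that
decoding the raw bytes with a given codec and the 'surrogateescape'
handler is a fair stand-in for how Python 3 builds sys.argv under the
corresponding locale, and the helper name roundtrip() is made up for
this illustration and is not part of git-remote-testpy.

```python
def roundtrip(raw, decode_with, encode_with):
    """Hypothetical helper: decode raw bytes as Python 3 would decode a
    command-line argument, then re-encode the resulting string."""
    text = raw.decode(decode_with, 'surrogateescape')
    return text.encode(encode_with, 'surrogateescape')


raw = b'\xc3\xb6'  # the UTF-8 bytes for "ö"

# Pretend each codec in turn is the filesystem encoding.
for fse in ('ascii', 'utf-8', 'iso8859-1'):
    using_fse = roundtrip(raw, fse, fse)       # re-encode with the same codec
    using_utf8 = roundtrip(raw, fse, 'utf-8')  # always re-encode with UTF-8
    print('%-10s fse: %r  utf-8: %r' % (fse, using_fse, using_utf8))
```

Re-encoding with the codec that was used for decoding returns the
original two bytes in all three cases, whereas re-encoding with UTF-8
changes them in the iso8859-1 case.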