Re: [RFC/PATCH 2/8 v3] git_remote_helpers: fix input when running under Python 3

Pete Wyckoff <pw@xxxxxxxx> · Tue, 15 Jan 2013 19:03:16 -0500

john@xxxxxxxxxxxxx wrote on Tue, 15 Jan 2013 22:40 +0000:
> This is what keeping the refs as byte strings looks like.

As John knows, it is not possible to interpret text from a byte
string without talking about the character encoding.

Git is (largely) a C program and uses the character set defined
in the C standard, which is a subset of ASCII.  But git does
"math" on strings, like this snippet that takes something from
argv[] and prepends "refs/heads/":

    strcpy(refname, "refs/heads/");
    strcpy(refname + strlen("refs/heads/"), ret->name);

The result doesn't talk about what character set it is using,
but because it combines a prefix from ASCII with its input,
git makes the assumption that the input is ASCII-compatible.

If you feed a UTF-16 string in argv, e.g.

    $ echo master | iconv -f ascii -t utf16 | xargs git branch
    xargs: Warning: a NUL character occurred in the input.  It cannot be passed through in the argument list.  Did you mean to use the --null option?
    fatal: Not a valid object name: ''.

you get an error about NUL, and not the branch you hoped for.
Git assumes that the input character set contains roughly ASCII
in byte positions 0..127.
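
To make the NUL problem concrete, here is a rough Python sketch of
what happens to the prefix "math" when the input is UTF-16.  It is
not taken from git: the helper c_string_view() is a made-up stand-in
for how C string functions stop at the first NUL byte, and it assumes
little-endian UTF-16 without a byte-order mark (iconv may emit a BOM
first).

```python
def c_string_view(raw):
    """Hypothetical helper: the part of raw that a C str* function sees."""
    # C strings end at the first NUL byte, so drop everything after it.
    return raw.split(b"\x00", 1)[0]

PREFIX = b"refs/heads/"

ascii_name = "master".encode("ascii")
# Assumption: little-endian UTF-16 with no BOM.
utf16_name = "master".encode("utf-16-le")

ascii_ref = PREFIX + c_string_view(ascii_name)
utf16_ref = PREFIX + c_string_view(utf16_name)

print(ascii_ref)  # the ref you hoped for
print(utf16_ref)  # cut off after the first character
```

The ASCII input comes out as refs/heads/master, while the UTF-16
input is cut off after its first byte because every second byte is NUL.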

That's one small reason why the useful character encodings put
ASCII in the 0..127 range, including UTF-8, Big5 and Shift-JIS.
ASCII is indeed special due to its legacy, and both C and Python
recognize this.
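
A quick way to check this from Python, as a sketch only: the codec
names below are Python's spellings, and the sample ref is an arbitrary
string made of letters and "/".  It encodes the same text with each
codec and compares the result with the plain ASCII bytes.

```python
ref = "refs/heads/master"
expected = ref.encode("ascii")

# Python's codec names for the encodings mentioned above.
encodings = ["utf-8", "big5", "shift_jis"]
results = {enc: ref.encode(enc) for enc in encodings}

for enc, raw in results.items():
    # Each of these yields the same bytes as ASCII for this string.
    print(enc, raw == expected)

# UTF-16 is the counterexample: its bytes differ from the ASCII ones.
utf16_differs = ref.encode("utf-16-le") != expected
print("utf-16-le differs:", utf16_differs)
```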

> diff --git a/git_remote_helpers/git/importer.py b/git_remote_helpers/git/importer.py
> @@ -18,13 +18,16 @@ class GitImporter(object):
>  
>      def get_refs(self, gitdir):
>          """Returns a dictionary with refs.
> +
> +        Note that the keys in the returned dictionary are byte strings as
> +        read from git.
>          """
>          args = ["git", "--git-dir=" + gitdir, "for-each-ref", "refs/heads"]
> -        lines = check_output(args).strip().split('\n')
> +        lines = check_output(args).strip().split('\n'.encode('utf-8'))
>          refs = {}
>          for line in lines:
> -            value, name = line.split(' ')
> -            name = name.strip('commit\t')
> +            value, name = line.split(' '.encode('utf-8'))
> +            name = name.strip('commit\t'.encode('utf-8'))
>              refs[name] = value
>          return refs

I'd suggest for this Python conundrum using byte-string literals, e.g.:

        lines = check_output(args).strip().split(b'\n')
        value, name = line.split(b' ')
        name = name.strip(b'commit\t')

Essentially identical to what you have, but avoids naming "utf-8" as
the encoding.  It instead relies on Python's rule that a bytes literal
may contain only ASCII characters, each standing for its own byte
value, which is exactly how C treats string literals.
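
For illustration, a self-contained sketch of the same parsing with
byte-string literals.  The sample output and its hashes are invented,
and parse_refs() is a hypothetical stand-in for get_refs() that skips
the call to git.  One deliberate substitution: it splits on the tab
instead of calling strip(b'commit\t').

```python
# Hypothetical sample of `git for-each-ref refs/heads` output.
# The hashes are made up for illustration.
sample_output = (
    b"1111111111111111111111111111111111111111 commit\trefs/heads/master\n"
    b"2222222222222222222222222222222222222222 commit\trefs/heads/next\n"
)

def parse_refs(output):
    """Map ref names to object names, keeping both as byte strings."""
    refs = {}
    for line in output.strip().split(b"\n"):
        # Each line looks like: <hash> SP <type> TAB <refname>
        value, rest = line.split(b" ", 1)
        objtype, name = rest.split(b"\t", 1)
        refs[name] = value
    return refs

refs = parse_refs(sample_output)

# The bytes literal and the encoded str literal are the same object value.
same_literal = b"\n" == "\n".encode("utf-8")

# strip() removes a set of bytes, not a prefix, so it can eat too much.
over_stripped = b"commit\trefs/heads/next".strip(b"commit\t")
```

The reason for the substitution is visible in over_stripped: strip()
treats its argument as a set of bytes to remove from both ends, so a
ref such as refs/heads/next loses its trailing "t".  The b'' prefix is
also accepted by Python 2.6 and later, so the same source should run
under both major versions.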

		-- Pete