Re: [RFC/PATCH 2/8 v3] git_remote_helpers: fix input when running under Python 3

John Keeping <john@xxxxxxxxxxxxx> · Wed, 16 Jan 2013 09:45:34 +0000

On Tue, Jan 15, 2013 at 07:03:16PM -0500, Pete Wyckoff wrote:
> john@xxxxxxxxxxxxx wrote on Tue, 15 Jan 2013 22:40 +0000:
>> This is what keeping the refs as byte strings looks like.
> 
> As John knows, it is not possible to interpret text from a byte
> string without talking about the character encoding.
> 
> Git is (largely) a C program and uses the character set defined
> in the C standard, which is a subset of ASCII.  But git does
> "math" on strings, like this snippet that takes something from
> argv[] and prepends "refs/heads/":
> 
>     strcpy(refname, "refs/heads/");
>     strcpy(refname + strlen("refs/heads/"), ret->name);
> 
> The result doesn't talk about what character set it is using,
> but because it combines a prefix from ASCII with its input,
> git makes the assumption that the input is ASCII-compatible.
> 
> If you feed a UTF-16 string in argv, e.g.
> 
>     $ echo master | iconv -f ascii -t utf16 | xargs git branch
>     xargs: Warning: a NUL character occurred in the input.  It cannot be passed through in the argument list.  Did you mean to use the --null option?
>     fatal: Not a valid object name: ''.
> 
> you get an error about NUL, and not the branch you hoped for.
> Git assumes that the input character set contains roughly ASCII
> in byte positions 0..127.
> 
> That's one small reason why the useful character encodings put
> ASCII in the 0..127 range, including utf-8, big5 and shift-jis.
> ASCII is indeed special due to its legacy, and both C and Python
> recognize this.
> 
>> diff --git a/git_remote_helpers/git/importer.py b/git_remote_helpers/git/importer.py
>> @@ -18,13 +18,16 @@ class GitImporter(object):
>>  
>>      def get_refs(self, gitdir):
>>          """Returns a dictionary with refs.
>> +
>> +        Note that the keys in the returned dictionary are byte strings as
>> +        read from git.
>>          """
>>          args = ["git", "--git-dir=" + gitdir, "for-each-ref", "refs/heads"]
>> -        lines = check_output(args).strip().split('\n')
>> +        lines = check_output(args).strip().split('\n'.encode('utf-8'))
>>          refs = {}
>>          for line in lines:
>> -            value, name = line.split(' ')
>> -            name = name.strip('commit\t')
>> +            value, name = line.split(' '.encode('utf-8'))
>> +            name = name.strip('commit\t'.encode('utf-8'))
>>              refs[name] = value
>>          return refs
> 
> I'd suggest for this Python conundrum using byte-string literals, e.g.:
> 
>         lines = check_output(args).strip().split(b'\n')
>         value, name = line.split(b' ')
>         name = name.strip(b'commit\t')
> 
> Essentially identical to what you have, but avoids naming "utf-8" as
> the encoding.  It instead relies on Python's interpretation of
> ASCII characters in string context, which is exactly what C does.

The problem is that, AFAICT, the byte-string prefix is only documented
in the language reference from Python 2.7 onwards (compare [1] and [2]).
If 2.6 really does reject b'' literals, we need this more convoluted
code to keep supporting it (although 'ascii' would probably be a better
choice than 'utf-8').

[1] http://docs.python.org/2.6/reference/lexical_analysis.html#literals
[2] http://docs.python.org/2.7/reference/lexical_analysis.html#literals
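
For illustration, here is a minimal sketch of what the 'ascii' variant
could look like if we cannot rely on the b'' prefix.  The names
parse_refs, NL, SP and TAB are hypothetical and not part of the patch,
and the sample bytes stand in for real "git for-each-ref" output.  It
also splits on the tab instead of calling strip(), since strip() treats
its argument as a set of characters rather than as a prefix.

```python
# Byte-string separators built without the b'' prefix.  For pure-ASCII
# text, 'ascii' and 'utf-8' produce identical bytes.
NL = '\n'.encode('ascii')
SP = ' '.encode('ascii')
TAB = '\t'.encode('ascii')


def parse_refs(output):
    """Map ref names to object ids, keeping both as byte strings.

    Hypothetical helper: expects lines of the form
    "<sha> <type>\\t<refname>", as printed by "git for-each-ref".
    """
    refs = {}
    for line in output.strip().split(NL):
        value, rest = line.split(SP, 1)
        objtype, name = rest.split(TAB, 1)
        refs[name] = value
    return refs


# Made-up sample data in place of a real check_output() call.
sample = (
    "1111111111111111111111111111111111111111 commit\trefs/heads/master\n"
    "2222222222222222222222222222222222222222 commit\trefs/heads/topic\n"
).encode('ascii')

refs = parse_refs(sample)
```

The keys stay byte strings throughout, so callers would have to look
refs up with encoded names, e.g. refs['refs/heads/master'.encode('ascii')].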

John