On Tue, Jan 15, 2013 at 07:03:16PM -0500, Pete Wyckoff wrote:
> john@xxxxxxxxxxxxx wrote on Tue, 15 Jan 2013 22:40 +0000:
>> This is what keeping the refs as byte strings looks like.
>
> As John knows, it is not possible to interpret text from a byte
> string without talking about the character encoding.
>
> Git is (largely) a C program and uses the character set defined
> in the C standard, which is a subset of ASCII.  But git does
> "math" on strings, like this snippet that takes something from
> argv[] and prepends "refs/heads/":
>
>     strcpy(refname, "refs/heads/");
>     strcpy(refname + strlen("refs/heads/"), ret->name);
>
> The result doesn't talk about what character set it is using,
> but because it combines a prefix from ASCII with its input,
> git makes the assumption that the input is ASCII-compatible.
>
> If you feed a UTF-16 string in argv, e.g.
>
>     $ echo master | iconv -f ascii -t utf16 | xargs git branch
>     xargs: Warning: a NUL character occurred in the input.  It cannot be passed through in the argument list.  Did you mean to use the --null option?
>     fatal: Not a valid object name: ''.
>
> you get an error about NUL, and not the branch you hoped for.
> Git assumes that the input character set contains roughly ASCII
> in byte positions 0..127.
>
> That's one small reason why the useful character encodings put
> ASCII in the 0..127 range, including utf-8, big5 and shift-jis.
> ASCII is indeed special due to its legacy, and both C and Python
> recognize this.
>
>> diff --git a/git_remote_helpers/git/importer.py b/git_remote_helpers/git/importer.py
>> @@ -18,13 +18,16 @@ class GitImporter(object):
>>
>>      def get_refs(self, gitdir):
>>          """Returns a dictionary with refs.
>> +
>> +        Note that the keys in the returned dictionary are byte strings as
>> +        read from git.
>>          """
>>          args = ["git", "--git-dir=" + gitdir, "for-each-ref", "refs/heads"]
>> -        lines = check_output(args).strip().split('\n')
>> +        lines = check_output(args).strip().split('\n'.encode('utf-8'))
>>          refs = {}
>>          for line in lines:
>> -            value, name = line.split(' ')
>> -            name = name.strip('commit\t')
>> +            value, name = line.split(' '.encode('utf-8'))
>> +            name = name.strip('commit\t'.encode('utf-8'))
>>              refs[name] = value
>>          return refs
>
> I'd suggest for this Python conundrum using byte-string literals, e.g.:
>
>     lines = check_output(args).strip().split(b'\n')
>     value, name = line.split(b' ')
>     name = name.strip(b'commit\t')
>
> Essentially identical to what you have, but avoids naming "utf-8" as
> the encoding.  It instead relies on Python's interpretation of
> ASCII characters in string context, which is exactly what C does.

The problem is that AFAICT the byte-string prefix is only available in
Python 2.7 and later (compare [1] and [2]).  I think we need this more
convoluted code if we want to keep supporting Python 2.6 (although
perhaps 'ascii' would be a better choice than 'utf-8').

[1] http://docs.python.org/2.6/reference/lexical_analysis.html#literals
[2] http://docs.python.org/2.7/reference/lexical_analysis.html#literals


John
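
For illustration, here is a rough sketch of the byte-string parsing with
the separators built via .encode('ascii'), which avoids the b'' prefix
altogether.  The names parse_refs, NEWLINE, SPACE and TAB, and the
sample input, are hypothetical and not part of git_remote_helpers.  It
assumes the default for-each-ref output of
"<objectname> <objecttype>\t<refname>" and takes the raw output as an
argument so that it can be tried without a repository.

```python
# Separators as byte strings, built without the b'' literal prefix.
NEWLINE = '\n'.encode('ascii')
SPACE = ' '.encode('ascii')
TAB = '\t'.encode('ascii')


def parse_refs(output):
    """Map ref names to object names, keeping both as byte strings.

    `output` is the raw bytes printed by `git for-each-ref`, assumed to
    be in the default "<objectname> <objecttype>\\t<refname>" format.
    """
    refs = {}
    for line in output.strip().split(NEWLINE):
        if not line:
            continue  # empty output yields one empty line
        value, rest = line.split(SPACE, 1)
        # Split on the tab rather than strip()ing 'commit\t': strip()
        # takes a set of characters and would also trim the tail of a
        # name such as refs/heads/topic.
        _objtype, name = rest.split(TAB, 1)
        refs[name] = value
    return refs


# Hypothetical sample standing in for check_output(args).
sample = (
    "1111111111111111111111111111111111111111 commit\trefs/heads/master\n"
    "2222222222222222222222222222222222222222 commit\trefs/heads/topic\n"
).encode('ascii')
refs = parse_refs(sample)
```

The keys stay byte strings, as in the docstring of the patch; a caller
that wants text would have to decode them with whatever encoding it
chooses to assume.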