Re: [PATCH 2/8] git_remote_helpers: fix input when running under Python 3

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Jan 14, 2013 at 05:48:18AM +0100, Michael Haggerty wrote:
> On 01/13/2013 05:17 PM, John Keeping wrote:
>> On Sun, Jan 13, 2013 at 04:26:39AM +0100, Michael Haggerty wrote:
>>> On 01/12/2013 08:23 PM, John Keeping wrote:
>>>> Although 2to3 will fix most issues in Python 2 code to make it run under
>>>> Python 3, it does not handle the new strict separation between byte
>>>> strings and unicode strings.  There is one instance in
>>>> git_remote_helpers where we are caught by this.
>>>>
>>>> Fix it by explicitly decoding the incoming byte string into a unicode
>>>> string.  In this instance, use the locale under which the application is
>>>> running.
>>>>
>>>> Signed-off-by: John Keeping <john@xxxxxxxxxxxxx>
>>>> ---
>>>>  git_remote_helpers/git/importer.py | 2 +-
>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/git_remote_helpers/git/importer.py b/git_remote_helpers/git/importer.py
>>>> index e28cc8f..6814003 100644
>>>> --- a/git_remote_helpers/git/importer.py
>>>> +++ b/git_remote_helpers/git/importer.py
>>>> @@ -20,7 +20,7 @@ class GitImporter(object):
>>>>          """Returns a dictionary with refs.
>>>>          """
>>>>          args = ["git", "--git-dir=" + gitdir, "for-each-ref", "refs/heads"]
>>>> -        lines = check_output(args).strip().split('\n')
>>>> +        lines = check_output(args).decode().strip().split('\n')
>>>>          refs = {}
>>>>          for line in lines:
>>>>              value, name = line.split(' ')
>>>>
>>>
>>> Won't this change cause an exception if the branch names are not all
>>> valid strings in the current locale's encoding?  I don't see how this
>>> assumption is justified (e.g., see git-check-ref-format(1) for the rules
>>> governing reference names).
>> 
>> Yes it will.  The problem is that for Python 3 we need to decode the
>> byte string into a unicode string, which means we need to know what
>> encoding it is.
>> 
>> I don't think we can just say "git-for-each-ref will print refs in
>> UTF-8" since AFAIK git doesn't care what encoding the refs are in - I
>> suspect that's determined by the filesystem which in the end probably
>> maps to whatever bytes the shell fed git when the ref was created.
>> 
>> That's why I chose the current locale in this case.  I'm hoping someone
>> here will correct me if we can do better, but I don't see any way of
>> avoiding choosing some encoding here if we want to support Python 3
>> (which I think we will, even if we don't right now).
> 
> I'm not just trying to be a nuisance here;

You're not being - I think this is a difficult issue and I don't know
myself what the right answer is.

>                                            I'm struggling myself to
> understand how a program that cares about strings-vs-bytes (e.g., a
> Python3 script) should coexist with a program that doesn't (e.g., git
> [1]).  I think this will become a big issue if my Python version of the
> commit email script ever gets integrated and then made compatible with
> Python3.
> 
> You claim "for Python 3 we need to decode the byte string into a unicode
> string".  I understand that Python 3 strings are Unicode, but why/when
> is it necessary to decode data into a Unicode string as opposed to
> leaving it as a byte sequence?
> 
> In this particular case (from a cursory look over the code) it seems to
> me that (1) decoding to Unicode will sometimes fail for data that git
> considers valid and (2) there is no obvious reason that the data cannot
> be processed as byte sequences.

I've been thinking about this overnight and I think you're right that
treating them as byte strings in Python is most correct.  Having said
that, when I'm programming in Python I would find it quite surprising
that I had to treat ref strings specially - and as soon as I want to use
one in a string context (e.g. printing it as part of a message to the
user) I'm back to the same problem.

So I think we should try to solve the problem once rather than forcing
everyone who wants to use the library to solve it individually.  I just
wish it was obvious what we should do!


John
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]