Re: [PATCH v7 19/31] merge-recursive: add get_directory_renames()

Elijah Newren <newren@xxxxxxxxx> · Sat, 3 Feb 2018 18:04:39 -0800

On Sat, Feb 3, 2018 at 2:32 PM, Elijah Newren <newren@xxxxxxxxx> wrote:
> On Fri, Feb 2, 2018 at 5:02 PM, Stefan Beller <sbeller@xxxxxxxxxx> wrote:
>> On Tue, Jan 30, 2018 at 3:25 PM, Elijah Newren <newren@xxxxxxxxx> wrote:

>>> +       while (*--end_of_new == *--end_of_old &&
>>> +              end_of_old != old_path &&
>>> +              end_of_new != new_path)
>>> +               ; /* Do nothing; all in the while loop */
>>
>> We have to compare manually as we'd want to find
>> the first non-equal and there doesn't seem to be a good
>> library function for that.
>>
>> Assuming many repos are UTF8 (including in their paths),
>> how does this work with display characters longer than one char?
>> It should be fine as we cut at the slash?
>
> Oh, UTF-8.  Ugh.
> Can UTF-8 characters, other than '/', have a byte whose value matches
> (unsigned char)('/')?  If so, then I'll need to figure out how to do
> utf-8 character parsing.  Anyone have pointers?

Well, after digging around for a while, I found this claim on the
Wikipedia page for UTF-8:

  Since ASCII bytes do not occur when encoding non-ASCII code points
  into UTF-8, UTF-8 is safe to use within most programming and document
  languages that interpret certain ASCII characters in a special way,
  such as "/" in filenames, "\" in escape sequences, and "%" in printf.

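So, unless I'm reading something wrong here, this code is fine as it
is: every byte of a multi-byte UTF-8 sequence has its high bit set, so
a byte equal to '/' can only ever be an actual slash, and comparing
byte by byte back to a slash never cuts a character in half.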
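
For anyone who wants to check that claim rather than take Wikipedia's
word for it, here is a rough standalone sketch (not part of the patch).
It encodes every non-ASCII code point and looks for any byte below 0x80
in the result. The helper names encode_utf8() and
count_ascii_collisions() are made up for illustration; nothing like
them exists in merge-recursive.c.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: encode code point cp as UTF-8 into out.
 * Returns the number of bytes written (1-4), or 0 if cp is not
 * encodable (a surrogate, or above U+10FFFF).
 */
static size_t encode_utf8(uint32_t cp, unsigned char out[4])
{
	if (cp < 0x80) {
		out[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {
		out[0] = (unsigned char)(0xC0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp < 0x10000) {
		if (cp >= 0xD800 && cp <= 0xDFFF)
			return 0; /* surrogates have no UTF-8 encoding */
		out[0] = (unsigned char)(0xE0 | (cp >> 12));
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	}
	if (cp <= 0x10FFFF) {
		out[0] = (unsigned char)(0xF0 | (cp >> 18));
		out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
		out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[3] = (unsigned char)(0x80 | (cp & 0x3F));
		return 4;
	}
	return 0;
}

/*
 * Hypothetical check: count the non-ASCII code points whose UTF-8
 * encoding contains at least one byte in the ASCII range (< 0x80).
 * If the claim holds, no such code point exists.
 */
static unsigned long count_ascii_collisions(void)
{
	unsigned long collisions = 0;
	uint32_t cp;

	for (cp = 0x80; cp <= 0x10FFFF; cp++) {
		unsigned char buf[4];
		size_t i, len = encode_utf8(cp, buf);

		for (i = 0; i < len; i++) {
			if (buf[i] < 0x80) {
				collisions++;
				break;
			}
		}
	}
	return collisions;
}
```

Calling count_ascii_collisions() from a small main() should report zero
collisions if the claim holds, since lead bytes are 0xC0 or above and
continuation bytes fall in 0x80..0xBF, while '/' is 0x2F.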