Re: [PATCH v7 19/31] merge-recursive: add get_directory_renames()

Eric Sunshine <sunshine@xxxxxxxxxxxxx> · Sat, 3 Feb 2018 23:42:58 -0500

On Sat, Feb 3, 2018 at 9:04 PM, Elijah Newren <newren@xxxxxxxxx> wrote:
> On Sat, Feb 3, 2018 at 2:32 PM, Elijah Newren <newren@xxxxxxxxx> wrote:
>> On Fri, Feb 2, 2018 at 5:02 PM, Stefan Beller <sbeller@xxxxxxxxxx> wrote:
>>> On Tue, Jan 30, 2018 at 3:25 PM, Elijah Newren <newren@xxxxxxxxx> wrote:
>>>> +       while (*--end_of_new == *--end_of_old &&
>>>> +              end_of_old != old_path &&
>>>> +              end_of_new != new_path)
>>>> +               ; /* Do nothing; all in the while loop */
>>>
>>> Assuming many repos are UTF8 (including in their paths),
>>> how does this work with display characters longer than one char?
>>> It should be fine as we cut at the slash?
>>
>> Can UTF-8 characters, other than '/', have a byte whose value matches
>> (unsigned char)('/')?  If so, then I'll need to figure out how to do
>> utf-8 character parsing.  Anyone have pointers?
>
> Well, after digging around for a while, I found this claim on the
> Wikipedia page for UTF-8:
>
>   Since ASCII bytes do not occur when encoding non-ASCII code points
> into UTF-8, UTF-8 is safe to use within most programming and document
> languages that interpret certain ASCII characters in a special way,
> such as "/" in filenames, "\" in escape sequences, and "%" in printf.
>
> So, unless I'm reading something wrong here, I think that means this
> code is just fine as it is.

You're reading it correctly. Unicode code points greater than U+007F,
when encoded as UTF-8, consist entirely of bytes with the high bit set
(0x80 through 0xFF), so none of those bytes can be confused with any
7-bit ASCII character, '/' included.
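
If you want to convince yourself, here is a rough sketch (my own
illustration, not code from the patch or from Git) that encodes every
code point from U+0080 through U+10FFFF and counts how many encodings
contain a byte below 0x80. The helper names utf8_encode(),
encoding_has_ascii_byte() and count_ascii_collisions() are hypothetical,
and the encoder makes no attempt to reject surrogates:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: encode code point cp (assumed <= 0x10FFFF) as
 * UTF-8 into out[] and return the number of bytes written.
 */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
	if (cp < 0x80) {
		out[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {
		out[0] = (unsigned char)(0xC0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp < 0x10000) {
		out[0] = (unsigned char)(0xE0 | (cp >> 12));
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	}
	out[0] = (unsigned char)(0xF0 | (cp >> 18));
	out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
	out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
	out[3] = (unsigned char)(0x80 | (cp & 0x3F));
	return 4;
}

/* Return 1 if any byte of the UTF-8 encoding of cp is below 0x80. */
static int encoding_has_ascii_byte(uint32_t cp)
{
	unsigned char buf[4];
	size_t i, n = utf8_encode(cp, buf);

	for (i = 0; i < n; i++)
		if (buf[i] < 0x80)
			return 1;
	return 0;
}

/* Count non-ASCII code points whose encoding contains an ASCII byte. */
static unsigned long count_ascii_collisions(void)
{
	unsigned long count = 0;
	uint32_t cp;

	for (cp = 0x80; cp <= 0x10FFFF; cp++)
		count += (unsigned long)encoding_has_ascii_byte(cp);
	return count;
}
```

Every lead byte is OR'ed with at least 0xC0 and every continuation
byte with 0x80, so the count has to come out as zero, which is why
scanning backward for a '/' byte is safe.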

It's possible that Stefan was thinking of "combining characters"[1],
which allow the same character to be stored either "precomposed" or
"decomposed"[2] while appearing the same when rendered. For instance,
"ö" might be a single Unicode code point or two code points, such as
"o" followed by a combining diaeresis. That is a potential problem if
you're comparing, byte by byte, two filenames which look the same.
filenames which look the same. However, Git takes pains[3] to avoid
this problem by ensuring (if possible) that filenames are precomposed
within Git even if they happen to be decomposed on the actual
filesystem. So, most likely, your code is okay as-is.

[1]: https://en.wikipedia.org/wiki/Combining_character
[2]: https://en.wikipedia.org/wiki/Diaeresis_(diacritic)
[3]: https://github.com/git/git/blob/master/compat/precompose_utf8.c