On Sat, Feb 3, 2018 at 9:04 PM, Elijah Newren <newren@xxxxxxxxx> wrote:
> On Sat, Feb 3, 2018 at 2:32 PM, Elijah Newren <newren@xxxxxxxxx> wrote:
>> On Fri, Feb 2, 2018 at 5:02 PM, Stefan Beller <sbeller@xxxxxxxxxx> wrote:
>>> On Tue, Jan 30, 2018 at 3:25 PM, Elijah Newren <newren@xxxxxxxxx> wrote:
>>>> +       while (*--end_of_new == *--end_of_old &&
>>>> +              end_of_old != old_path &&
>>>> +              end_of_new != new_path)
>>>> +               ; /* Do nothing; all in the while loop */
>>>
>>> Assuming many repos are UTF8 (including in their paths),
>>> how does this work with display characters longer than one char?
>>> It should be fine as we cut at the slash?
>>
>> Can UTF-8 characters, other than '/', have a byte whose value matches
>> (unsigned char)('/')? If so, then I'll need to figure out how to do
>> utf-8 character parsing. Anyone have pointers?
>
> Well, after digging around for a while, I found this claim on the
> Wikipedia page for UTF-8:
>
>     Since ASCII bytes do not occur when encoding non-ASCII code points
>     into UTF-8, UTF-8 is safe to use within most programming and document
>     languages that interpret certain ASCII characters in a special way,
>     such as "/" in filenames, "\" in escape sequences, and "%" in printf.
>
> So, unless I'm reading something wrong here, I think that means this
> code is just fine as it is.

You're reading it correctly. Unicode values greater than \U007f
encoded with UTF-8 will never contain bytes which can be confused with
any 7-bit ASCII character.

It's possible that Stefan was thinking of "combining characters"[1]
which may be "precomposed" and "decomposed"[2], but which appear the
same when rendered. For instance, "ö" might be a single Unicode
codepoint or two codepoints, such as "o" combined with a diaeresis.
It's a potential problem if you're comparing, byte by byte, two
filenames which look the same. However, Git takes pains[3] to avoid
this problem by ensuring (if possible) that filenames are precomposed
within Git even if they happen to be decomposed on the actual
filesystem. So, most likely, your code is okay as-is.

[1]: https://en.wikipedia.org/wiki/Combining_character
[2]: https://en.wikipedia.org/wiki/Diaeresis_(diacritic)
[3]: https://github.com/git/git/blob/master/compat/precompose_utf8.c
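
In case it helps to see both points concretely, here is a rough,
standalone sketch (not taken from the patch or from Git's sources; all
of the function names are made up for illustration). It assumes the
paths are valid UTF-8. The first two helpers show why scanning for '/'
byte by byte is safe: every byte of a multi-byte sequence has its high
bit set, so it cannot equal 0x2F. The third compares two strings
backwards, loosely in the spirit of the loop quoted above, to show how
a precomposed and a decomposed "ö" stop matching at the byte level.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return 1 if 'c' belongs to a multi-byte UTF-8
 * sequence. Lead bytes and continuation bytes all have the high bit
 * set, whereas 7-bit ASCII (including '/') never does.
 */
static int is_multibyte_utf8_byte(unsigned char c)
{
	return (c & 0x80) != 0;
}

/*
 * Hypothetical helper: count the '/' bytes in a NUL-terminated path.
 * For valid UTF-8 input, each match is a real directory separator,
 * never a fragment of some other character.
 */
static size_t count_slashes(const char *path)
{
	size_t n = 0;

	assert(path != NULL);
	for (; *path; path++)
		if (*path == '/')
			n++;
	return n;
}

/*
 * Hypothetical helper: number of trailing bytes that two strings have
 * in common, compared byte by byte from the end.
 */
static size_t common_suffix_len(const char *a, const char *b)
{
	size_t la = strlen(a), lb = strlen(b), n = 0;

	while (n < la && n < lb && a[la - 1 - n] == b[lb - 1 - n])
		n++;
	return n;
}
```

With "d\xc3\xb6" "/file.c" (precomposed "ö", U+00F6) and
"do\xcc\x88" "/file.c" (decomposed, "o" followed by U+0308), both
paths contain exactly one '/' byte, and common_suffix_len() returns 7:
the shared "/file.c". The match ends there because the byte before the
slash is 0xb6 in one string and 0x88 in the other, even though the two
directory names render identically.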