Re: [PATCH] t3910: show failure of core.precomposeunicode with decomposed filenames

Jeff King <peff@xxxxxxxx> · Tue, 29 Apr 2014 14:02:10 -0400

On Tue, Apr 29, 2014 at 10:12:52AM -0700, Junio C Hamano wrote:

> Jeff King <peff@xxxxxxxx> writes:
> 
> > This patch just adds a test to demonstrate the breakage.
> > Some possible fixes are:
> >
> >   1. Tell everyone that NFD in the git repo is wrong, and
> >      they should make a new commit to normalize all their
> >      in-repo files to be precomposed.
> >
> >      This is probably not the right thing to do, because it
> >      still doesn't fix checkouts of old history. And it
> >      spreads the problem to people on byte-preserving
> >      filesystems (like ext4), because now they have to start
> >      precomposing their filenames as they are added to git.
> 
> Hmm, have we taught "compare precomposed" to the codepaths that
> compare two trees, or a tree and the index, too?  Otherwise, we
> would have the same issue with commits in the old history.

Ugh, yeah, I didn't think about that codepath. I think we would not want
to precompose in that case. IOW, git works byte-wise internally, but it
is only at the filesystem layer that we do such munging. The index
straddles the line between the filesystem and git's internal
representations.
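
To make the "byte-wise" point concrete, here is a small sketch in
Python (not git code; the filename is made up for illustration). The
same visible name has different UTF-8 bytes in precomposed (NFC) and
decomposed (NFD) form, so a plain byte comparison treats them as two
different paths:

```python
import unicodedata

# "\u00c4" is the precomposed (NFC) form of A-with-diaeresis.
nfc_name = "\u00c4.txt"
# NFD splits it into "A" followed by the combining diaeresis U+0308.
nfd_name = unicodedata.normalize("NFD", nfc_name)

nfc_bytes = nfc_name.encode("utf-8")
nfd_bytes = nfd_name.encode("utf-8")

print(nfc_bytes)               # bytes of the precomposed name
print(nfd_bytes)               # bytes of the decomposed name
print(nfc_bytes == nfd_bytes)  # a byte-wise comparison sees two names
```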

I think my "keep the normalized names alongside index entries" approach
might still work there. But it means that we compare against the "real"
byte-wise names on the tree side, and against the normalized names on
the path (filesystem) side. That in turn means having two
comparison/lookup functions for the index, and always using the right
one. And algorithms that rely on traversing two sorted lists cannot work
in both directions, because the byte-wise and normalized orderings can
differ.
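
A toy model of that idea, in Python rather than git's C, might look
like the sketch below. ToyIndex, lookup_tree_side and lookup_fs_side
are hypothetical names invented for this illustration, and it assumes
all names are valid UTF-8. It also shows why a sorted-list walk gets
awkward: the raw and normalized orderings of the same entries can
disagree.

```python
import unicodedata

def nfc_bytes(raw: bytes) -> bytes:
    """Return the NFC-normalized UTF-8 form of a raw UTF-8 name."""
    return unicodedata.normalize("NFC", raw.decode("utf-8")).encode("utf-8")

class ToyIndex:
    """Hypothetical index keeping normalized names alongside raw ones."""

    def __init__(self, raw_names):
        self.raw = set(raw_names)
        # Side table: normalized name -> the raw name actually stored.
        self.by_normalized = {nfc_bytes(r): r for r in raw_names}

    def lookup_tree_side(self, raw_name: bytes):
        # Tree comparisons stay byte-wise: exact match only.
        return raw_name if raw_name in self.raw else None

    def lookup_fs_side(self, fs_name: bytes):
        # Filesystem names are matched after normalization.
        return self.by_normalized.get(nfc_bytes(fs_name))

nfd_a = "A\u0308".encode("utf-8")  # decomposed, as stored in the repo
nfc_a = "\u00c4".encode("utf-8")   # precomposed, as readdir might report
index = ToyIndex([nfd_a, b"B"])

print(index.lookup_tree_side(nfc_a))  # byte-wise lookup misses
print(index.lookup_fs_side(nfc_a))    # normalized lookup finds nfd_a

# The same entries sort differently under the two comparisons.
raw_order = sorted(index.raw)
normalized_order = sorted(index.raw, key=nfc_bytes)
print(raw_order != normalized_order)
```

Every caller has to pick the right one of the two lookups, which is
the maintenance burden described above.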

> Do we have a similar issue for older commits in a history under
> "ignore-case" as well?

I don't think so, because we handle ignorecase completely differently.
There we use the name-hash with a case-insensitive hash and a
case-insensitive comparison function. And we use strcasecmp liberally
throughout the code.
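
For illustration only, the shape of that scheme can be sketched in
Python. This is not git's name-hash implementation; ToyNameHash and its
methods are hypothetical, and str.lower() stands in for whatever case
folding the real hash and strcasecmp perform. The point is just that
both the hash and the comparison fold case, while the stored name keeps
its original spelling:

```python
class ToyNameHash:
    """Hypothetical case-insensitive name table (not git's code)."""

    def __init__(self):
        self.buckets = {}

    @staticmethod
    def _hash(name: str) -> int:
        # Case-insensitive hash: fold case before hashing.
        return hash(name.lower())

    def add(self, name: str) -> None:
        self.buckets.setdefault(self._hash(name), []).append(name)

    def find(self, name: str):
        # Case-insensitive comparison within the bucket.
        for candidate in self.buckets.get(self._hash(name), []):
            if candidate.lower() == name.lower():
                return candidate  # original spelling is preserved
        return None

table = ToyNameHash()
table.add("Makefile")
print(table.find("MAKEFILE"))
print(table.find("README"))
```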

I don't think we have a "str_utf8_cmp" that ignores normalizations (or
maybe strcoll will do this?). But in theory we could use it everywhere
we use strcasecmp for ignore_case. And then we would not need to have
our readdir wrapper, maybe? I admit I haven't thought that much about
_either_ approach. But aside from some bugs in the hash system, I do not
recall seeing any design problems in the ignorecase code.
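
As a rough sketch of what such a comparison could do (in Python, since
no "str_utf8_cmp" exists in git today; the function below is
hypothetical and assumes both inputs are valid UTF-8), one option is to
normalize both sides to NFC and then compare the resulting bytes,
returning a strcmp-style sign:

```python
import unicodedata

def str_utf8_cmp(a: bytes, b: bytes) -> int:
    """Compare two UTF-8 names, ignoring NFC/NFD differences.

    Returns a negative, zero, or positive value, like strcmp().
    """
    na = unicodedata.normalize("NFC", a.decode("utf-8")).encode("utf-8")
    nb = unicodedata.normalize("NFC", b.decode("utf-8")).encode("utf-8")
    return (na > nb) - (na < nb)

nfc = "\u00c4.txt".encode("utf-8")
nfd = "A\u0308.txt".encode("utf-8")

print(str_utf8_cmp(nfc, nfd))          # same name, different bytes
print(str_utf8_cmp(b"a.txt", b"b.txt"))
```

Normalizing on every comparison is the simplest thing to write, not
necessarily the cheapest; a real implementation would likely want to
avoid the repeated conversions.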

-Peff