Re: Git clone and case sensitivity

Jeff Hostetler <git@xxxxxxxxxxxxxxxxx> · Tue, 31 Jul 2018 15:39:41 -0400

On 7/29/2018 5:28 AM, Jeff King wrote:
On Sun, Jul 29, 2018 at 07:26:41AM +0200, Duy Nguyen wrote:

strcasecmp() will only catch a subset of the cases. We really need to
follow the same folding rules that the filesystem would.

True. But that's how we handle case insensitivity internally. If a
filesystem has more sophisticated folding rules then git will not work
well on that one anyway.

Hrm. Yeah, I guess that's the best we can do for the actual in-memory
checks. Everything else depends on doing an actual filesystem operation,
and our icase stuff kicks in way before then. I was mostly thinking of
HFS+ utf8 normalization weirdness, but I guess people are accustomed to
that by now.

For the case of clone, I actually wonder if we could detect during the
checkout step that a file already exists. Since we know that the
directory we started with was empty, then if it does, either:

   - there's some funny case-folding going on that means two paths in the
     repository map to the same name in the filesystem; or

   - somebody else is writing to the directory at the same time as us

This is exactly what my first patch does (minus the sparse checkout
part).
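
A minimal sketch of that existence check, written in Python rather than
Git's C and not taken from the patch under discussion. It assumes the
target directory starts out empty and uses O_CREAT|O_EXCL so that creating
a name the filesystem already considers present fails. The function name
checkout_into_empty_dir is hypothetical.

```python
import os


def checkout_into_empty_dir(root, paths):
    """Create each path under root; return the paths whose name already existed.

    Assumes root was empty when the checkout started, so an "already exists"
    failure means either a folding collision or a concurrent writer.
    """
    collisions = []
    for path in paths:
        full = os.path.join(root, path)
        # Note: colliding directory names are silently merged by makedirs;
        # this sketch only detects collisions on the final file name.
        os.makedirs(os.path.dirname(full), exist_ok=True)
        try:
            # O_EXCL makes the open fail if the filesystem resolves this
            # name to an existing entry, whatever folding rules it applies.
            fd = os.open(full, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
        except FileExistsError:
            collisions.append(path)
            continue
        os.close(fd)
    return collisions
```

On a case-insensitive filesystem, checking out "README" and then "readme"
would report only "readme", which is the "N-1 of N" limitation mentioned
in the thread.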

Right, sorry, I should have read that one more carefully.

But without knowing the exact folding rules, I don't think we can
locate this "somebody else" who wrote the first path. So if N paths
are treated the same by this filesystem, we could only report N-1 of
them.

If we want to report just one path when this happens though, then this
works quite well.

Hmm. Since most such systems are case-preserving, would it be possible
to report the name of the existing file? Doing it via opendir/readdir is
hacky, and anyway puts the burden on us to find the matching name. Doing
it via fstat() on the opened file doesn't work because at that point the
filesystem has already resolved the name to an inode.
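
For illustration, the readdir approach could look roughly like the Python
sketch below. It uses str.casefold() as a stand-in for the filesystem's
folding rules, which is an assumption and exactly the burden described
above: if the real rules differ, the lookup misses. The name
find_existing_names is hypothetical.

```python
import os


def find_existing_names(directory, wanted):
    """Return directory entries that match `wanted` under simple case folding.

    casefold() only approximates what a real filesystem does; it knows
    nothing about normalization, trailing dots, or short names.
    """
    key = wanted.casefold()
    return [name for name in os.listdir(directory) if name.casefold() == key]
```

On a case-preserving filesystem this recovers the spelling that was
written first, so a message could name both sides of the collision.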

So yeah, perhaps strcasecmp() is the best we can do (I do agree that
being able to mention all of the conflicting names is a benefit).

I guess we should be using fspathcmp(), though, in case it later learns
to be smarter.

-Peff

As has already been mentioned, this gets into weird territory really
fast, between case folding, final space/dot on Windows, utf8 NFC/NFD
weirdness on the Mac, utf8 invisible chars on the Mac, long/short names
on Windows, and so on.
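
A small Python sketch of the NFC/NFD part of this: the same visible name
can be two different code point sequences, so a plain case-insensitive
comparison does not treat them as equal. The helper same_name_loosely is
hypothetical and only approximates what any particular filesystem does.

```python
import unicodedata

# "café" in precomposed form (NFC) and decomposed form (NFD).
nfc = unicodedata.normalize("NFC", "caf\u00e9")
nfd = unicodedata.normalize("NFD", "caf\u00e9")


def same_name_loosely(a, b):
    """Compare two names after NFC normalization and case folding.

    This is an approximation for illustration, not the rule set of HFS+,
    APFS, or NTFS.
    """
    def fold(s):
        return unicodedata.normalize("NFC", s).casefold()

    return fold(a) == fold(b)
```

Here nfc has 4 code points and nfd has 5, and nfc.lower() != nfd.lower(),
while same_name_loosely(nfc, nfd.upper()) is true.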

And that's just for filenames.  Things really get weird if directory
names have these ambiguities.

Perhaps just print the problematic paths (where the collision is
detected) and let the user decide how to correct them.

Perhaps we could have a separate tool that could scan the index or
commit for potential conflicts and warn the user in advance (granted, it
might not be perfect and may report a few false positives).
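
As a rough idea of what such a scan could do, here is a Python sketch. It
takes a list of repository paths (for example, the output of
"git ls-files -z" split on NUL) and groups every file and directory prefix
by a folded key. The folding used here (NFC plus casefold) is an
assumption, so it can both miss real collisions and report false
positives. The names fold and find_potential_collisions are hypothetical.

```python
import unicodedata
from collections import defaultdict


def fold(name):
    # Assumed folding: NFC normalization plus Unicode case folding.
    return unicodedata.normalize("NFC", name).casefold()


def find_potential_collisions(paths):
    """Return groups of distinct spellings that share one folded key.

    Directory prefixes are checked as well as full paths, since ambiguous
    directory names collide too.
    """
    seen = defaultdict(set)
    for path in paths:
        parts = path.split("/")
        for i in range(1, len(parts) + 1):
            prefix = "/".join(parts[:i])
            seen[fold(prefix)].add(prefix)
    return sorted(sorted(group) for group in seen.values() if len(group) > 1)
```

For the input ["Makefile", "makefile", "src/a.c", "SRC/b.c"], this reports
the groups ["Makefile", "makefile"] and ["SRC", "src"].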

Forcing users into a sparse-checkout situation might be over their
skill level.

Jeff