Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems

Jeff King <peff@xxxxxxxx> · Fri, 3 Aug 2018 14:53:26 -0400

On Fri, Aug 03, 2018 at 02:23:17PM -0400, Jeff Hostetler wrote:

> > Maybe. It might not work as ino_t. Or it might be expensive to get.  Or
> > maybe it's simply impossible. I don't know much about Windows. Some
> > searching implies that NTFS does have a "file index" concept which is
> > supposed to be unique.
> 
> This is hard and/or expensive on Windows.  Yes, you can get the
> "file index" values for an open file handle with a cost similar to
> an fstat().  Unfortunately, the FindFirst/FindNext routines (equivalent
> to the opendir/readdir routines), don't give you that data.  So we'd
> have to scan the directory and then open and stat each file.  This is
> terribly expensive on Windows -- and the reason we have the fscache
> layer (in the GfW version) to intercept the lstat() calls whenever
> possible.

I think that high cost might be OK for our purposes here. This code
would _only_ kick in during a clone, and then only on the error path
once we knew we had a collision during the checkout step.

> Another thing to keep in mind is that the collision could be because
> of case folding (or other such nonsense) on a directory in the path.
> I mean, if someone on Linux builds a commit containing:
> 
>     a/b/c/D/e/foo.txt
>     a/b/c/d/e/foo.txt
> 
> we'll get a similar collision as if one of them were spelled "FOO.txt".

True, though I think that may be OK. If you had conflicting directories
you'd get a _ton_ of duplicates listed, but that makes sense: you
actually have a ton of duplicates.

> Also, do we need to worry about hard-links or symlinks here?

I think we can ignore hardlinks. Git never creates them, and we know the
directory was empty when we started. Symlinks should be handled by using
lstat(). (Obviously that's for a Unix-ish platform).

> I'm sure there are other edge cases here that make reporting
> difficult; these are just a few I thought of.  I guess what I'm
> trying to say is that as a first step just report that you found
> a collision -- without trying to identify the set existing objects
> that it collided with.

I certainly don't disagree with that. :)

> > At any rate, until we have an actual plan for Windows, I think it would
> > make sense only to split the cases into "has working inodes" and
> > "other", and make sure "other" does something sensible in the meantime
> > (like mention the conflict, but skip trying to list duplicates).
> 
> Yes, this should be split.  Do the "easy" Linux version first.
> Keep in mind that there may also be a different solution for the Mac.

I assumed that an inode-based solution would work for Mac, since it's
mostly BSD under the hood. There may be subtleties I don't know about,
though.

-Peff