Re: [PATCH/RFC] clone: report duplicate entries on case-insensitive filesystems

Jeff Hostetler <git@xxxxxxxxxxxxxxxxx> · Fri, 3 Aug 2018 14:23:17 -0400

On 8/2/2018 5:28 PM, Jeff King wrote:
On Thu, Aug 02, 2018 at 02:14:30PM -0700, Junio C Hamano wrote:

Jeff King <peff@xxxxxxxx> writes:

I also wonder if Windows could return some other file-unique identifier
that would work in place of an inode here. That would be pretty easy to
swap in via an #ifdef's helper function. I'd be OK shipping without that
and letting Windows folks fill it in later (as long as we do not do
anything too stupid until then, like claim all of the inode==0 files are
the same).

Yeah, but such a useful file-unique identifier would probably be
used in place of inum in their (l)stat emulation already, if exists,
no?

Maybe. It might not work as ino_t. Or it might be expensive to get.  Or
maybe it's simply impossible. I don't know much about Windows. Some
searching implies that NTFS does have a "file index" concept which is
supposed to be unique.

This is hard and/or expensive on Windows.  Yes, you can get the
"file index" values for an open file handle with a cost similar to
an fstat().  Unfortunately, the FindFirst/FindNext routines (equivalent
to the opendir/readdir routines), don't give you that data.  So we'd
have to scan the directory and then open and stat each file.  This is
terribly expensive on Windows -- and the reason we have the fscache
layer (in the GfW version) to intercept the lstat() calls whenever
possible.

It might be possible to use the NTFS Master File Table to discover
this (very big handwave), but I would need to do a little digging.

This would all be NTFS specific.  FAT and other volume types would not
be covered.

Another thing to keep in mind is that the collision could be because
of case folding (or other such nonsense) on a directory in the path.
I mean, if someone on Linux builds a commit containing:

    a/b/c/D/e/foo.txt
    a/b/c/d/e/foo.txt

we'll get a similar collision as if one of them were spelled "FOO.txt".

Also, do we need to worry about hard-links or symlinks here?
If checkout populates symlinks, then you might have another collision
opportunity.  For example:

    a/b/c/D/e/foo.txt
    a/link -> ./b/c/d
    a/link/e/foo.txt

Also, some platforms (like the Mac) allow directory hard-links.
Granted, Git doesn't create hard-links during checkout, but the
user might.

I'm sure there are other edge cases here that make reporting
difficult; these are just a few I thought of.  I guess what I'm
trying to say is that as a first step just report that you found
a collision -- without trying to identify the set existing objects
that it collided with.

At any rate, until we have an actual plan for Windows, I think it would
make sense only to split the cases into "has working inodes" and
"other", and make sure "other" does something sensible in the meantime
(like mention the conflict, but skip trying to list duplicates).

Yes, this should be split.  Do the "easy" Linux version first.
Keep in mind that there may also be a different solution for the Mac.

When somebody wants to work on Windows support, then we can figure out
if it just needs to wrap the "get unique identifier" operation, or if it
would use a totally different algorithm.

-Peff

Jeff