Re: [PATCH v2] clone: report duplicate entries on case-insensitive filesystems

Jeff King <peff@xxxxxxxx> · Thu, 9 Aug 2018 10:23:34 -0400

On Wed, Aug 08, 2018 at 05:41:10PM -0700, Junio C Hamano wrote:

> > If we have an equivalence-class hashmap and feed it inodes (or again,
> > some system equivalent) as the keys, we should get buckets of
> > collisions.
> 
> I guess one way to get "some system equivalent" that can be used as
> the last resort, when there absolutely is no inum equivalent, is to
> rehash the working tree file that shouldn't be there when we detect
> a collision.
> 
> If we found that there is something when we tried to write out
> "Foo.txt", if we open "Foo.txt" on the working tree and hash-object
> it, we should find the matching blob somewhere in the index _before_
> "Foo.txt".  On a case-insensitive filesystem, it may well be
> "foo.txt", but we do not even have to know "foo.txt" and "Foo.txt"
> only differ in case.

Clever. You might still run into false positives when there is
duplicated content in the repository (especially, say, zero-length
files).  But the fact that you only do the hashing on known duplicates
helps with that.
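
To make the idea concrete, here is a rough sketch of that lookup in
Python rather than git's C. The names blob_oid and
find_colliding_paths, and the (path, oid) list standing in for the
index, are made up for illustration and are not git's actual data
structures. The only fact borrowed from git is that a blob id is the
SHA-1 of "blob <size>\0" followed by the content.

```python
import hashlib


def blob_oid(data: bytes) -> str:
    # Git blob id: SHA-1 over the header "blob <size>\0" plus the content.
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()


def find_colliding_paths(index, pos, on_disk):
    """Return earlier index paths whose blob matches what is on disk.

    index   -- hypothetical list of (path, oid) tuples in checkout order
    pos     -- position of the entry whose write hit an existing file
    on_disk -- bytes currently in the working tree under that name
    """
    oid = blob_oid(on_disk)
    # Only entries *before* pos can already have been written out.
    return [path for path, entry_oid in index[:pos] if entry_oid == oid]
```

Note that with two earlier zero-length entries, both would be
reported, which is the false-positive case mentioned above.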

One of the things I did like about the equivalence-class approach is
that it can be done in a single linear pass, even in the worst case,
whereas anything that does a search each time we see a collision is
quite likely to be quadratic. But as I said before, it may not be
worth worrying too much about that for an error code path where we
expect the number of collisions to be small.
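
For comparison, the equivalence-class pass could look roughly like
the Python sketch below. It assumes a POSIX-like system where
(st_dev, st_ino) identifies a file; inode_key and collision_groups
are illustrative names, not anything in git, and the key function is
a parameter so that "some system equivalent" can be swapped in.

```python
import os
from collections import defaultdict


def inode_key(path):
    # Assumption: (device, inode) uniquely identifies a file on this system.
    st = os.lstat(path)
    return (st.st_dev, st.st_ino)


def collision_groups(paths, key=inode_key):
    """Group paths that resolve to the same file, in one linear pass."""
    buckets = defaultdict(list)
    for path in paths:
        # Paths that landed on the same file share a key, hence a bucket.
        buckets[key(path)].append(path)
    # Any bucket with more than one path is a set of colliding names.
    return [group for group in buckets.values() if len(group) > 1]
```

Each path is stat'ed and hashed once, so the cost stays linear in the
number of entries no matter how many of them collide.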

-Peff