On Wed, Aug 08, 2018 at 05:41:10PM -0700, Junio C Hamano wrote:

> > If we have an equivalence-class hashmap and feed it inodes (or again,
> > some system equivalent) as the keys, we should get buckets of
> > collisions.
>
> I guess one way to get "some system equivalent" that can be used as
> the last resort, when there absolutely is no inum equivalent, is to
> rehash the working tree file that shouldn't be there when we detect
> a collision.
>
> If we found that there is something when we tried to write out
> "Foo.txt", if we open "Foo.txt" on the working tree and hash-object
> it, we should find the matching blob somewhere in the index _before_
> "Foo.txt".  On a case-insensitive filesystem, it may well be
> "foo.txt", but we do not even have to know "foo.txt" and "Foo.txt"
> only differ in case.

Clever. You might still run into false positives when there is
duplicated content in the repository (especially, say, zero-length
files). But the fact that you only do the hashing on known duplicates
helps with that.

One of the things I did like about the equivalence-class approach is
that it can be done in a single linear pass in the worst case. Whereas
anything that searches when we see a collision is quite likely to be
quadratic. But as I said before, it may not be worth worrying too much
about that for an error code path where we expect the number of
collisions to be small.

-Peff
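
To make the single-pass idea concrete, here is a rough sketch in Python rather than git's C. It is only an illustration of the equivalence-class approach, not code from git. The names collision_classes() and inode_key() are made up for this example, and it assumes the platform reports a meaningful (st_dev, st_ino) pair for each checked-out path.

```python
import os
from collections import defaultdict


def inode_key(path):
    """Hypothetical key function: (device, inode) of the path itself.

    Uses lstat() so a symlink is identified by its own inode rather
    than by the inode of whatever it points to.
    """
    st = os.lstat(path)
    return (st.st_dev, st.st_ino)


def collision_classes(paths, key=inode_key):
    """Group paths by key in one linear pass.

    Returns only the buckets holding more than one path, i.e. the
    sets of index entries that ended up as the same file on disk.
    """
    buckets = defaultdict(list)
    for path in paths:
        buckets[key(path)].append(path)
    return [group for group in buckets.values() if len(group) > 1]
```

Each path is stat'ed and inserted exactly once, so the cost stays linear in the number of entries no matter how many of them collide. The key function is a parameter so that some other "system equivalent" could be swapped in where inode numbers are not available.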