On 8/9/2018 10:23 AM, Jeff King wrote:
On Wed, Aug 08, 2018 at 05:41:10PM -0700, Junio C Hamano wrote:
If we have an equivalence-class hashmap and feed it inodes (or again,
some system equivalent) as the keys, we should get buckets of
collisions.
I guess one way to get "some system equivalent" that can be used as
the last resort, when there absolutely is no inum equivalent, is to
rehash the working tree file that shouldn't be there when we detect
a collision.
If we found that there is something when we tried to write out
"Foo.txt", if we open "Foo.txt" on the working tree and hash-object
it, we should find the matching blob somewhere in the index _before_
"Foo.txt". On a case-insensitive filesytem, it may well be
"foo.txt", but we do not even have to know "foo.txt" and "Foo.txt"
only differ in case.
Clever. You might still run into false positives when there is
duplicated content in the repository (especially, say, zero-length
files). But the fact that you only do the hashing on known duplicates
helps with that.
I worry that the false positives make this a non-starter. I mean, if
clone creates files 'A' and 'B' (both equal) and then tries to create
'b', would the collision code reports that 'b' collided with 'A' because
that was the first OID match? Ideally with this scheme we'd have to
search the entire index prior to 'b' and then report that 'b' collided
with either 'A' or 'B'. Neither message instills confidence. And
there's no way to prefer answer 'B' over 'A' without using knowledge
of the FS name mangling/aliasing rules -- unless we want to just assume
ignore-case for this iteration.
One of the things I did like about the equivalence-class approach is
that it can be done in a single linear pass in the worst case. Whereas
anything that searches when we see a collision is quite likely to be
quadratic. But as I said before, it may not be worth worrying too much
about that for an error code path where we expect the number of
collisions to be small.
-Peff
Sorry to be paranoid, but I have an index with 3.5M entries, the word
"quadratic" rings all kinds of alarm bells for me. :-)
Granted, we expect the number of collisions to be small, but searching
back for each collision over the already-populated portion of the index
could be expensive.
Jeff