Re: [PATCH v2] clone: report duplicate entries on case-insensitive filesystems

Jeff King <peff@xxxxxxxx> · Thu, 9 Aug 2018 17:34:09 -0400

On Thu, Aug 09, 2018 at 05:14:16PM -0400, Jeff Hostetler wrote:

> > Clever. You might still run into false positives when there is
> > duplicated content in the repository (especially, say, zero-length
> > files).  But the fact that you only do the hashing on known duplicates
> > helps with that.
> > 
> 
> I worry that the false positives make this a non-starter.  I mean, if
> clone creates files 'A' and 'B' (both equal) and then tries to create
> 'b', would the collision code reports that 'b' collided with 'A' because
> that was the first OID match?  Ideally with this scheme we'd have to
> search the entire index prior to 'b' and then report that 'b' collided
> with either 'A' or 'B'.  Neither message instills confidence.  And
> there's no way to prefer answer 'B' over 'A' without using knowledge
> of the FS name mangling/aliasing rules -- unless we want to just assume
> ignore-case for this iteration.

Yeah. If we can get usable unique ids (inode or otherwise) of some form
on each system, I think I prefer that. It's much easier to reason about.

> Sorry to be paranoid, but I have an index with 3.5M entries, the word
> "quadratic" rings all kinds of alarm bells for me.  :-)
> 
> Granted, we expect the number of collisions to be small, but searching
> back for each collision over the already-populated portion of the index
> could be expensive.

Heh. Yeah, it's really O(n*m), and we expect a small "m". But of course
the two are equal in the worst case.

-Peff