On Thu, Aug 09, 2018 at 05:14:16PM -0400, Jeff Hostetler wrote: > > Clever. You might still run into false positives when there is > > duplicated content in the repository (especially, say, zero-length > > files). But the fact that you only do the hashing on known duplicates > > helps with that. > > > > I worry that the false positives make this a non-starter. I mean, if > clone creates files 'A' and 'B' (both equal) and then tries to create > 'b', would the collision code reports that 'b' collided with 'A' because > that was the first OID match? Ideally with this scheme we'd have to > search the entire index prior to 'b' and then report that 'b' collided > with either 'A' or 'B'. Neither message instills confidence. And > there's no way to prefer answer 'B' over 'A' without using knowledge > of the FS name mangling/aliasing rules -- unless we want to just assume > ignore-case for this iteration. Yeah. If we can get usable unique ids (inode or otherwise) of some form on each system, I think I prefer that. It's much easier to reason about. > Sorry to be paranoid, but I have an index with 3.5M entries, the word > "quadratic" rings all kinds of alarm bells for me. :-) > > Granted, we expect the number of collisions to be small, but searching > back for each collision over the already-populated portion of the index > could be expensive. Heh. Yeah, it's really O(n*m), and we expect a small "m". But of course the two are equal in the worst case. -Peff