Jeff King <peff@xxxxxxxx> writes: > On Tue, Oct 02, 2007 at 06:31:18PM +0200, David Kastrup wrote: > >> This does not actually require an actual merge _sort_ AFAICS: do the >> "sort file.hashed" step using qsort. The comparison step does not >> actually need to produce merged output, but merely advances through >> two hash arrays and generates statistics. > > Right, that's why I used "merge" in quotes. > >> This should already beat the pants off the current implementation, >> even when the hash array is sparse, simply because our inner loop >> then has perfect hash coherence. > > Yes, I hope so. We'll see. :) > >> Getting rid of this outer O(n^2) remains an interesting challenge, >> though. One way would be the following: fill a _single_ array with >> entries containing _both_ hash and file number. Sort this, and >> then gather the statistics of hash runs by making a single pass >> through. That reduces the O(n^2) behavior to only those parts with >> actual hash collisions. > > Interesting. Care to take a stab at implementing it? I actually have worked through the last night on the day job, have urgent stuff piling up in my freelance work queue, and the next thing I need to finish for git is some smart stuff for delta packing. So it's unlikely I'll get to _that_ anytime soon. However, I had a hilarious idea on the way home that kept me rather amused (perhaps my programmer's humour is affected by sleep deprivation). I was annoyed at needing double the space because of having to keep score of both hash and file number. So I came up with a rather cute manner to avoid this: first do all files in isolation with full precision, but store the resulting list of hash as difference to the last value. When merging the data of 2^k and 2^k (or somewhat less) files, we multiply the values by two (this will not carry except for utterly improbable cases or very small data sets which we can do differently) and add one bit of identification. When we have just a single sequence remaining, undeltafying will tell us about collisions in the high bits, and the affected files in the low bits. Of course, using a merge-like algorithm means that we temporarily need double space anyway. Which takes some of the fun. -- David Kastrup, Kriemhildstr. 15, 44793 Bochum - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html