On Tue, Oct 02, 2007 at 06:31:18PM +0200, David Kastrup wrote:

> This does not actually require an actual merge _sort_ AFAICS: do the
> "sort file.hashed" step using qsort. The comparison step does not
> actually need to produce merged output, but merely advances through
> two hash arrays and generates statistics.

Right, that's why I used "merge" in quotes. The sort used in the O(n)
step is irrelevant, but we are doing a merge-sort-like behavior in the
second step (except instead of actually merging into a new list, we are
summarizing the comparisons in a numeric "difference" variable). But I
think we are on the same page.

> This should already beat the pants off the current implementation,
> even when the hash array is sparse, simply because our inner loop then
> has perfect hash coherence.

Yes, I hope so. We'll see. :)

> Getting rid of this outer O(n^2) remains an interesting challenge,
> though. One way would be the following: fill a _single_ array with
> entries containing _both_ hash and file number. Sort this, and then
> gather the statistics of hash runs by making a single pass through.
> That reduces the O(n^2) behavior to only those parts with actual hash
> collisions.

Interesting. Care to take a stab at implementing it?

-Peff
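
To make the first idea concrete, here is a minimal sketch of the
sorted-array "merge"-style comparison, assuming each file has already
been reduced to a qsort()ed array of (hash, count) entries. The struct
and function names (hashent, hash_difference, and so on) are invented
for illustration; this is not git's actual diffcore-delta code.

#include <stdlib.h>

struct hashent {
	unsigned int hash;	/* hash of a chunk of file content */
	unsigned int count;	/* how much material hashed to this value */
};

static int cmp_hashent(const void *a_, const void *b_)
{
	const struct hashent *a = a_, *b = b_;
	if (a->hash != b->hash)
		return a->hash < b->hash ? -1 : 1;
	return 0;
}

/* Sort one file's hash array once, as part of preprocessing that file. */
static void sort_hashes(struct hashent *ent, size_t nr)
{
	qsort(ent, nr, sizeof(*ent), cmp_hashent);
}

/*
 * Walk two sorted arrays in lockstep, like the merge step of a merge
 * sort, but instead of emitting a merged list just accumulate how much
 * material the two files do NOT have in common.  Both indices only
 * ever advance, so each array is read once, in order.
 */
static unsigned long hash_difference(const struct hashent *a, size_t na,
				     const struct hashent *b, size_t nb)
{
	size_t i = 0, j = 0;
	unsigned long diff = 0;

	while (i < na && j < nb) {
		if (a[i].hash < b[j].hash)
			diff += a[i++].count;
		else if (a[i].hash > b[j].hash)
			diff += b[j++].count;
		else {
			/* shared hash: only the count mismatch differs */
			unsigned int ca = a[i++].count;
			unsigned int cb = b[j++].count;
			diff += ca > cb ? ca - cb : cb - ca;
		}
	}
	/* whatever remains in either array is unique to that file */
	while (i < na)
		diff += a[i++].count;
	while (j < nb)
		diff += b[j++].count;
	return diff;
}

Each pairwise comparison then walks both arrays once in hash order,
which is where the hoped-for win over probing a sparse hash table
comes from.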
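
And a rough sketch of the single-array idea from the last quoted
paragraph, again with made-up names: every (hash, file, count)
occurrence from all files goes into one big array, a single sort puts
equal hashes into contiguous runs, and one pass over those runs credits
the file pairs that share material. The quadratic work is confined to
runs longer than one entry, i.e. to actual hash collisions.

#include <stdlib.h>

struct hash_occurrence {
	unsigned int hash;	/* hash of a chunk of file content */
	unsigned int file;	/* which file it came from */
	unsigned int count;	/* how much of that file hashed to it */
};

static int cmp_occurrence(const void *a_, const void *b_)
{
	const struct hash_occurrence *a = a_, *b = b_;
	if (a->hash != b->hash)
		return a->hash < b->hash ? -1 : 1;
	if (a->file != b->file)
		return a->file < b->file ? -1 : 1;
	return 0;
}

/*
 * "similarity" is assumed to be an nfiles x nfiles matrix (flattened)
 * accumulating the amount of material each pair of files shares.
 */
static void gather_statistics(struct hash_occurrence *occ, size_t nr,
			      unsigned long *similarity, size_t nfiles)
{
	size_t run_start = 0;

	qsort(occ, nr, sizeof(*occ), cmp_occurrence);

	while (run_start < nr) {
		size_t run_end = run_start + 1;
		size_t i, j;

		/* find the end of this run of identical hashes */
		while (run_end < nr && occ[run_end].hash == occ[run_start].hash)
			run_end++;

		/* credit every pair of distinct files appearing in the run */
		for (i = run_start; i < run_end; i++) {
			for (j = i + 1; j < run_end; j++) {
				unsigned int fa = occ[i].file;
				unsigned int fb = occ[j].file;
				unsigned int c = occ[i].count < occ[j].count
					       ? occ[i].count : occ[j].count;
				if (fa != fb)
					similarity[fa * nfiles + fb] += c;
			}
		}
		run_start = run_end;
	}
}

Since entries within a run are sorted by file number, only the upper
triangle of the similarity matrix gets filled here; the pairwise loop
never runs at all for a hash that occurs only once.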