Re: What's cooking in git.git (topics)

David Kastrup <dak@xxxxxxx> · Tue, 02 Oct 2007 20:44:30 +0200

Jeff King <peff@xxxxxxxx> writes:

> On Tue, Oct 02, 2007 at 06:31:18PM +0200, David Kastrup wrote:
>
>> This does not actually require an actual merge _sort_ AFAICS: do the
>> "sort file.hashed" step using qsort.  The comparison step does not
>> actually need to produce merged output, but merely advances through
>> two hash arrays and generates statistics.
>
> Right, that's why I used "merge" in quotes.
>
>> This should already beat the pants off the current implementation,
>> even when the hash array is sparse, simply because our inner loop
>> then has perfect hash coherence.
>
> Yes, I hope so. We'll see. :)
>
>> Getting rid of this outer O(n^2) remains an interesting challenge,
>> though.  One way would be the following: fill a _single_ array with
>> entries containing _both_ hash and file number.  Sort this, and
>> then gather the statistics of hash runs by making a single pass
>> through.  That reduces the O(n^2) behavior to only those parts with
>> actual hash collisions.
>
> Interesting. Care to take a stab at implementing it?

I actually have worked through the last night on the day job, have
urgent stuff piling up in my freelance work queue, and the next thing
I need to finish for git is some smart stuff for delta packing.

So it's unlikely I'll get to _that_ anytime soon.  However, I had a
hilarious idea on the way home that kept me rather amused (perhaps my
programmer's humour is affected by sleep deprivation).

I was annoyed at needing double the space because of having to keep
score of both hash and file number.  So I came up with a rather cute
manner to avoid this: first do all files in isolation with full
precision, but store the resulting list of hash as difference to the
last value.  When merging the data of 2^k and 2^k (or somewhat less)
files, we multiply the values by two (this will not carry except for
utterly improbable cases or very small data sets which we can do
differently) and add one bit of identification.  When we have just a
single sequence remaining, undeltafying will tell us about collisions
in the high bits, and the affected files in the low bits.

Of course, using a merge-like algorithm means that we temporarily need
double space anyway.  Which takes some of the fun.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html