Re: What's cooking in git.git (topics)

David Kastrup <dak@xxxxxxx> · Thu, 27 Sep 2007 08:43:14 +0200

David Kastrup <dak@xxxxxxx> writes:

> Jeff King <peff@xxxxxxxx> writes:
>
>> On Wed, Sep 26, 2007 at 01:05:59PM -0700, Junio C Hamano wrote:
>>
>>> * jk/diff-rename (Tue Sep 25 15:29:42 2007 -0400) 1 commit
>>>  + diffcore-rename: cache file deltas
>>> 
>>> Parked in 'next' for now but is 'master' material.
>>
>> My tests after this patch show that spanhash_find is responsible for
>> a large portion of the processing time in large renames, so I am going
>> to look into speeding that up.
>
> In itself, it does not look like there is all too much room for
> optimization.  One can remove the temporary pointer "optimization" and
> see whether this makes strength reduction possible for the compiler.
> Making this an endless loop wrapped around a loop on bucket might also
> help the compiler in that effect.
>
> But there is really not all too much leeway, and it might be better
> spent in the caller.  For example, the search will take something like
> r/(1-r) iterations on average where r is the fill ratio of the hash
> array.  So one would not want to, say, let r grow above 0.75 or
> something like that.

Ok, here is some suggestion:

Here is the inner loop for this stuff:

	for (i = 0; i < ssz; i++) {
		struct spanhash *s = &(src_count->data[i]);
		struct spanhash *d;
		unsigned dst_cnt, src_cnt;
		if (!s->cnt)
			continue;
		src_cnt = s->cnt;
		d = spanhash_find(dst_count, s->hashval);
		dst_cnt = d ? d->cnt : 0;
		if (src_cnt < dst_cnt) {
			la += dst_cnt - src_cnt;
			sc += src_cnt;
		}
		else
			sc += dst_cnt;
	}

Now here is how one could optimize the data structures: The hash
structures are with linear probing, and we try to find any hash
matches from source to destination.  If we sort all hashes indexed to
a given first hash bucket by their full hash value, then one could
basically use passes similar to list merges for figuring the 1:1
relations.  That cuts down the O(l n) cost (where n is the number of
elements and l their average run length) to O(n).

Of course, making l close to 1 by keeping the hash utilization
reasonably low is much simpler.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html