On Sun, Oct 21, 2007 at 04:59:03PM -0700, Linus Torvalds wrote:

> What it does is to rather than iterate over all sources and destinations
> and checking if they are identical (which is O(src*dst)), it hashes each
> of the sources and destinations into a hash table, using the SHA1 hash of
> the contents as the hash. That's O(n+m). It then walks the hash table
> (which is also O(m+n) in size), and only pairs up files for comparison
> that hashed to the same spot.
>
> Doing this for more than just the exact same contents would be basically
> the same thing, except it starts hashing up fingerprints of the contents
> and linking up file pairs that get linked up by those fingerprints. More
> involved, but not impossible.

Hrm. For the inexact case, it seems like you should be able to do something like this:

  1. put the content fingerprint hashes in a hash table
  2. create a single src * dst table of scores
  3. walk the hash table; for each bucket in which there are collisions,
     find every pair in that bucket, and add to the similarity score in
     the big score table
  4. for each src in the similarity table, find the maximum dest

Step 1 is clearly O(n+m). Step 2 allocates O(n*m) memory, but we only need a single integer in each slot. The outer walk in step 3 is O(fingerprints), which is O(content_size * (n+m)). The inner loop can actually be worst case O(n*m), but is more likely to be a handful of pairs (assuming that any given fingerprint chunk exists in only a few files; you would get the worst case if you were comparing a bunch of files with identical contents). And step 4 is actually going to end up being O(n*m), since you have to find the best match for each src.

So there is actually still O(n*m) behavior, but I wonder if we are tightening the O(n*m) loop enough that we will see an improvement.

Hrm. I wonder if we could keep a "best" pointer for each src file, and when we update the score for a given src/dest pair, check for a new best. That would add more to step 3, but make step 4 O(n).

And of course the n*m memory is going to be a bottleneck. I think for 1000 * 1000 it should be fine, but if you want to try 100,000 * 100,000, you are going to need tens of gigabytes, and it is going to be very sparse. So perhaps a hash table with keys of src+dst mapped to scores. And then we could sort the whole table afterwards and just run through it linearly, which helps step 4.

> - it looks ok, and I've tested it some, but this needs more people
>   looking at it.

Overall it looks like a sane approach. My comments are below.

> - in fact, the big optimization isn't the actual hash table, but the
>   independent and much simpler "diff_filespec->used" optimization for a
>   deleted filename that was used for a rename/copy.

Do you have separate timings? The "diff_filespec->used" optimization appears to be cutting out an O(n*m) strcmp. If it is most of the optimization, I wonder if the complexity of the hash change is worth it (although I find the new code easier to read, so maybe it is worth it on those grounds alone).

> +struct file_similarity {
> +	int src_dst, index;

It took me a while to figure out all of the meanings of src_dst; maybe a comment is in order?

> +static int find_identical_files(struct file_similarity *src,
> +				struct file_similarity *dst)

Your function naming is a bit confusing. You have find_identical_files, find_same_files, and find_exact_renames, all of which do the same thing, but for different levels of input. Perhaps the names should reflect how they are different?
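(An aside, going back to the scoring idea above: here is roughly the shape I have in mind for step 3 plus the "best pointer" refinement. It is completely untested, and the structures are invented for illustration rather than taken from diffcore-rename, so read it as a sketch of the complexity argument, not a proposal.)

struct fingerprint_bucket {
	int nr;
	struct {
		int file;	/* index into the src or dst list */
		int is_src;	/* source side or destination side? */
	} *entry;
};

/*
 * Step 3: for one bucket of the fingerprint table, bump the score of
 * every src/dst pair that shares this fingerprint.  "scores" is the
 * dense n*m table from step 2 (a sparse table keyed by src+dst would
 * slot in here instead), and best_dst[src] caches the running best
 * destination so that step 4 becomes O(n).
 */
static void score_bucket(const struct fingerprint_bucket *b,
			 unsigned *scores, int n_dst, int *best_dst)
{
	int i, j;

	for (i = 0; i < b->nr; i++) {
		int src = b->entry[i].file;
		if (!b->entry[i].is_src)
			continue;
		for (j = 0; j < b->nr; j++) {
			int dst = b->entry[j].file;
			unsigned *score;

			if (b->entry[j].is_src)
				continue;
			score = &scores[src * n_dst + dst];
			(*score)++;
			/* check for a new "best" as we go */
			if (best_dst[src] < 0 ||
			    *score > scores[src * n_dst + best_dst[src]])
				best_dst[src] = dst;
		}
	}
}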
> +static int find_identical_files(struct file_similarity *src,
> +				struct file_similarity *dst)
> +{
> +	int renames = 0;
> +	do {
> +		struct diff_filespec *one = src->filespec;
> +		struct file_similarity *p, *best;
> +		int i = 100;
> +
> +		best = NULL;
> +		for (p = dst; p; p = p->next) {
> +			struct diff_filespec *two = p->filespec;
> +
> +			/* Already picked as a destination? */
> +			if (!p->src_dst)
> +				continue;
> +			/* False hash collission? */
> +			if (hashcmp(one->sha1, two->sha1))
> +				continue;
> +			best = p;
> +			if (basename_same(one, two))
> +				break;
> +
> +			/* Too many identical alternatives? Pick one */
> +			if (!--i)
> +				break;
> +		}
> +		if (best) {
> +			best->src_dst = 0;
> +			record_rename_pair(best->index, src->index, MAX_SCORE);
> +			renames++;
> +		}
> +	} while ((src = src->next) != NULL);
> +	return renames;
> +}

The "too many identical alternatives" sanity check is interesting. It can produce suboptimal results if, e.g., the 101st entry has the same basename. I suppose it is necessary to prevent a worst case "oops, all of these files are identical" O(n*m) behavior. But generally I would favor "always correct, pathological cases slow" to "always fast, pathological cases incorrect". But maybe the basename heuristic isn't worth considering "correct" in this case (and honestly, I doubt anyone will hit the 100 limit anyway).

> +/*
> + * Note: the rest of the rename logic depends on this
> + * phase also populating all the filespecs for any
> + * entry that isn't matched up with an exact rename.
> + */
> +static int free_file_table(void *ptr)
> +{
> +	struct file_similarity *p = ptr;
> +	do {
> +		struct file_similarity *entry = p;
> +		p = p->next;
> +
> +		/* Stupid special case, see note above! */
> +		diff_populate_filespec(entry->filespec, 0);
> +		free(entry);
> +	} while (p);
> +	return 0;
> +}

diff_populate_filespec knows whether or not it needs to do any work. I wonder if those parts of the rename process that need it should just call it unconditionally. It seems like a fragile dependency. Besides which...

> +static unsigned int hash_filespec(struct diff_filespec *filespec)
> +{
> +	unsigned int hash;
> +	if (!filespec->sha1_valid) {
> +		if (diff_populate_filespec(filespec, 0))
> +			return 0;
> +		hash_sha1_file(filespec->data, filespec->size, "blob", filespec->sha1);
> +	}
> +	memcpy(&hash, filespec->sha1, sizeof(hash));
> +	return hash;
> +}

...if everything has been through this hash function, why do we need to populate again upon freeing?

> +static struct hash_table_entry *lookup_hash_entry(unsigned int hash, struct hash_table *table)
> +{
> +	unsigned int size = table->size, nr = hash % size;
> +	struct hash_table_entry *array = table->array;
> +
> +	while (array[nr].ptr) {
> +		if (array[nr].hash == hash)
> +			break;
> +		nr++;
> +		if (nr >= size)
> +			nr = 0;
> +	}
> +	return array + nr;
> +}

Rather than having buckets where collisions extend in a list within the bucket, it looks like you are just overflowing into the next bucket. It seems like you could end up with pretty bad "runs" of full buckets. E.g., bucket 1 has a collision, so it bleeds into bucket 2. Now when we place something in bucket 2, it has to skip past bucket 1's overflow. And a collision in bucket 2 means we have to skip the overflow for buckets 1 _and_ 2. And so on, until we can resync by finding enough free buckets to accept all of our collisions.

So it depends on keeping a lot of holes in the hash structure, which I see that you do, but I wonder what the optimal value is.
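As an aside, the "list within the bucket" arrangement I am alluding to would look something like this (just a sketch to show the contrast, untested, and not what the patch does):

struct chained_entry {
	unsigned int hash;
	void *ptr;
	struct chained_entry *next;	/* other entries in the same bucket */
};

struct chained_table {
	unsigned int size;
	struct chained_entry **bucket;	/* "size" list heads */
};

static void *chained_lookup(unsigned int hash, struct chained_table *table)
{
	struct chained_entry *e;

	/*
	 * Collisions stay on this bucket's own list; a busy bucket
	 * never spills over into its neighbors, so there are no "runs".
	 */
	for (e = table->bucket[hash % table->size]; e; e = e->next)
		if (e->hash == hash)
			return e->ptr;
	return NULL;
}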
I assume you chose open addressing over chaining here to reduce memory fragmentation, which makes sense. And maybe there is a simple answer that "50-75% free gives good results over evenly distributed hashes." Though perhaps we should also consider clustered hashes (especially if we want to put the fingerprint hashes directly in here).

> +/*
> + * Insert a new hash entry pointer into the table.
> + *
> + * If that hash entry already existed, return the pointer to
> + * the existing entry (and the caller can create a list of the
> + * pointers or do anything else). If it didn't exist, return
> + * NULL (and the caller knows the pointer has been inserted).
> + */
> +static void **insert_hash_entry(unsigned int hash, void *ptr, struct hash_table *table)

This calling convention seems a bit clunky. Since all callers have to check for a hash collision anyway, why not just unconditionally return the pointer to the data ptr (which will itself be NULL if this is the first entry)? Then you can drop the conditional, and:

	pos = insert_hash(hash, entry, table);
	if (pos) {
		entry->next = *pos;
		*pos = entry;
	}

can simply become:

	pos = insert_hash(hash, table);
	entry->next = *pos;
	*pos = entry;

> +void **insert_hash(unsigned int hash, void *ptr, struct hash_table *table)
> +{
> +	unsigned int nr = table->nr;
> +	if (nr >= table->size/2)
> +		grow_hash_table(table);
> +	return insert_hash_entry(hash, ptr, table);
> +}

Style nitpick: the local "nr" is rather pointless.

-Peff
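P.S. To be concrete about the insert_hash suggestion: under the simplified convention I am picturing, it would look roughly like this (untested, leaning on the lookup_hash_entry and grow_hash_table helpers from your patch):

/*
 * Always return the address of the data pointer slot for this hash,
 * whether or not the slot was already occupied; the caller chains
 * into it unconditionally.
 *
 * NOTE: the caller is expected to store into the returned slot right
 * away, since an empty ->ptr is what marks a slot as free.
 */
static void **insert_hash_entry(unsigned int hash, struct hash_table *table)
{
	struct hash_table_entry *entry = lookup_hash_entry(hash, table);

	if (!entry->ptr) {
		/* first entry for this hash; claim the slot */
		entry->hash = hash;
		table->nr++;
	}
	return &entry->ptr;
}

void **insert_hash(unsigned int hash, struct hash_table *table)
{
	if (table->nr >= table->size/2)
		grow_hash_table(table);
	return insert_hash_entry(hash, table);
}

That keeps the "did it exist already" logic entirely out of the callers.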