Re: Hash Tables

Thomas Rast <trast@xxxxxxxxxxxxxxx> · Thu, 6 Aug 2009 10:53:37 +0200

Philip Herron wrote:
> 
> Question still stands is the hashing function [in hash.c], which one and why?

In the spirit of teaching you to fish...

First you'll want to find out where the original users of this code
were.  So you run

  git blame -- hash.c

and see that most of the lines come from 9027f53 (Do linear-time/space
rename logic for exact renames, 2007-10-25).  So you can then look at
this commit:

  git show 9027f53c

Aha, it says

    In the expectation that we will indeed do the same hashing trick for the
    general rename case, this code uses a generic hash-table implementation
    that can be used for other things too.  In fact, we might be able to
    consolidate some of our existing hash tables with the new generic code
    in hash.[ch]

and further down in the patch

+       hash = hash_filespec(filespec);
+       pos = insert_hash(hash, entry, table);

and right above that

+static unsigned int hash_filespec(struct diff_filespec *filespec)
+{
+       unsigned int hash;
+       if (!filespec->sha1_valid) {
+               if (diff_populate_filespec(filespec, 0))
+                       return 0;
+               hash_sha1_file(filespec->data, filespec->size, "blob", filespec-
+       }
+       memcpy(&hash, filespec->sha1, sizeof(hash));
+       return hash;
+}

See?

As for the *why*: presumably because all of git assumes that two
objects with the same SHA1 have the same content.  The rename code can
then rely on that assumption again later and skip any pair whose SHA1s
differ:

+                       if (hashcmp(one->sha1, two->sha1))
+                               continue;
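
To spell that check out, here is a minimal standalone sketch.  It
assumes hashcmp() behaves like memcmp() over the 20 bytes of a SHA1
(zero means "identical"); sha1_cmp() and count_exact_matches() are
names I made up for illustration, they are not in git.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SHA1_LEN 20	/* a SHA-1 digest is 20 bytes */

/* Stand-in for git's hashcmp() (assumption): zero means identical digests. */
static int sha1_cmp(const unsigned char *a, const unsigned char *b)
{
	return memcmp(a, b, SHA1_LEN);
}

/* Hypothetical helper: count the candidates whose full SHA-1 equals target. */
static size_t count_exact_matches(const unsigned char *target,
				  const unsigned char candidates[][SHA1_LEN],
				  size_t n)
{
	size_t i, matches = 0;

	assert(target != NULL);
	for (i = 0; i < n; i++) {
		if (sha1_cmp(target, candidates[i]))
			continue;	/* digests differ, so not the same content */
		matches++;
	}
	return matches;
}
```

Two entries can land in the same hash bucket without being the same
blob, so the full 20-byte comparison is what actually decides.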

And since the SHA1 has already been computed by then, any subset of
its bits is as good a hash as anything else: it will be uniformly
distributed.  hash_filespec() above simply takes the first
sizeof(unsigned int) bytes.
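
As a rough standalone sketch of that idea (hash_from_sha1() and
bucket_for() are hypothetical names, not functions in hash.c, and the
modulo step is only my assumption about how a table might pick a slot):

```c
#include <assert.h>
#include <string.h>

#define SHA1_LEN 20	/* a SHA-1 digest is 20 bytes */

/* Hypothetical helper: reuse the leading bytes of an already-computed
 * SHA-1 as the hash-table key, as hash_filespec() does with memcpy(). */
static unsigned int hash_from_sha1(const unsigned char sha1[SHA1_LEN])
{
	unsigned int hash;

	/* memcpy avoids the alignment problems a pointer cast could have */
	memcpy(&hash, sha1, sizeof(hash));
	return hash;
}

/* Hypothetical helper: map the key to one of nbuckets slots. */
static unsigned int bucket_for(const unsigned char sha1[SHA1_LEN],
			       unsigned int nbuckets)
{
	assert(nbuckets > 0);
	return hash_from_sha1(sha1) % nbuckets;
}
```

Only the first sizeof(unsigned int) bytes feed the key, so two SHA1s
that differ only in later bytes get the same key; the key narrows the
search and the full SHA1 still has to be compared afterwards.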

-- 
Thomas Rast
trast@{inf,student}.ethz.ch