Re: [PATCH 03/16] pack-objects: use a faster hash table

Ramkumar Ramachandra <artagnon@xxxxxxxxx> · Tue, 25 Jun 2013 23:28:12 +0530

Vicent Marti wrote:
> When replacing the old hash table implementation in `pack-objects` with
> the khash_sha1 table, the insertion time is greatly reduced:

Why?  What is the exact change?

> The big win here, however, is in the massively reduced amount of hash
> collisions

Okay, so there seems to be some problem with how collisions are
handled in the hashtable.

> -static int locate_object_entry_hash(const unsigned char *sha1)
> -{
> -       int i;
> -       unsigned int ui;
> -       memcpy(&ui, sha1, sizeof(unsigned int));
> -       i = ui % object_ix_hashsz;
> -       while (0 < object_ix[i]) {
> -               if (!hashcmp(sha1, objects[object_ix[i] - 1].idx.sha1))
> -                       return i;
> -               if (++i == object_ix_hashsz)
> -                       i = 0;
> -       }
> -       return -1 - i;
> -}

Classical chaining to handle collisions: very naive.  Deserves to be thrown out.

> -static void rehash_objects(void)
> -{
> -       uint32_t i;
> -       struct object_entry *oe;
> -
> -       object_ix_hashsz = nr_objects * 3;
> -       if (object_ix_hashsz < 1024)
> -               object_ix_hashsz = 1024;
> -       object_ix = xrealloc(object_ix, sizeof(int) * object_ix_hashsz);
> -       memset(object_ix, 0, sizeof(int) * object_ix_hashsz);
> -       for (i = 0, oe = objects; i < nr_objects; i++, oe++) {
> -               int ix = locate_object_entry_hash(oe->idx.sha1);
> -               if (0 <= ix)
> -                       continue;
> -               ix = -1 - ix;
> -               object_ix[ix] = i + 1;
> -       }
> -}

This is called when the hashtable runs out of space.  It didn't appear
in your profiler because it doesn't appear to be a bottleneck, right?
Growing aggressively to 3x times the number of objects probably
explains it.  Just for comparison, how does khash grow?

>  static struct object_entry *locate_object_entry(const unsigned char *sha1)
>  {
> -       int i;
> +       khiter_t pos = kh_get_sha1(packed_objects, sha1);
>
> -       if (!object_ix_hashsz)
> -               return NULL;
> +       if (pos < kh_end(packed_objects)) {

Wait, why is this required?  When will kh_get_sha1() return a position
beyond kh_end()?  What does that mean?

> +               return kh_value(packed_objects, pos);
> +       }
>
> -       i = locate_object_entry_hash(sha1);
> -       if (0 <= i)
> -               return &objects[object_ix[i]-1];
>         return NULL;
>  }

Overall, replaced call to locate_object_entry_hash() with a call to
kh_get_sha1().  Okay.

> -static int add_object_entry(const unsigned char *sha1, enum object_type type,
> -                           const char *name, int exclude)
> +static int add_object_entry_1(const unsigned char *sha1, enum object_type type,
> +                           uint32_t hash, int exclude, struct packed_git *found_pack,
> +                               off_t found_offset)
>  {
>         struct object_entry *entry;
> -       struct packed_git *p, *found_pack = NULL;
> -       off_t found_offset = 0;
> -       int ix;
> -       unsigned hash = name_hash(name);
> +       struct packed_git *p;
> +       khiter_t ix;
> +       int hash_ret;
>
> -       ix = nr_objects ? locate_object_entry_hash(sha1) : -1;
> -       if (ix >= 0) {
> +       ix = kh_put_sha1(packed_objects, sha1, &hash_ret);

You don't need to call locate_object_entry() to check for collisions
because kh_put_sha1() takes care of that?

> +       if (hash_ret == 0) {
>                 if (exclude) {
> -                       entry = objects + object_ix[ix] - 1;
> +                       entry = kh_value(packed_objects, ix);

Superficial change: using kh_value(), because we stripped out the
chaining logic.

> @@ -966,19 +965,30 @@ static int add_object_entry(const unsigned char *sha1, enum object_type type,
>                 entry->in_pack_offset = found_offset;
>         }
>
> -       if (object_ix_hashsz * 3 <= nr_objects * 4)
> -               rehash_objects();
> -       else
> -               object_ix[-1 - ix] = nr_objects;
> +       kh_value(packed_objects, ix) = entry;
> +       kh_key(packed_objects, ix) = entry->idx.sha1;
> +       objects[nr_objects++] = entry;

Wait, what?  Why didn't you use kh_put_sha1()?

I didn't look very carefully, but the patch seems to be okay overall.
On the issue of which hashtable replacement to use (why khash, and not
something else?), I briefly looked at linux.git's linux/hashtable.h
and git.git's hash.h; both of them are chaining hashes.  From a brief
look at khash.h, it seems to be somewhat less naive and sane: my only
concern is that it is written entirely in using CPP macros which is a
great for syntax/performance, but not-so-great for debugging.  I don't
know if there's a better off-the-shelf implementation out there, but I
haven't been looking for one either.  By the way, it's MIT license
authored by an anonymous person (sources at:
https://github.com/attractivechaos/klib/blob/master/khash.h), but I
don't know if that's a problem.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html