Re: [PATCH] lookup_object(): Speed up 'git gc' by 12%, by reducing hash chain length

Ingo Molnar <mingo@xxxxxxx> writes:

> diff --git a/object.c b/object.c
> index 7e1f2bb..b3fe485 100644
> --- a/object.c
> +++ b/object.c
> @@ -91,7 +91,7 @@ struct object *lookup_object(const unsigned char *sha1)
>  static void grow_object_hash(void)
>  {
>  	int i;
> -	int new_hash_size = obj_hash_size < 32 ? 32 : 2 * obj_hash_size;
> +	int new_hash_size = obj_hash_size < 32 ? 32 : 16 * obj_hash_size;
>  	struct object **new_hash;
>  
>  	new_hash = xcalloc(new_hash_size, sizeof(struct object *));
> @@ -116,7 +116,7 @@ void *create_object(const unsigned char *sha1, int type, void *o)
>  	obj->flags = 0;
>  	hashcpy(obj->sha1, sha1);
>  
> -	if (obj_hash_size - 1 <= nr_objs * 2)
> +	if (obj_hash_size - 1 <= nr_objs * 16)
>  		grow_object_hash();
>  
>  	insert_obj_hash(obj, obj_hash, obj_hash_size);

Shawn was telling me about this exact topic a few months ago, and I do
agree that the object hash grows too slowly when you need to slurp in
many objects.

A few random thoughts, some related and some unrelated to what your
patch does:

 - We start out from a tiny table of 32 entries.  Would it make things
   noticeably worse for a workload that touches only a few dozen objects
   if we started with a larger table, say one with 4 pages' worth of
   pointers?  Note that this is not about helping the near full-walk
   case.  (A rough sketch follows this list.)

 - I agree x2 grows the table too slowly, but at the same time I do not
   think x16 grows it fast enough if you end up slurping millions of
   objects.  You would still need to rehash 5 times (maybe 4, I cannot
   count).

   Worse yet, the later rehashes cost proportionally more (IOW having to
   rehash 5 times is not just 25% more expensive than having to rehash 4
   times).

 - If we grow the table too fast, wouldn't it make the largest table we
   could use smaller?  When we try to grow a large but crowded table by
   x2, we may be able to get enough core to rehash, but we may not be
   able to allocate 16 times the size of an already large table.

I have this hunch that the workloads that truly require holding a huge
number of objects are limited, and we can enumerate them relatively
easily.  The callchain from gc to repack to pack-objects is one.  The
codepath in pack-objects that is called from "repack -a -d" should be
able to guess that it needs to slurp everything from the fact that no
negative ref is given to the revision walking machinery.  It should also
be able to guess whether the repository has a large number of objects by
looking at the existing pack .idx files (they record the number of
objects the corresponding .pack files contain in a way that is cheap to
read).
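
Getting that estimate should be cheap with what we already have.  A
rough sketch (the function name is made up; the packed_git list,
open_pack_index() and num_objects are what cache.h and sha1_file.c
already give us):

#include "cache.h"

/* sum the object counts recorded in the .idx of each known pack */
static unsigned long estimate_packed_object_count(void)
{
	struct packed_git *p;
	unsigned long nr = 0;

	prepare_packed_git();
	for (p = packed_git; p; p = p->next) {
		if (open_pack_index(p))
			continue;	/* cannot read this .idx; skip it */
		nr += p->num_objects;
	}
	return nr;
}

That is more or less what count-objects already does to report its
in-pack number, so it should not cost anything noticeable.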

It might make sense to give an explicit hint to grow_object_hash() in
such a case (i.e. the caller, e.g. pack-objects, sets a flag for it to
notice), and have grow_object_hash() immediately jump to a huge hash
size.
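
Something like this, perhaps.  None of these names exist today;
obj_hash, obj_hash_size and insert_obj_hash() are the statics already in
object.c, so this would have to live there, and the 50% load-factor
target is arbitrary:

/* set by callers that know they are about to slurp a lot of objects */
static unsigned long obj_hash_hint;

void set_object_hash_hint(unsigned long expected)
{
	obj_hash_hint = expected;
}

static void grow_object_hash(void)
{
	int i;
	int new_hash_size = obj_hash_size < 32 ? 32 : 2 * obj_hash_size;
	struct object **new_hash;

	/*
	 * If the caller told us to expect many objects, jump straight
	 * to a table that keeps us at roughly 50% load, instead of
	 * rehashing our way up one doubling at a time.
	 */
	while ((unsigned long)new_hash_size < 2 * obj_hash_hint)
		new_hash_size *= 2;

	new_hash = xcalloc(new_hash_size, sizeof(struct object *));
	for (i = 0; i < obj_hash_size; i++) {
		struct object *obj = obj_hash[i];
		if (obj)
			insert_obj_hash(obj, new_hash, new_hash_size);
	}
	free(obj_hash);
	obj_hash = new_hash;
	obj_hash_size = new_hash_size;
}

Then pack-objects, when it sees no negative refs, could do something
like set_object_hash_hint(estimate_packed_object_count()) before
starting the walk, independently of how we tweak the growth factor.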