Ingo Molnar <mingo@xxxxxxx> writes:

> diff --git a/object.c b/object.c
> index 7e1f2bb..b3fe485 100644
> --- a/object.c
> +++ b/object.c
> @@ -91,7 +91,7 @@ struct object *lookup_object(const unsigned char *sha1)
>  static void grow_object_hash(void)
>  {
>  	int i;
> -	int new_hash_size = obj_hash_size < 32 ? 32 : 2 * obj_hash_size;
> +	int new_hash_size = obj_hash_size < 32 ? 32 : 16 * obj_hash_size;
>  	struct object **new_hash;
>
>  	new_hash = xcalloc(new_hash_size, sizeof(struct object *));
> @@ -116,7 +116,7 @@ void *create_object(const unsigned char *sha1, int type, void *o)
>  	obj->flags = 0;
>  	hashcpy(obj->sha1, sha1);
>
> -	if (obj_hash_size - 1 <= nr_objs * 2)
> +	if (obj_hash_size - 1 <= nr_objs * 16)
>  		grow_object_hash();
>
>  	insert_obj_hash(obj, obj_hash, obj_hash_size);

Shawn was telling me about this exact topic a few months ago, and I
do agree that the object hash grows too slowly when you need to
slurp in many objects.

A few random thoughts, some related and others unrelated to what
your patch does:

 - We start out from a tiny table of 32 entries.  Would it make
   things noticeably worse if we started with a larger table, for a
   workload that touches only a few dozen objects?  How about
   starting from a table with, say, 4 pages' worth of pointers?
   Note that this is not about helping the case with a near-full
   walk.

 - I agree x2 grows the table too slowly, but at the same time I do
   not think x16 grows it fast enough if you end up slurping
   millions of objects.  You would still need to rehash 5 times
   (maybe 4, I cannot count; the throwaway simulation at the end of
   this message puts rough numbers on it).  Worse yet, each later
   rehash costs proportionally more (IOW having to rehash 5 times is
   not just 25% more expensive than having to rehash 4 times).

 - If we grow the table too fast, wouldn't that make the largest
   table we could use smaller?  When we try to grow a large but
   crowded table by x2, we may be able to get enough core to rehash,
   but we may not be able to allocate x16 of such an already large
   table.

I have a hunch that the workloads that truly require holding a huge
number of objects are limited, and that we can enumerate them
relatively easily.  The callchain from gc to repack to pack-objects
is one.

The codepath in pack-objects that is called from "repack -a -d"
should be able to guess that it needs to slurp everything from the
fact that no negative ref is given to the revision walking
machinery.  It also should be able to guess whether the repository
has a large number of objects by looking at the existing pack .idx
files (they record the number of objects the corresponding .pack
files contain, in a way that is cheap to read).  It might make sense
to give an explicit hint to grow_object_hash() in such a case
(i.e. the caller, e.g. pack-objects, sets a flag for it to notice),
and have grow_object_hash() immediately jump to a huge hash size.
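If we went that route, the hint could be as small as a one-shot
pre-sizing call.  Here is an illustrative and totally untested
sketch; preallocate_object_hash() is a name I just made up, not an
existing function, while obj_hash, obj_hash_size, insert_obj_hash()
and xcalloc() are the same ones used in the hunks quoted above:

	void preallocate_object_hash(unsigned int expected)
	{
		struct object **new_hash;
		int new_hash_size = 32;
		int i;

		/* size for a load factor of at most 50%, as a power of two */
		while (new_hash_size < 2 * (int)expected)
			new_hash_size *= 2;
		if (new_hash_size <= obj_hash_size)
			return;	/* already big enough */

		new_hash = xcalloc(new_hash_size, sizeof(struct object *));

		/*
		 * Rehash the few objects we may already have, the
		 * same way grow_object_hash() does.
		 */
		for (i = 0; i < obj_hash_size; i++) {
			struct object *obj = obj_hash[i];
			if (obj)
				insert_obj_hash(obj, new_hash, new_hash_size);
		}
		free(obj_hash);
		obj_hash = new_hash;
		obj_hash_size = new_hash_size;
	}

pack-objects would sum the per-pack object counts from the .idx
files it can see and call this once before starting its walk; after
that the normal growth rule should rarely, if ever, trigger.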
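And to put rough numbers on the "how many rehashes" guess above,
here is a throwaway, standalone simulation (count_rehashes() is
mine, not anything in git); it merely replays the trigger from
create_object() and the growth rule from grow_object_hash() with the
multiplier as a parameter, so factor 2 models the current code and
factor 16 models your patch:

	#include <stdio.h>

	static int count_rehashes(long target, long factor)
	{
		long obj_hash_size = 32;	/* skip the free 0 -> 32 step */
		long nr_objs;
		int rehashes = 0;

		for (nr_objs = 0; nr_objs < target; nr_objs++) {
			/* the trigger from create_object() */
			if (obj_hash_size - 1 <= nr_objs * factor) {
				/* the growth rule from grow_object_hash() */
				obj_hash_size *= factor;
				rehashes++;
			}
		}
		return rehashes;
	}

	int main(void)
	{
		long targets[] = { 10000, 100000, 1000000, 5000000 };
		int i;

		for (i = 0; i < 4; i++)
			printf("%8ld objects: x2 rehashes %2d times, x16 %d times\n",
			       targets[i],
			       count_rehashes(targets[i], 2),
			       count_rehashes(targets[i], 16));
		return 0;
	}

The last couple of rehashes dominate the cost, which is the real
point; the exact count only depends on where the thresholds happen
to fall.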