Shawn Pearce <spearce@xxxxxxxxxxx> writes: > I just spent some time bisecting this issue and it looks like the > following change by Junio may be the culprit: > > commit 1d6b38cc76c348e2477506ca9759fc241e3d0d46 > Author: Junio C Hamano <junkio@xxxxxxx> > Date: Wed Feb 22 22:10:24 2006 -0800 > > pack-objects: use full pathname to help hashing with "thin" pack. > > This uses the same hashing algorithm to the "preferred base > tree" objects and the incoming pathnames, to group the same > files from different revs together, while spreading files with > the same basename in different directories. > > Signed-off-by: Junio C Hamano <junkio@xxxxxxx> > Unfortunately, that is not the same hash we use in v1.3.0, so we need to look elsewhere for interactions. v1.2.3 hash was base-name only. doc/Makefile and t/Makefile were thrown in the same bin and sorted by size. When the history you are packing is deep, and doc/Makefile and t/Makefile are not related at all, this made effective size of delta window 1/N where N is the number of such duplicates. The one you found above uses a hash that is fully full-path. The two are in completely different bins, and bins are totally random. This was not a good strategy. v1.3.0 hash is base-name hash concatenated with leading-path has. t/Makefile and doc/Makefile go in separate bins, but the bins are close to each other; this avoids the problem in v1.2.3 when you have deep history, but at the same time if you do not have many many versions of t/Makefile to overflow the delta window, it gives t/Makefile a chance to delta with doc/Makefile. The earlier observation by Linus on reverting eeef7135 is consistent with it; that commit was the one that introduced v1.3.0 hash. You could try this patch to resurrect the hash used in v1.2.3, and you may get better packing for your particular repository; but I am not sure if it gives better results in the general case. I am running the test myself now while waiting for my day-job database to load X-<. NOTE NOTE NOTE. The hash in v1.2.3 was done with the basename (relying on rev-list --objects to only show the basename) and hashed from front to back. The current one uses the same hash scrambling function, but it hashes from back to front, and it knows rev-list --objects gives it a full path. What this patch does is to stop the hashing after we are done with the basename part. So it still gives different hash value to the same path from v1.2.3 version, but the distribution should be equivalent. NOTE 2. Feeding output from the current "rev-list --objects" to v1.2.3 pack-object is the same as "hash full path and spread things out" intermediate version, which is the worst performer. -- >8 -- git diff diff --git a/pack-objects.c b/pack-objects.c index 09f4f2c..e58e169 100644 --- a/pack-objects.c +++ b/pack-objects.c @@ -492,6 +492,8 @@ static unsigned name_hash(struct name_pa name_hash = hash; hash = 0; } + return name_hash; + for (p = path; p; p = p->up) { hash = hash * 11 + '/'; n = p->elem + p->len; - : send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html