Re: 1.3.0 creating bigger packs than 1.2.3

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Shawn Pearce <spearce@xxxxxxxxxxx> writes:

> I just spent some time bisecting this issue and it looks like the
> following change by Junio may be the culprit:
>
>   commit 1d6b38cc76c348e2477506ca9759fc241e3d0d46
>   Author: Junio C Hamano <junkio@xxxxxxx>
>   Date:   Wed Feb 22 22:10:24 2006 -0800
>   
>       pack-objects: use full pathname to help hashing with "thin" pack.
>       
>       This uses the same hashing algorithm to the "preferred base
>       tree" objects and the incoming pathnames, to group the same
>       files from different revs together, while spreading files with
>       the same basename in different directories.
>       
>       Signed-off-by: Junio C Hamano <junkio@xxxxxxx>
>   

Unfortunately, that is not the same hash we use in v1.3.0, so we
need to look elsewhere for interactions.

v1.2.3 hash was base-name only.  doc/Makefile and t/Makefile
were thrown in the same bin and sorted by size.  When the
history you are packing is deep, and doc/Makefile and t/Makefile
are not related at all, this made effective size of delta window
1/N where N is the number of such duplicates.

The one you found above uses a hash that is fully full-path.
The two are in completely different bins, and bins are totally
random.  This was not a good strategy.

v1.3.0 hash is base-name hash concatenated with leading-path
has.  t/Makefile and doc/Makefile go in separate bins, but the
bins are close to each other; this avoids the problem in v1.2.3
when you have deep history, but at the same time if you do not
have many many versions of t/Makefile to overflow the delta
window, it gives t/Makefile a chance to delta with doc/Makefile.

The earlier observation by Linus on reverting eeef7135 is
consistent with it; that commit was the one that introduced
v1.3.0 hash.

You could try this patch to resurrect the hash used in v1.2.3,
and you may get better packing for your particular repository;
but I am not sure if it gives better results in the general
case.  I am running the test myself now while waiting for my
day-job database to load X-<.

NOTE NOTE NOTE.  The hash in v1.2.3 was done with the basename
(relying on rev-list --objects to only show the basename) and
hashed from front to back.  The current one uses the same hash
scrambling function, but it hashes from back to front, and it
knows rev-list --objects gives it a full path.

What this patch does is to stop the hashing after we are done
with the basename part.  So it still gives different hash value
to the same path from v1.2.3 version, but the distribution
should be equivalent.

NOTE 2.  Feeding output from the current "rev-list --objects" to
v1.2.3 pack-object is the same as "hash full path and spread
things out" intermediate version, which is the worst performer.

-- >8 --
git diff
diff --git a/pack-objects.c b/pack-objects.c
index 09f4f2c..e58e169 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -492,6 +492,8 @@ static unsigned name_hash(struct name_pa
 		name_hash = hash;
 		hash = 0;
 	}
+	return name_hash;
+
 	for (p = path; p; p = p->up) {
 		hash = hash * 11 + '/';
 		n = p->elem + p->len;



-
: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]