Re: [PATCH/RFC] index-pack: produce pack index version 3

Shawn Pearce <spearce@xxxxxxxxxxx> writes:

> ... But I think its worth giving
> him a few weeks to finish getting the code ready, vs. rushing
> something in that someone else thinks might help. We have waited more
> than 6 years or whatever to improve packing. Colby's experiments are
> showing massive improvements (40s enumeration cut to 100ms) with low
> disk usage increase (<10%) and no pack file format changes.

No matter what you do, you would be saving the bitmap somewhere
outside the *.pack file, yes?  Will it be in some extended section
of the new *.idx file?

With the bitmap, your object enumeration phase may go very fast, and
you would be able to figure out, in response to a fetch request "I
have these tips of refs; please update me with these refs of yours",
the set of objects you would need to pack (i.e. the ones that are
reachable from your refs that were asked for, but that are not
reachable from the refs the requestor has).  
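
As a rough illustration of that set computation, here is a minimal
sketch.  It assumes a hypothetical layout in which bit i of a bitmap
is set when the i-th object in the pack is reachable from a given set
of refs; the function names and the uint64_t word layout are made up
for this example and are not taken from Colby's work.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical layout: bit i of a bitmap is set when object number i
 * (its position in the pack) is reachable from the given set of refs.
 * The objects to send are those reachable from the wanted refs but
 * not from the refs the requestor already has.
 */
static void objects_to_send(const uint64_t *want, const uint64_t *have,
			    uint64_t *out, size_t nwords)
{
	size_t i;

	for (i = 0; i < nwords; i++)
		out[i] = want[i] & ~have[i];
}

/* Count the set bits, i.e. the number of objects selected. */
static size_t count_objects(const uint64_t *bits, size_t nwords)
{
	size_t i, n = 0;

	for (i = 0; i < nwords; i++) {
		uint64_t w = bits[i];

		while (w) {
			w &= w - 1;	/* clear the lowest set bit */
			n++;
		}
	}
	return n;
}
```

The point is only that the enumeration becomes word-wise bit
operations instead of a history walk; note that the result carries no
path information for the selected objects.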

Among these objects, there will be ones that are expressed as a
delta against what you are going to send, or as a delta against what
you know the recipient must already have (if you are using thin pack
transfer) in the packfiles you have, and you can send these deltas
as-is without recomputation.

But there will be ones that are either expressed as a base in your
packfile, or as a delta against something you are not going to send
and you know that the recipient does not have.  In order to turn
these objects into deltas, it may be necessary to have a way to
record which "delta chain" each object belongs to, and if you are
introducing the mechanism to have extended sections in the new *.idx
file, that may be a good place to do so.  When you need to express
an object that your bitmap told you to send (as opposed to one that
a rev-list walk gave you together with its path), you can use that
extra piece of information to find other objects that belong to the
same "delta chain" and that you know will be available to the
recipient when it starts fixing the thin pack, and pick the optimal
delta base to encode such an object against from among them.

Just for fun, I applied the attached patch and repacked the history
leading to v1.7.12-rc3 with the default depth/window:

  git rev-list --objects --all |
  git pack-objects \
      --no-reuse-delta --no-reuse-object --delta-base-offset \
      [--no-namehash] pack

with and without the experimental --no-namehash option.  The result
is staggering.  With name-hash to group objects from the same path
close together in the delta window, the resulting pack is 33M.
Without the name-hash hint, the same pack is 163M!  Needless to say,
keeping the objects that should be hashed together inside a delta
window is really important.
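
A toy model may make the effect easier to see.  This is a sketch
under loud assumptions, not what pack-objects actually does: each
object carries a name hint, the path it really came from, and a
size; the list is sorted by hint and then by size; and we count how
many objects find another object from the same path among the
`window` entries just before them.  The struct and function names
are hypothetical.

```c
#include <stdlib.h>

struct toy_object {
	unsigned hash;	/* name hint; 0 when the hint is disabled */
	unsigned path;	/* which path the object really came from */
	unsigned size;
};

/* Order by name hint first, then by size, larger objects first. */
static int cmp_hint_then_size(const void *a_, const void *b_)
{
	const struct toy_object *a = a_, *b = b_;

	if (a->hash != b->hash)
		return a->hash < b->hash ? -1 : 1;
	if (a->size != b->size)
		return a->size > b->size ? -1 : 1;
	return 0;
}

/*
 * Sort the list in place, then count the objects that find an earlier
 * object from the same path among the `window` entries before them.
 * With use_hint == 0 every object gets the same hint, which is what
 * returning 0 from the hash function amounts to.
 */
static size_t count_window_hits(struct toy_object *list, size_t nr,
				size_t window, int use_hint)
{
	size_t i, j, hits = 0;

	if (!use_hint)
		for (i = 0; i < nr; i++)
			list[i].hash = 0;
	qsort(list, nr, sizeof(*list), cmp_hint_then_size);

	for (i = 0; i < nr; i++) {
		size_t start = i > window ? i - window : 0;

		for (j = start; j < i; j++) {
			if (list[j].path == list[i].path) {
				hits++;
				break;
			}
		}
	}
	return hits;
}
```

With three paths of three versions each and interleaved sizes, a
window of one entry finds a same-path neighbour for six of the nine
objects when the hint is used, and for none of them when it is not.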

An obvious way to record the "delta chain" is to simply keep the
name_hash of each object in the pack.  This would need 2 bytes per
object, bloating pack_idx_entry from 32 bytes to 34 bytes per
entry.  With it, after your bitmap discovers
an object that cannot reuse existing deltas, you can throw it, other
such objects with the same name-hash, and then objects that you know
will be available to the recipient (you mark the last category of
objects as "preferred base"), into the delta_list so that they are
close together in the delta window.
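
A minimal sketch of that selection step follows.  Everything in it
is hypothetical: the struct layout, the two flag bits and the
function names are invented for illustration, and limiting the
preferred bases to those that share a name-hash with an object
needing a delta is my own simplification.

```c
#include <stdint.h>
#include <stdlib.h>

struct idx_hint {
	uint32_t object_nr;		/* position of the object in the pack */
	uint16_t name_hash;		/* the 2 extra bytes per entry */
	unsigned needs_delta : 1;	/* to be sent, no reusable delta */
	unsigned preferred_base : 1;	/* recipient is known to have it */
};

static int cmp_name_hash(const void *a_, const void *b_)
{
	const struct idx_hint *a = a_, *b = b_;

	if (a->name_hash != b->name_hash)
		return a->name_hash < b->name_hash ? -1 : 1;
	/* tie-break on object number for a deterministic order */
	if (a->object_nr != b->object_nr)
		return a->object_nr < b->object_nr ? -1 : 1;
	return 0;
}

/*
 * Fill `out` (room for `nr` entries) with the objects that need a new
 * delta, plus the preferred bases that share a name hash with one of
 * them, sorted so that equal hashes sit next to each other.
 * Returns the number of entries written.  The quadratic scan is only
 * there to keep the sketch short.
 */
static size_t build_delta_list(const struct idx_hint *all, size_t nr,
			       struct idx_hint *out)
{
	size_t i, j, n = 0;

	for (i = 0; i < nr; i++) {
		int take = all[i].needs_delta;

		for (j = 0; !take && all[i].preferred_base && j < nr; j++)
			take = all[j].needs_delta &&
			       all[j].name_hash == all[i].name_hash;
		if (take)
			out[n++] = all[i];
	}
	qsort(out, n, sizeof(*out), cmp_name_hash);
	return n;
}
```

The resulting list could then be fed to the delta search in place of
the path-sorted list that a rev-list walk would have produced.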

And here is a patch to experiment with the name-hash based grouping
heuristics.

 builtin/pack-objects.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git c/builtin/pack-objects.c w/builtin/pack-objects.c
index 782e7d0..a2dd67b 100644
--- c/builtin/pack-objects.c
+++ w/builtin/pack-objects.c
@@ -67,6 +67,7 @@ static int keep_unreachable, unpack_unreachable, include_tag;
 static unsigned long unpack_unreachable_expiration;
 static int local;
 static int incremental;
+static int use_name_hash = 1;
 static int ignore_packed_keep;
 static int allow_ofs_delta;
 static struct pack_idx_option pack_idx_opts;
@@ -863,7 +864,7 @@ static unsigned name_hash(const char *name)
 {
 	unsigned c, hash = 0;
 
-	if (!name)
+	if (!name || !use_name_hash)
 		return 0;
 
 	/*
@@ -2468,6 +2469,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 			  "limit pack window by memory in addition to object limit"),
 		OPT_INTEGER(0, "depth", &depth,
 			    "maximum length of delta chain allowed in the resulting pack"),
+		OPT_BOOL(0, "namehash", &use_name_hash,
+			 "assign name hint to each path"),
 		OPT_BOOL(0, "reuse-delta", &reuse_delta,
 			 "reuse existing deltas"),
 		OPT_BOOL(0, "reuse-object", &reuse_object,
