Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Jul 08, 2016 at 01:38:55PM +0300, Kirill Smelkov wrote:
> Peff first of all thanks for feedback,
> 
> On Thu, Jul 07, 2016 at 04:52:23PM -0400, Jeff King wrote:
> > On Thu, Jul 07, 2016 at 10:09:17PM +0300, Kirill Smelkov wrote:
> > 
> > > Starting from 6b8fda2d (pack-objects: use bitmaps when packing objects)
> > > if a repository has bitmap index, pack-objects can nicely speedup
> > > "Counting objects" graph traversal phase. That however was done only for
> > > case when resultant pack is sent to stdout, not written into a file.
> > > 
> > > We can teach pack-objects to use bitmap index for initial object
> > > counting phase when generating resultant pack file too:
> > 
> > I'm not sure this is a good idea in general. When bitmaps are in use, we
> > cannot fill out the details in the object-packing list as thoroughly. In
> > particular:
> > 
> >   - we will not compute the same write order (which is based on
> >     traversal order), leading to packs that have less efficient cache
> >     characteristics
> 
> I agree the order can be not exactly the same. Still if original pack is
> packed well (with good recency order), while using bitmap we will tend
> to traverse it in close to original order.
> 
> Maybe I'm not completely right on this, but to me it looks to be the
> case because if objects in original pack are put there linearly sorted
> by recency order, and we use bitmap index to set of all reachable
> objects from a root, and then just _linearly_ gather all those objects
> from original pack by 1s in bitmap and put them in the same order into
> destination pack, the recency order won't be broken.
> 
> Or am I maybe misunderstanding something?
> 
> Please also see below:
> 
> >   - we don't learn about the filename of trees and blobs, which is going
> >     to make the delta step much less efficient. This might be mitigated
> >     by turning on the bitmap name-hash cache; I don't recall how much
> >     detail pack-objects needs on the name (i.e., the full name versus
> >     just the hash).
> 
> If I understand it right, it uses only uint32_t name hash while searching. From
> pack-objects.{h,c} :
> 
> ---- 8< ----
> struct object_entry {
> 	...
> 	uint32_t hash;                  /* name hint hash */
> 
> 
> /*
>  * We search for deltas in a list sorted by type, by filename hash, and then
>  * by size, so that we see progressively smaller and smaller files.
>  * That's because we prefer deltas to be from the bigger file
>  * to the smaller -- deletes are potentially cheaper, but perhaps
>  * more importantly, the bigger file is likely the more recent
>  * one.  The deepest deltas are therefore the oldest objects which are
>  * less susceptible to be accessed often.
>  */
> static int type_size_sort(const void *_a, const void *_b)
> {
>         const struct object_entry *a = *(struct object_entry **)_a;
>         const struct object_entry *b = *(struct object_entry **)_b;
> 
>         if (a->type > b->type)
>                 return -1;
>         if (a->type < b->type) 
>                 return 1;
>         if (a->hash > b->hash)
>                 return -1;
>         if (a->hash < b->hash)
>                 return 1;
> 	...
> ---- 8< ----
> 
> Documentation/technical/pack-heuristics.txt also confirms this:
> 
> ---- 8< ----
>     ...
>     <gitster> The quote from the above linus should be rewritten a
>         bit (wait for it):
>         - first sort by type.  Different objects never delta with
>           each other.
>         - then sort by filename/dirname.  hash of the basename
>           occupies the top BITS_PER_INT-DIR_BITS bits, and bottom
>           DIR_BITS are for the hash of leading path elements.
> 
>     ...
> 
>     If I might add, the trick is to make files that _might_ be similar be
>     located close to each other in the hash buckets based on their file
>     names.  It used to be that "foo/Makefile", "bar/baz/quux/Makefile" and
>     "Makefile" all landed in the same bucket due to their common basename,
>     "Makefile". However, now they land in "close" buckets.
>     
>     The algorithm allows not just for the _same_ bucket, but for _close_
>     buckets to be considered delta candidates.  The rationale is
>     essentially that files, like Makefiles, often have very similar
>     content no matter what directory they live in.
> ---- 8< ----
> 
> 
> So yes, exactly as you say with pack.writeBitmapHashCache=true (ae4f07fb) the
> delta-search heuristics is almost as efficient as with just raw filenames.
> 
> I can confirm this also via e.g. (with my patch applied) :
> 
> ---- 8< ----
> $ time echo 0186ac99 | git pack-objects --no-use-bitmap-index --revs erp5pack-plain
> Counting objects: 627171, done.
> Compressing objects: 100% (176949/176949), done.
> 50570987560d481742af4a8083028c2322a0534a
> Writing objects: 100% (627171/627171), done.
> Total 627171 (delta 439404), reused 594820 (delta 410210)
> 
> real    0m37.272s
> user    0m33.648s
> sys     0m1.580s
> 
> $ time echo 0186ac99 | git pack-objects --revs erp5pack-bitmap
> Counting objects: 627171, done.
> Compressing objects: 100% (176914/176914), done.
> 7c15a9b1eca1326e679297b217c5a48954625ca2
> Writing objects: 100% (627171/627171), done.
> Total 627171 (delta 439484), reused 594855 (delta 410245)
> 
> real    0m27.020s
> user    0m23.364s
> sys     0m0.992s
> 
> $ ll erp5pack-{plain,bitmap}*
> 	  17561860  erp5pack-bitmap-7c15a9b1eca1326e679297b217c5a48954625ca2.idx
> 	 238760161  erp5pack-bitmap-7c15a9b1eca1326e679297b217c5a48954625ca2.pack
> 	  17561860  erp5pack-plain-50570987560d481742af4a8083028c2322a0534a.idx
> 	 238634201  erp5pack-plain-50570987560d481742af4a8083028c2322a0534a.pack
> ---- 8< ----
> 
> ( By the way about pack generated with bitmap retaining close recency
>   order:
> 
>   ---- 8< ----
>   $ git verify-pack -v erp5pack-plain-50570987560d481742af4a8083028c2322a0534a.pack >1
>   $ git verify-pack -v erp5pack-bitmap-7c15a9b1eca1326e679297b217c5a48954625ca2.pack >2
>   $ grep commit 1 |awk '{print $1}' >1.commit
>   $ grep commit 2 |awk '{print $1}' >2.commit
>   $ wc -l 1.commit
>   46136 1.commit
>   $ wc -l 2.commit
>   46136 2.commit
>   $ diff -u0 1.commit 2.commit |wc -l
>   55
>   ---- 8< ----
>   
>   so 55/46136 shows it is very almost the same. )
> 
> 
> > There may be other subtle things, too. The general idea of tying the
> > bitmap use to pack_to_stdout is that you _do_ want to use it for
> > serving fetches and pushes, but for a full on-disk repack via gc, it's
> > more important to generate a good pack.
> 
> It is better we send good packs to clients too, right? And with
> pack.writeBitmapHashCache=true and retaining recency order (please see
> above, but again maybe I'm not completely right) to me we should be still
> generating a good pack while using bitmap reachability index for object
> graph traversal.
> 
> > Your use case:
> > 
> > > git-backup extracts many packs on repositories restoration. That was my
> > > initial motivation for the patch.
> > 
> > Seems to be somewhere in between. I'm not sure I understand how you're
> > invoking pack-objects here,
> 
> It is just
> 
>     pack-objects --revs --reuse-object --reuse-delta --delta-base-offset extractedrepo/objects/pack/pack  < SHA1-HEADS
> 
>     https://lab.nexedi.com/kirr/git-backup/blob/7fcb8c67/git-backup.go#L829
> 
> > but I wonder if you should be using "pack-objects --stdout" yourself.
> 
> I already tried --stdout. The problem is on repository extraction we
> need to both extract the pack and index it. While `pack-object file`
> does both, for --stdout case we need to additionally index extracted
> pack with `git index-pack`, and standalone `git index-pack` is very slow
> - in my experience much slower than generating the pack itself:
> 
> ---- 8< ----
> $ time echo 0186ac99 | git pack-objects --stdout --revs >erp5pack-stdout.pack
> Counting objects: 627171, done.
> Compressing objects: 100% (176914/176914), done.
> Total 627171 (delta 439484), reused 594855 (delta 410245)
> 
> real    0m22.309s
> user    0m21.148s
> sys     0m0.932s
> 
> $ ll erp5pack-stdout*
>         238760161   erp5pack-stdout.pack
> 
> $ time git index-pack erp5pack-stdout.pack
> 7c15a9b1eca1326e679297b217c5a48954625ca2
> 
> real    0m50.873s   <-- more than 2 times slower than time to generate pack itself!
> user    0m49.300s
> sys     0m1.360s
> 
> $ ll erp5pack-stdout*
>          17561860   erp5pack-stdout.idx
>         238760161   erp5pack-stdout.pack
> ---- 8< ----
> 
> So the time for
> 
>     `pack-object --stdout >file.pack` + `index-pack file.pack`  is  72s,
>     
> while
> 
>     `pack-objects file.pack` which does both pack and index     is  27s.
>     
> And even
> 
>     `pack-objects --no-use-bitmap-index file.pack`              is  37s.
> 
> 
> I've tried to briefly see why index-pack is so slow and offhand I can
> see that it needs to load all objects, decompresses them etc (maybe I'm
> not so right here - I looked only briefly), while pack-objects while
> generating the pack has all needed information directly at hand and thus
> can emit index much more easily.
> 
> For sever - clients scenario, index-pack load is put onto clients thus
> offloading server, but for my use case where extracted repository is on
> the same machine the load does not go away.
> 
> That's why for me it makes more sense to emit both pack and its index in
> one go.
> 
> Still it would be interesting to eventually see why index-pack is so
> anomaly slow.
> 
> > But even if it is the right thing for your use case to be using bitmaps
> > to generate an on-disk bitmap, I think we should be making sure it
> > _doesn't_ trigger when doing a normal repack.
> 
> So seems the way forward here is to teach pack-objects not to silently
> drop explicit --use-pack-bitmap for cases when it can handle it?
> (currently even if this option was given, for !stdout cases pack-objects
> simply drop use_bitmap_index to 0).
> 
> And to make sure default for use_bitmap_index is 0 for !stdout cases?
> 
> Or are we fine with my arguments about recency order staying the same
> when using bitmap reachability index for object graph traversal, and this
> way the patch is fine to go in as it is?

Since there is no reply I assume the safe way to go is to let default
for pack-to-file case to be "not using bitmap index". Please find updated
patch and interdiff below. I would still be grateful for feedback on
my above use-bitmap-for-pack-to-file arguments.

Thanks,
Kirill

(interdiff)
diff --git a/Documentation/config.txt b/Documentation/config.txt
index e455fae..1888f42 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -2241,12 +2241,20 @@ pack.packSizeLimit::
 	Common unit suffixes of 'k', 'm', or 'g' are
 	supported.
 
-pack.useBitmaps::
+pack.useBitmaps (deprecated)::
+	This is a deprecated synonym for `pack.useBitmaps.stdout`.
+
+pack.useBitmaps.stdout::
 	When true, git will use pack bitmaps (if available) when packing
 	to stdout (e.g., during the server side of a fetch). Defaults to
 	true. You should not generally need to turn this off unless
 	you are debugging pack bitmaps.
 
+pack.useBitmaps.file::
+	When true, git will use pack bitmaps (if available) when packing
+	to file (e.g., on repack). Defaults to false. You should not
+	generally need to turn this on unless you know what you are doing.
+
 pack.writeBitmaps (deprecated)::
 	This is a deprecated synonym for `repack.writeBitmaps`.
 
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index be0ebe8..7aaa1af 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -66,7 +66,8 @@ static struct packed_git *reuse_packfile;
 static uint32_t reuse_packfile_objects;
 static off_t reuse_packfile_offset;
 
-static int use_bitmap_index = 1;
+static int use_bitmap_stdout = 1, use_bitmap_file = 0;
+static int use_bitmap_index;
 static int write_bitmap_index;
 static uint16_t write_bitmap_options;
 
@@ -2227,8 +2228,12 @@ static int git_pack_config(const char *k, const char *v, void *cb)
 		else
 			write_bitmap_options &= ~BITMAP_OPT_HASH_CACHE;
 	}
-	if (!strcmp(k, "pack.usebitmaps")) {
-		use_bitmap_index = git_config_bool(k, v);
+	if (!strcmp(k, "pack.usebitmaps") || !strcmp(k, "pack.usebitmaps.stdout")) {
+		use_bitmap_stdout = git_config_bool(k, v);
+		return 0;
+	}
+	if (!strcmp(k, "pack.usebitmaps.file")) {
+		use_bitmap_file = git_config_bool(k, v);
 		return 0;
 	}
 	if (!strcmp(k, "pack.threads")) {
@@ -2705,6 +2710,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 
 	reset_pack_idx_option(&pack_idx_opts);
 	git_config(git_pack_config, NULL);
+	use_bitmap_index = pack_to_stdout ? use_bitmap_stdout : use_bitmap_file;
 	if (!pack_compression_seen && core_compression_seen)
 		pack_compression_level = core_compression_level;
 
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index 533fc31..9fab2bb 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -122,9 +122,14 @@ test_expect_success 'pack-objects to file can use bitmap' '
 	# make sure we still have 1 bitmap index from previous tests
 	ls .git/objects/pack/ | grep bitmap >output &&
 	test_line_count = 1 output &&
-	# pack-objects uses bitmap index by default, when it is available
-	packsha1=$(git pack-objects --all mypack </dev/null) &&
-	git verify-pack mypack-$packsha1.pack
+	# verify equivalent packs are generated with/without using bitmap index
+	packasha1=$(git pack-objects --no-use-bitmap-index --all packa </dev/null) &&
+	packbsha1=$(git pack-objects --use-bitmap-index --all packb </dev/null) &&
+	git verify-pack -v packa-$packasha1.pack >packa.verify &&
+	git verify-pack -v packb-$packbsha1.pack >packb.verify &&
+	grep -o "^$_x40" packa.verify |sort >packa.objects &&
+	grep -o "^$_x40" packb.verify |sort >packb.objects &&
+	test_cmp packa.objects packb.objects
 '
 
 test_expect_success 'full repack, reusing previous bitmaps' '


---- 8< ----
From: Kirill Smelkov <kirr@xxxxxxxxxx>
Date: Thu, 7 Jul 2016 20:12:00 +0300
Subject: [PATCH v2] pack-objects: Teach it to use reachability bitmap index when
 generating non-stdout pack too

Starting from 6b8fda2d (pack-objects: use bitmaps when packing objects)
if a repository has bitmap index, pack-objects can nicely speedup
"Counting objects" graph traversal phase. That however was done only for
case when resultant pack is sent to stdout, not written into a file.

We can teach pack-objects to use bitmap index for initial object
counting phase when generating resultant pack file too:

- if we know bitmap index generation is not enabled for resultant pack:

  Current code has singleton bitmap_git so cannot work simultaneously
  with two bitmap indices.

- if we keep pack reuse enabled still only for "send-to-stdout" case:

  Because on pack reuse raw entries are directly written out to destination
  pack by write_reused_pack() bypassing needed for pack index generation
  bookkeeping done by regular codepath in write_one() and friends.

  (at least that's my understanding after briefly looking at the code)

We also need to care and teach add_object_entry_from_bitmap() to respect
--local via not adding nonlocal loose object to resultant pack (this
is bitmap-codepath counterpart of daae0625 (pack-objects: extend --local
to mean ignore non-local loose objects too) -- not to break 'loose
objects in alternate ODB are not repacked' in t7700-repack.sh .

Otherwise all git tests pass, and for pack-objects -> file we get nice
speedup:

    erp5.git[1] (~230MB) extracted from ~ 5GB lab.nexedi.com backup
    repository managed by git-backup[2] via

    time echo 0186ac99 | git pack-objects --revs erp5pack

before:  37.2s
after:   26.2s

And for `git repack -adb` packed git.git

    time echo 5c589a73 | git pack-objects --revs gitpack

before:   7.1s
after:    3.6s

i.e. it can be 30% - 50% speedup for pack extraction.

git-backup extracts many packs on repositories restoration. That was my
initial motivation for the patch.

[1] https://lab.nexedi.com/nexedi/erp5
[2] https://lab.nexedi.com/kirr/git-backup

NOTE

Jeff King suggested that it might be not generally a good idea to
use bitmap reachability index when repacking a repository. For this
reason when packing to a file the default is not to use bitmap, while
for packing-to-stdout case the default stays to be "bitmap is used".

The defaults can be configured with

    pack.useBitmaps.stdout      (renamed from pack.useBitmaps), and
    pack.useBitmaps.file

More context:

    http://article.gmane.org/gmane.comp.version-control.git/299063
    http://article.gmane.org/gmane.comp.version-control.git/299107

Cc: Vicent Marti <tanoku@xxxxxxxxx>
Cc: Jeff King <peff@xxxxxxxx>
Signed-off-by: Kirill Smelkov <kirr@xxxxxxxxxx>
---
 Documentation/config.txt | 10 +++++++++-
 builtin/pack-objects.c   | 19 ++++++++++++++-----
 t/t5310-pack-bitmaps.sh  | 14 ++++++++++++++
 3 files changed, 37 insertions(+), 6 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index e455fae..1888f42 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -2241,12 +2241,20 @@ pack.packSizeLimit::
 	Common unit suffixes of 'k', 'm', or 'g' are
 	supported.
 
-pack.useBitmaps::
+pack.useBitmaps (deprecated)::
+	This is a deprecated synonym for `pack.useBitmaps.stdout`.
+
+pack.useBitmaps.stdout::
 	When true, git will use pack bitmaps (if available) when packing
 	to stdout (e.g., during the server side of a fetch). Defaults to
 	true. You should not generally need to turn this off unless
 	you are debugging pack bitmaps.
 
+pack.useBitmaps.file::
+	When true, git will use pack bitmaps (if available) when packing
+	to file (e.g., on repack). Defaults to false. You should not
+	generally need to turn this on unless you know what you are doing.
+
 pack.writeBitmaps (deprecated)::
 	This is a deprecated synonym for `repack.writeBitmaps`.
 
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index a2f8cfd..7aaa1af 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -66,7 +66,8 @@ static struct packed_git *reuse_packfile;
 static uint32_t reuse_packfile_objects;
 static off_t reuse_packfile_offset;
 
-static int use_bitmap_index = 1;
+static int use_bitmap_stdout = 1, use_bitmap_file = 0;
+static int use_bitmap_index;
 static int write_bitmap_index;
 static uint16_t write_bitmap_options;
 
@@ -1052,6 +1053,9 @@ static int add_object_entry_from_bitmap(const unsigned char *sha1,
 {
 	uint32_t index_pos;
 
+	if (local && has_loose_object_nonlocal(sha1))
+		return 0;
+
 	if (have_duplicate_entry(sha1, 0, &index_pos))
 		return 0;
 
@@ -2224,8 +2228,12 @@ static int git_pack_config(const char *k, const char *v, void *cb)
 		else
 			write_bitmap_options &= ~BITMAP_OPT_HASH_CACHE;
 	}
-	if (!strcmp(k, "pack.usebitmaps")) {
-		use_bitmap_index = git_config_bool(k, v);
+	if (!strcmp(k, "pack.usebitmaps") || !strcmp(k, "pack.usebitmaps.stdout")) {
+		use_bitmap_stdout = git_config_bool(k, v);
+		return 0;
+	}
+	if (!strcmp(k, "pack.usebitmaps.file")) {
+		use_bitmap_file = git_config_bool(k, v);
 		return 0;
 	}
 	if (!strcmp(k, "pack.threads")) {
@@ -2488,7 +2496,7 @@ static int get_object_list_from_bitmap(struct rev_info *revs)
 	if (prepare_bitmap_walk(revs) < 0)
 		return -1;
 
-	if (pack_options_allow_reuse() &&
+	if (pack_options_allow_reuse() && pack_to_stdout &&
 	    !reuse_partial_packfile_from_bitmap(
 			&reuse_packfile,
 			&reuse_packfile_objects,
@@ -2702,6 +2710,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 
 	reset_pack_idx_option(&pack_idx_opts);
 	git_config(git_pack_config, NULL);
+	use_bitmap_index = pack_to_stdout ? use_bitmap_stdout : use_bitmap_file;
 	if (!pack_compression_seen && core_compression_seen)
 		pack_compression_level = core_compression_level;
 
@@ -2773,7 +2782,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (!rev_list_all || !rev_list_reflog || !rev_list_index)
 		unpack_unreachable_expiration = 0;
 
-	if (!use_internal_rev_list || !pack_to_stdout || is_repository_shallow())
+	if (!use_internal_rev_list || (!pack_to_stdout && write_bitmap_index) || is_repository_shallow())
 		use_bitmap_index = 0;
 
 	if (pack_to_stdout || !rev_list_all)
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index 3893afd..9fab2bb 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -118,6 +118,20 @@ test_expect_success 'incremental repack can disable bitmaps' '
 	git repack -d --no-write-bitmap-index
 '
 
+test_expect_success 'pack-objects to file can use bitmap' '
+	# make sure we still have 1 bitmap index from previous tests
+	ls .git/objects/pack/ | grep bitmap >output &&
+	test_line_count = 1 output &&
+	# verify equivalent packs are generated with/without using bitmap index
+	packasha1=$(git pack-objects --no-use-bitmap-index --all packa </dev/null) &&
+	packbsha1=$(git pack-objects --use-bitmap-index --all packb </dev/null) &&
+	git verify-pack -v packa-$packasha1.pack >packa.verify &&
+	git verify-pack -v packb-$packbsha1.pack >packb.verify &&
+	grep -o "^$_x40" packa.verify |sort >packa.objects &&
+	grep -o "^$_x40" packb.verify |sort >packb.objects &&
+	test_cmp packa.objects packb.objects
+'
+
 test_expect_success 'full repack, reusing previous bitmaps' '
 	git repack -ad &&
 	ls .git/objects/pack/ | grep bitmap >output &&
-- 
2.9.0.431.g3cb5c84
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]