Re: [PATCH v5] pack-objects mru

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Aug 11, 2016 at 05:20:30AM -0400, Jeff King wrote:

> Here it is. It ended up needing a few preparatory patches.
> 
>   [1/4]: provide an initializer for "struct object_info"
>   [2/4]: sha1_file: make packed_object_info public
>   [3/4]: pack-objects: break delta cycles before delta-search phase
>   [4/4]: pack-objects: use mru list when iterating over packs
> 
> I had originally intended to include an extra patch to handle the
> --depth limits better. But after writing it, I'm not sure it's actually
> a good idea. I'll post it as an addendum with more discussion.

And here's the depth patch. It does work, as you can see by running the
snippet at the bottom of the commit message.

But I began to wonder if it's actually a desirable thing. For instance,
if you do:

  git gc --aggressive
  ... time passes ...
  git gc

the first gc may generate chains up to 250 objects. When the second gc
runs (and you may not even run it yourself; it might be "gc --auto"!),
it will generally reuse most of your existing deltas, even though the
default depth is only 50.

But with the patch below, it will drop deltas from the middle of those
long chains, undoing the prior --aggressive results. Worse, we don't
find new deltas for those objects, because it falls afoul of the "they
are in the same pack, so don't bother looking for a delta" rule.

An --aggressive repack of my git.git is 52MB. Repacking that with the
patch below and "--depth=50" bumps it to 55MB. Dumping the "do not
bother" condition in try_delta() drops that to 54MB.

So it _is_ worse for the space to drop those high-depth deltas. Even if
we fixed the "do not bother" (e.g., by recording a bit that says "even
though these are in the same pack, try anyway, because we had to break
the delta for other reasons"), it's still a loss.

OTOH, I am not altogether convinced that the tradeoff of a giant --depth
is worth. I'm looking only at the space here, but deeper delta chains
cost more CPU to access (especially if they start to exceed the delta
cache size). And the space savings aren't that amazing. Doing a "git
repack -adf --depth=50 --window=250" (i.e., if aggressive had just
tweaked the window size and not the depth in the first place), the
result is only 53MB.

So considering "--depth" as a space-saving measure for --aggressive does
not seem that effective. But it feels weird to quietly drop actions
people might have done with previous aggressive runs.

-- >8 --
Subject: [PATCH] pack-objects: enforce --depth limit in reused deltas

Since 898b14c (pack-objects: rework check_delta_limit usage,
2007-04-16), we check the delta depth limit only when
figuring out whether we should make a new delta. We don't
consider it at all when reusing deltas, which means that
packing once with --depth=50, and then against with
--depth=10, the second pack my still contain chains larger
than 10.

This is probably not a huge deal, as it is limited to
whatever chains you happened to create in a previous run.
But as we start allowing cross-pack delta reuse in a future
commit, this maximum will rise to the number of packs times
the per-pack depth (in the worst case; on average, it will
likely be much smaller).

We can easily detect this as part of the existing search for
cycles, since we visit every node in a depth-first way. That
lets us compute the depth of any node based on the depth of
its base, because we know the base is DFS_DONE by the time
we look at it (modulo any cycles in the graph, but we know
there cannot be any because we break them as we see them).

There is some subtlety worth mentioning, though. We record
the depth of each object as we compute it. It might seem
like we could save the per-object storage space by just
keeping track of the depth of our traversal (i.e., have
break_delta_chains() report how deep it went). But we may
visit an object through multiple delta paths, and on
subsequent paths we want to know its depth immediately,
without having to walk back down to its final base (doing so
would make our graph walk quadratic rather than linear).

Likewise, one could try to record the depth not from the
base, but from our starting point (i.e., start
recursion_depth at 0, and pass "recursion_depth + 1" to each
invocation of break_delta_chains()). And then when
recursion_depth gets too big, we know that we must cut the
delta chain.  But that technique is wrong if we do not visit
the nodes in topological order. In a chain A->B->C, it
if we visit "C", then "B", then "A", we will never recurse
deeper than 1 link (because we see at each node that we have
already visited it).

Unfortunately there is no automated test, because it's hard
to convince pack-objects to reliably produce delta chains.
Naively, it would seem that a sequence of ever-increasing
blobs would work. E.g., something like:

  for i in 1 2 3 4 5; do
          test-genrandom $i 4096 >>file
          git add file
          git commit -m $i
  done

where a reasonable set of deltas would use "1:file" as the
base, then "2:file" as a delta against that, "3:file" as a
delta against "2:file", and so on, until we have a chain
with length 5.

But the delta search is much more fickle than that. It tends
to prefer deletions to additions (because they are smaller
than additions), so it prefers "5:file" as the base, and
then the deltas just say "remove N bytes from the end".
Moreover, the delta search has heuristics that penalize
increasing depth. So packing the script above actually ends
up with 2 chains of length 2.

So I've punted on adding an automated test. One can see the
effect on a real-world repository by repacking it with a
small --depth value, like:

  max_depth() {
    for i in .git/objects/pack/*.pack; do
      git verify-pack -v $i
    done |
    perl -lne '
      /chain length = (\d+)/ or next;
      $max = $1 if $1 > $max;
      END { print $max }
    '
  }

  echo "before: $(max_depth)"
  git repack -ad --depth=5
  echo "after: $(max_depth)"

Signed-off-by: Jeff King <peff@xxxxxxxx>
---
 builtin/pack-objects.c | 15 +++++++++++++++
 pack-objects.h         |  4 ++++
 2 files changed, 19 insertions(+)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 32c1dba..d8132a4 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1505,6 +1505,8 @@ static int pack_offset_sort(const void *_a, const void *_b)
  *   2. Updating our size/type to the non-delta representation. These were
  *      either not recorded initially (size) or overwritten with the delta type
  *      (type) when check_object() decided to reuse the delta.
+ *
+ *   3. Resetting our delta depth, as we are now a base object.
  */
 static void drop_reused_delta(struct object_entry *entry)
 {
@@ -1518,6 +1520,7 @@ static void drop_reused_delta(struct object_entry *entry)
 			p = &(*p)->delta_sibling;
 	}
 	entry->delta = NULL;
+	entry->depth = 0;
 
 	oi.sizep = &entry->size;
 	oi.typep = &entry->type;
@@ -1536,6 +1539,9 @@ static void drop_reused_delta(struct object_entry *entry)
  * Follow the chain of deltas from this entry onward, throwing away any links
  * that cause us to hit a cycle (as determined by the DFS state flags in
  * the entries).
+ *
+ * We also detect too-long reused chains that would violate our --depth
+ * limit.
  */
 static void break_delta_chains(struct object_entry *entry)
 {
@@ -1553,6 +1559,15 @@ static void break_delta_chains(struct object_entry *entry)
 		 */
 		entry->dfs_state = DFS_ACTIVE;
 		break_delta_chains(entry->delta);
+
+		/*
+		 * Once we've recursed, our base knows its depth, so we can
+		 * compute ours (and check it against the limit).
+		 */
+		entry->depth = entry->delta->depth + 1;
+		if (entry->depth > depth)
+			drop_reused_delta(entry);
+
 		entry->dfs_state = DFS_DONE;
 		break;
 
diff --git a/pack-objects.h b/pack-objects.h
index cc9b9a9..03f1191 100644
--- a/pack-objects.h
+++ b/pack-objects.h
@@ -30,12 +30,16 @@ struct object_entry {
 
 	/*
 	 * State flags for depth-first search used for analyzing delta cycles.
+	 *
+	 * The depth is measured in delta-links to the base (so if A is a delta
+	 * against B, then A has a depth of 1, and B a depth of 0).
 	 */
 	enum {
 		DFS_NONE = 0,
 		DFS_ACTIVE,
 		DFS_DONE
 	} dfs_state;
+	int depth;
 };
 
 struct packing_data {
-- 
2.9.2.790.gaa5bc72

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]