Re: [PATCH 0/6] PATH WALK I: The path-walk API

Jeff King <peff@xxxxxxxx> · Mon, 11 Nov 2024 16:55:02 -0500

On Mon, Nov 11, 2024 at 11:56:01AM +0900, Junio C Hamano wrote:

> > Yes. This is the downside of the --full-name-hash compared to the
> > standard name hash. When repacking an entire repository, the effect
> > of these renames is typically not important in the long run as it's
> > basically a single break in the delta chain. The downside comes in
> > when doing a small fetch or push where the rename has more impact.
> 
> Yes.  Due to --depth limit, we need to break delta chains somewhere
> anyway, and a rename boundary is just as good place as any other in
> a sufficiently long chain.

We don't necessarily have to break the chains due to depth limits,
because they are not always linear. They can end up as bushy trees,
like:

  A <- B <- C
   \
    <- D <- E
     \
      <- F

We might get that way because it mirrors the shape of history (e.g., if
D and E happened on a side branch from B and C). But we can also get
there from a linear history by choosing a delta that balances size
versus depth. In the above example, the smallest sequence might be:

  A <- B <- C <- D <- E <- F

but if we have a limit of 3 and A-B-C already exists, then we might
reject the C-D delta and choose A-D instead (or I guess if it really is
linear, probably B-D is more likely). That may be a larger delta by
itself, but it is still better than storing a full copy of the object.
And we find it by having those candidates close together in the sorted
list, reject C-D based on depth, and then moving on to the next
candidate.

I'm not sure in practice how often we find these kinds of deltas. If you
look at, say, all the deltas for "Makefile" in git.git like this:

  git rev-list --objects --all --reflog --full-history -- Makefile |
  perl -lne 'print $1 if /(.*) Makefile/' |
  git cat-file --batch-check='%(objectsize:disk) %(objectname) %(deltabase)'

my repo has 33 full copies (you can see non-deltas by grepping for
"0{40}$" in the output) out of 4473 total. So it's not like we never
break chains. But we can use graphviz to visualize it by piping the
above through:

  perl -alne '
    BEGIN { print "digraph {" }
    print "node_$F[1] [label=$F[0]]";
    print "node_$F[1] -> node_$F[2]" if $F[2] !~ /^0+$/;
    END { print "}" }
  '

and then piping it through "dot" to make an svg, or using an interactive
viewer like "xdot" (the labels are the on-disk size of each object). I
see a lot of wide parts of the graph in the output.

Of course this may all depend on packing patterns, too. I did my
investigations after running "git repack -adf" to generate what should
be a pretty reasonable pack. You might see something different from
incremental repacking over time.

I'm not sure what any of this means for --path-walk, of course. ;)
Ultimately we care about resulting size and time to compute, so if it
can do better on those metrics then it doesn't matter what the graph
looks like. But maybe those tools can help us understand where things
might go better (or worse) with various approaches.

-Peff