Re: [PATCH 0/6] PATH WALK I: The path-walk API

Derrick Stolee <stolee@xxxxxxxxx> · Mon, 4 Nov 2024 10:48:49 -0500

On 11/1/24 3:23 PM, Taylor Blau wrote:
Hi Stolee,

On Thu, Oct 31, 2024 at 06:26:57AM +0000, Derrick Stolee via GitGitGadget wrote:

Introduction and relation to prior series
=========================================

This is a new series that rerolls the initial "path-walk API" patches of my
RFC [1] "Path-walk API and applications". This new API (in path-walk.c and
path-walk.h) presents a new way to walk objects such that trees and blobs
are walked in batches according to their path.

This also replaces the previous version of ds/path-walk that was being
reviewed in [2]. The consensus was that the series was too long/dense and
could use some reduction in size. This series takes the first few patches,
but also makes some updates (which will be described later).

[1]
https://lore.kernel.org/git/pull.1786.git.1725935335.gitgitgadget@xxxxxxxxx/
[RFC] Path-walk API and applications

[2]
https://lore.kernel.org/git/pull.1813.v2.git.1729431810.gitgitgadget@xxxxxxxxx/
[PATCH v2 00/17] pack-objects: add --path-walk option for better deltas

I apologize for not having a better place to start discussing a topic
which pertains to more than just this immediate patch series, but I
figure here is as good a place as any to do so.

 From our earlier discussion, it seems to stand that the path-walk API
is fundamentally incompatible with reachability bitmaps and
delta-islands, making the series a non-starter in environments that
rely significantly one or both of those features. My understanding as a
result is that the path-walk API and feature are more targeted towards
improving client-side repacks and push performance, where neither of the
aforementioned two features are used quite as commonly.

This is correct. I would go even farther to say that this approach was
designed first and foremost for Git clients and specifically their
performance while computing a thin packfile during "git push". The same
logic to help the push case happens to also help the "git repack" case
significantly.

I was discussing this a bit off-list with Peff (who I hope will join the
thread and share his own thoughts), but I wonder if it was a mistake to
discard your '--full-name-hash' idea (or something similar, which I'll
discuss in a bit below) from earlier.

I'd be happy to resurrect that series, adding in the learnings from
working on the path-walk feature. It helps that the current series adds
the path-walk API and has no conflicting changes in the pack-objects or
repack builtins. (I can handle those conflicts as things merge down.)

(Repeating a few things that I am sure are obvious to you out loud so
that I can get a grasp on them for my own understanding):

It seems that the problems you've identified which result in poor repack
performance occur when you have files at the same path, but they get
poorly sorted in the delta selection window due to other paths having
the same final 16 characters, so Git doesn't see that much better delta
opportunities exist.

Your series takes into account the full name when hashing, which seems
to produce a clear win in many cases. I'm sure that there are some cases
where it presents a modest regression in pack sizes, but I think that's
fine and probably par for the course when making any changes like this,
as there is probably no easy silver bullet here that uniformly improves
all cases.

I suspect that you could go even further and intern the full path at
which each object occurs, and sort lexically by that. Just stringing
together all of the paths in linux.git only takes 3.099 MiB on my clone.
(Of course, that's unbounded in the number of objects and length of
their pathnames, but you could at least bound the latter by taking only
the last, say, 128 characters, which would be more than good enough for
the kernel, whose longest path is only 102 characters).

When the optimization idea is to focus on the full path and not care
about the "locality" of the path name by its later bits, storing the
full name in a list and storing an index into that list would have a
very similar effect.

I'd be interested to explore the idea of storing the full path name.
Based on my exploration with the 'test-tool name-hash' test helper in
that series, I'm not sure that we will make significant gains by doing
so. Worth trying.

Some of the repositories that you've tested on I don't have easy access
to, so I wonder if either doing (a) that, or (b) using some fancier
context-sensitive hash (like SimHash or MinHash) would be beneficial.

I don't know too much about SimHash or MinHash, but based on what I
could gather from some initial reading I'm not sure that they would be
effective without increasing the hash length. We'd also get a different
kind of locality, such as the appearance of a common word would be more
likely to affect the locality than the end of the path.

The size of the hash could probably be mitigated by storing it in the
list of all full paths and accessing them from the index stored on each
to-pack object.

I realize that this is taking us back to an idea you've already
presented to the list, but I think (to me, at least) the benefit and
simplicity of that approach has only become clear to me in hindsight
when seeing some alternatives. I would like to apologize for the time
you spent reworking this series back and forth to have the response be
"maybe we should have just done the first thing you suggested". Like I
said, I think to me it was really only clear in hindsight.

I always assumed that we'd come back to it eventually. There is also the
extra bit about making the change to the name-hash compatible with the
way name-hashes are stored in the reachability bitmaps. That will need
some work before it is ready for prime time.

In any event, the major benefit to doing --full-name-hash would be that
*all* environments could benefit from the size reduction, not just those
that don't rely on certain other features.

I disagree that all environments will prefer the --full-name-hash. I'm
currently repeating the performance tests right now, and I've added one.
The issues are:

 1. The --full-name-hash approach sometimes leads to a larger pack when
    using "git push" on the client, especially when the name-hash is
    already effective for compressing across paths.

 2. A depth 1 shallow clone cannot use previous versions of a path, so
    those situations will want to use the normal name hash. This can be
    accomplished simply by disabling the --full-name-hash option when
    the --shallow option is present; a more detailed version could be
    used to check for a large depth before disabling it. This case also
    disables bitmaps, so that isn't something to worry about.

Perhaps just --full-name-hash isn't quite as good by itself as the
--path-walk implementation that this series starts us off implementing.
So in that sense, maybe we want both, which I understand was the
original approach. I see a couple of options here:

   - We take both, because doing --path-walk on top represents a
     significant enough improvement that we are collectively OK with
     taking on more code to improve a more narrow (but common) use-case.

Doing both doesn't help at all, since the --path-walk approach already
batches by the full path name. The --path-walk approach has a significant
benefit by doing a second pass by the standard name-hash to pick up on the
cross-path deltas. This is why the --path-walk approach with the standard
name hash as consistently provided the most-compact pack-files in all
tests.

  Aside: there were some initial tests that showed the --path-walk option
  led to slightly larger packfiles, but I've since discovered that those
  cases were due to an incorrect walking of indexed paths. This is fixed
  by the code in patch 5 of the current series and my WIP patches in [3]
  have the performance numbers with this change.

[3] https://github.com/gitgitgadget/git/pull/1819
PATH WALK II: Add --path-walk option to 'git pack-objects'

   - Or we decide that either the benefit isn't significant enough to
     warrant an additional and relatively complex implementation, or in
     other words that --full-name-hash by itself is good enough.

I hope that I've sufficiently communicated that --full-name-hash is not
good enough by itself.

The point I was trying to make by submitting it first was that I believed
it was likely the easiest way for Git servers to gain 90% of the benefits
that the --path-walk approach provides while making it relatively easy to
integrate with other server-side features such as bitmaps and delta islands.

(Maybe the --path-walk approach could also be extended to be compatible
with those features, but it would be a significant investment that rebuilds
those features within the context of the new object walk instead of relying
on the existing implementations. That could easily be a blocker.)

Again, I apologize for not having a clearer picture of this all to start
with, and I want to tell you specifically and sincerely that I
appreciate your patience as I wrap my head around all of this. I think
the benefit of --full-name-hash is much clearer and appealing to me now
having had both more time and seeing the series approached in a couple
of different ways. Let me know what you think.
Thanks for taking the time to engage with the patches. I'm currently
rerunning my performance tests on a rebased copy of the --full-name-hash
patches and will submit a new version when it's ready.

Thanks,
-Stolee