[PATCH 00/30] [RFC] Path-walk API and applications

"Derrick Stolee via GitGitGadget" <gitgitgadget@xxxxxxxxx> · Tue, 10 Sep 2024 02:28:25 +0000

This RFC is ultimately about introducing a new way to walk objects, called
the "path-walk API" in the new path-walk.[ch] files. Before digging into the
details of the API, let's discuss the applications which will hint at the
API's design.

APPLICATIONS OF THE PATH-WALK API
=================================

The applications of this API were discovered in the following order, though
I recommend reversing the order for the priority of their actual
implementation in future patch series:

 * git backfill: a builtin to download missing blobs in a blobless partial
   clone, done in batches and grouped by the path they appear in to maximize
   delta compression in each batch. Allows focusing on the paths of the
   sparse-checkout to only get the blobs necessary for history queries in
   the current focus.

 * git survey: Jeff Hostetler built this feature [1] as a way to get
   functionality similar to git-sizer [2], but using the internals of Git to
   do it faster. It also displays information not available to git-sizer,
   like the on-disk size of objects. This RFC presents a simplified version
   of the builtin focused on listing the paths that contribute the most to
   the on-disk size of trees and blobs.

 * git pack-objects --path-walk: In order to find a way to compute deltas
   among objects of the same path, I applied the path-walk API to 'git
   pack-objects' behind an optional flag. There are overlaps with the
   '--sparse' option [3], [4] that can be used here. This provides perfect
   partitioning by path name, without any possible collisions from the
   name-hash algorithm. It also allows using the name-hash values to find
   cross-path delta chains in a second pass.

 * git repack --full-name-hash: If we are worried about name-hsah
   collisions, an easier thing to implement is a different name-hash
   algorithm that is less likely to have collisions. This feature was
   already sent to the mailing list as a fully-reviewable series [5]. It is
   included here because this series allows testing the --path-walk option
   against the --full-name-hash.

[1] https://github.com/microsoft/git/pull/667 [2]
https://github.com/github/git-sizer [3]
https://github.com/git/git/compare/5d826e972970a784bd7a7bdf587512510097b8c7...99dbbfa8ddbba2b620965d026d4ec199b8837a6f
[4]
https://devblogs.microsoft.com/devops/exploring-new-frontiers-for-git-push-performance/
[5]
https://lore.kernel.org/git/pull.1785.git.1725890210.gitgitgadget@xxxxxxxxx

TIMELINE FOR CREATING THESE APPLICATIONS IN THIS ORDER
======================================================

Here's the story about how these applications came about: I was tasked with
understanding why certain internal repositories were growing larger than
expected. (Feel free to skip. Otherwise, thank you for indulging me.)

I first prototyped 'git backfill' as a way to download some of the largest
repositories without being blocked on a full clone. This batched download
mechanism allowed me to essentially have a retryable clone, since the client
could restart the process from scratch and skip any objects that were
already on disk. It was natural to batch based on the path of the blobs in
order to theoretically save time and network bandwidth due to better delta
calculations.

While investigating these repositories, I had some data hinting at the total
size of the objects by type. But I was most interested in learning what
exactly was causing this growth. I did have a hint that the "release"
branches were taking up much more space than the default branch, since
cloning with --single-branch resulted in ~55GB of data but then fetching the
release branches led to an additional ~125GB of data. Using "git diff" it
was clear that these branches stored some CHANGELOG.json and CHANGELOG.md
files, but the diffs were relatively small (but multiple such changes were
batched together into a single new commit). I needed to see why exactly
these paths were taking up so much space.

To this, I turned to Jeff Hostetler's new "git survey" command. This told me
information about the total size of trees and blobs and told me the path
name for the individual blobs that were the largest in the repository. These
paths were typically binary files that appeared only once or twice and did
not account for the scale issues. I modified 'git survey' to use the same
path-batching logic as in 'git backfill' to then consider each batch of
objects and their size. Thus, the "path-walk API" was born. (This RFC
contains a version that looks like it was created before 'git backfill'.)

The repository I was looking at had a clear pattern in its top 100 file
paths by on-disk size: 99 of them were CHANGELOG.json and CHANGELOG.md
files. The .md files surprised me, since they were always simple appends of
the previous .md file at the same path. The .json files were capped in how
many versions were being listed (at least in recent versions) but the data
it stored was harder to compress. So I went looking into how 'git push' was
calculating these delta bases. Adding some debug information to 'git
pack-objects' demonstrated that the previous file versions were not being
matched as delta bases. Instead, other new blobs in the push were being used
as delta bases.

This meant that what should have been a trivial set of deltas bloated to
20-60 MB. (We will see later that it is possible for these to be 100-500
KB.)

Here is where I went on a little bit of a detour. (Come with me, it's
important.) I knew that 'git pack-objects' used a name-hash to group objects
by path, so I assumed that the reason these delta bases were not found was
because the UNINTERESTING objects were not being added to the packing list.
I have since discovered that this is incorrect, but I might have gotten
stuck if I didn't think this.

This seemed like a natural reason to extend the path-walk API to allow
walking commits and tags as part of 'git pack-objects' behind a new
'--path-walk' option. The idea here is to compute deltas among the objects
that share a common path and then later go through the (type, name-hash,
size) sorting system to find other delta bases across path boundaries. After
a lot of testing, failing, and testing again, the implementation in this RFC
finally works to achieve the goal. It's not pretty (especially with how it
handles tags) but it gets the job done.

In hindsight, I realized that the UNINTERESTING objects were being
considered, but due to collisions in the name-hash algorithm these objects
were being sorted outside of the delta computation window. For this reason,
I thought to create a new name-hash algorithm. Thus, the --full-name-hash
option for 'git pack-objects' and 'git repack' was born. This feature was
split out and sent to the mailing list independently from this RFC.

RFC GOALS
=========

The goals of this RFC are:

 1. To demonstrate potential applications of the path-walk API to motivate
    its generality as these features are sent in full-quality patch series,
    but in a different order.

 2. To communicate the discoveries found during the --path-walk and
    --full-name-hash features in 'git pack-objects' and 'git repack'. This
    includes comparing and contrasting the effectiveness of these features.

 3. To demonstrate the value of the path-based batching in the 'git survey'
    feature, and to inspire others to think about what other statistics
    would be valuable in that feature. (I anticipate that once a base is
    established, multiple contributors will help expand its functionality
    long into the future.)

RFC OUTLINE
===========

The patches are grouped roughly by the application, in order of discovery:

PART I: 'git backfill'
======================

These patches introduce the 'git backfill' builtin including its
'--batch-size' and '--sparse' options. While this is the first and simplest
application, it is also the lowest priority in terms of user need.

 * path-walk: introduce an object walk by path
 * backfill: add builtin boilerplate
 * backfill: basic functionality and tests
 * backfill: add --batch-size= option
 * backfill: add --sparse option
 * backfill: assume --sparse when sparse-checkout is enabled

PART II: 'git survey'
=====================

These patches reimplement a subset of the functionality of 'git survey' as
well as generalize some of the data structures that Jeff's implementation
made. The flexibility hopefully comes through to show the potential for
future extensions. These patches are quite rough and will need more
attention before they can be sent for full review.

 * path-walk: allow consumer to specify object types
 * path-walk: allow visiting tags
 * survey: stub in new experimental git-survey command
 * survey: add command line opts to select references
 * survey: collect the set of requested refs
 * survey: start pretty printing data in table form
 * survey: add object count summary
 * survey: summarize total sizes by object type
 * survey: show progress during object walk
 * survey: add ability to track prioritized lists
 * survey: add report of "largest" paths

PART III: 'git pack-objects --path-walk'
========================================

Here is where I think the meat of the RFC really lies. There are still some
rough edges, but the data will show that 'git pack-objects --path-walk' has
the potential to be an extremely effective way to pack objects. (Caveats
will come later in the analysis section.)

 * revision: create mark_trees_uninteresting_dense()
 * path-walk: add prune_all_uninteresting option
 * pack-objects: add --path-walk option
 * pack-objects: extract should_attempt_deltas()
 * pack-objects: introduce GIT_TEST_PACK_PATH_WALK
 * p5313: add size comparison test
 * repack: add --path-walk option
 * pack-objects: enable --path-walk via config
 * scalar: enable path-walk during push via config

One obvious issue with this current implementation is that it inlines much
of the delta calculation into the "Enumerate objects" phase, and thus makes
it single-threaded. This should be fixed before being considered for full
review, but even without threading this version can out-perform the standard
packing strategy in terms of end-to-end packing time!

PART IV: 'git repack --full-name-hash'
======================================

This is a simplified version of the patch series that was split out by
itself earlier for full review. This was split out on its own partly because
it doesn't actually use the path-walk API. This has benefits and drawbacks,
but it seems like a quick win for many scenarios.

 * pack-objects: add --full-name-hash option
 * test-name-hash: add helper to compute name-hash functions
 * p5314: add a size test for name-hash collisions
 * pack-objects: output debug info about deltas

This last patch is an add-on of the debugging information that I used to
discover issues with delta bases during 'git push'.

DISCUSSION OF NAME HASH
=======================

One thing to talk about before digging into --path-walk and --full-name-hash
features is the existing name-hash algorithm. This hash algorithm creates a
uint32_t based on the final 16 characters of the path name, weighing the
last characters more. There are multiple benefits to this:

 1. Files of common types (.c, .txt, ...) may be grouped together.

 2. Files that are renamed across directories may be grouped together.

(Thanks, Junio, for making this second benefit clear.)

The issue here is that some common patterns arise in repositories that use
common path names across directories, and those files are creating name-hash
collisions and making the sort less effective. One thing that can counteract
these collisions is to increase the --window setting, but this significantly
slows the delta computations.

Thus, the --path-walk and --full-name-hash features both attempt to combat
these name-hash collisions in very different ways. The --path-walk mechanism
uses the path-walk API to consider batches of objects that all share the
same path. This avoids storing every possible path names in memory while
doing the object walk, but still gives nice boundaries for delta compression
possibilities. After the path-walk is complete, the full packing list is
still sorted via name-hash and this allows for cross-path deltas. This is
critical!

The --full-name-hash feature does the simpler choice of replacing the
name-hash method with one that has fewer collisions, but loses the benefits
of "nearby" paths having close hash values.

This naturally leads to these two main differences in the two approaches:

 1. The --full-name-hash feature is not good for 'git push' or similarly
    small pack-files. Since it limits the delta chains to objects with the
    same full path and loses the benefit of "nearby" paths, this feature
    should be used for larger repacks. In my testing, 'git push' simulations
    almost always have poor packing but 'git repack -adf' simulations have
    packing rivaling the --path-walk option.

 2. The --path-walk option changes the object ordering significantly,
    meaning it may not ever be appropriate to combine with advanced
    repacking features such as delta islands or even reachability bitmaps.
    While my testing has shown that the --path-walk repacks are the most
    efficient of all options, this limitation makes me hesitate to recommend
    it wider than client repositories.

One natural question to consider is to think about storing both the
name-hash and the full-name-hash and doing two delta passes, each one
sorting the objects by a different hash function. The first issue is that we
don't want to store two hash values per object, as that will significantly
increase the memory pressure during repacking. This could be side-stepped by
storing the full-name-hash in the packing list and then a second mapping
from full-name-hash to name-hash. However, that still leads to the two
passes taking extra time. The --path-walk approach is faster than even a
single pass. And in the right scenarios, the --full-name-hash option is very
close to the --path-walk results.

ANALYSIS OF PACKING STRATEGIES
==============================

Since I was focused on an internal monorepo that stored a large collection
of Javascript packages, it should be no surprise that I eventually found
other Javascript repositories that used similar tooling and thus had similar
issues with unexplained scale problems. I'll use these four repositories as
examples repeatedly, but one is actually public: microsoft/fluentui [6].
I'll use Repo B, C, and D for the others, in increasing size.

[6] https://github.com/microsoft/fluentui

In each of these repositories, doing a full repack ('git repack -adf') right
after cloning presents a sizeable reduction in space. This is expected for
servers not being optimized for exactly the reachable set I'm cloning. So,
I'll focus on how much the default repacking parameters compare to using the
--full-name-hash or --path-walk options:

| Repo     | Standard Repack | With --full-name-hash | With --path-walk |
|----------|-----------------|-----------------------|------------------|
| fluentui |         438 MB  |               168 MB  |          148 MB  |
| Repo B   |       6,255 MB  |               829 MB  |          778 MB  |
| Repo C   |      37,737 MB  |             7,125 MB  |        6,158 MB  |
| Repo D   |     130,049 MB  |             6,190 MB  |        4,432 MB  |

Hopefully these reductions show how much these name-hash collisions are
causing issues in these repositories.

For the fluentui repo, I'm also able to share highlights from the 'git
survey' output immediately after cloning and after repacking with the
--path-walk option.

First, we can consider the total reachable objects in each scenario:

TOTAL OBJECT SIZES BY TYPE
================================================
Object Type |  Count | Disk Size | Inflated Size
------------+--------+-----------+--------------
    Commits |  20579 |  10443669 |      15092790
      Trees | 276503 |  40212070 |     244429615
      Blobs | 294500 | 635365661 |   10791187920

TOTAL OBJECT SIZES BY TYPE
================================================
Object Type |  Count | Disk Size | Inflated Size
------------+--------+-----------+--------------
    Commits |  20579 |  10450605 |      15092790
      Trees | 276503 |  31136263 |     244429615
      Blobs | 294500 |  94442401 |   10791187920

Now, we can consider the top 10 file paths before and after repacking with
the --path-walk feature:

TOP FILES BY DISK SIZE
===================================================================
                           Path | Count | Disk Size | Inflated Size
--------------------------------+-------+-----------+--------------
                      yarn.lock |   802 |  58060531 |     889120488
 ...-experiments/CHANGELOG.json |   505 |  28439452 |     252723999
 ...fabric-react/CHANGELOG.json |  1270 |  25556510 |     902756623
 ...t-components/CHANGELOG.json |   176 |  20366365 |     244936649
 ...act-charting/CHANGELOG.json |   590 |  20106422 |     208224460
 ...e-components/CHANGELOG.json |   559 |  15761271 |     189061764
 ...act-examples/CHANGELOG.json |   577 |  13615569 |     234949961
 ...react-charting/CHANGELOG.md |   564 |  11205840 |     104337986
 .../experiments/CHANGELOG.json |   569 |  10596377 |     123662770
 ...i-fabric-react/CHANGELOG.md |  1263 |   8154248 |     261494258

TOP FILES BY DISK SIZE
===================================================================
                           Path | Count | Disk Size | Inflated Size
--------------------------------+-------+-----------+--------------
                      yarn.lock |   802 |   9909326 |     889120488
 ...iceBrandGuide_16Sep2016.pdf |     1 |   2106334 |       2186005
 ...out/src/images/download.jpg |     1 |   1845249 |       1846117
 ...fluent-ui-logo-inverted.png |     3 |   1370372 |       1447493
          .yarn/releases/cli.js |     1 |   1335657 |       6741614
 ...c/images/fluent-ui-logo.png |     3 |   1272902 |       1341139
 ...ages/fluent-ui-logo-dev.png |     3 |   1130989 |       1186897
 ...nents/public/SegoeUI-VF.ttf |     1 |   1074046 |       1844524
 ...ig/rush/npm-shrinkwrap.json |   138 |   1058531 |      89326567
 ...Accessibility_29Sep2016.pdf |     1 |    856621 |        927268

As we can see from this example, before the repack we are mainly seeing the
disk space be dominated by objects that appear at paths with CHANGELOG.json
or CHANGELOG.md, which are frequently hitting collisions with the default
name-hash. After the repack with the --path-walk feature, we see the largest
paths are what we expect: checked in binaries, frequently-edited yarn files,
and other hard-to- compress data.

The nice thing about this result is that we can point to files that are
taking up space because there is no other way around it, not that the naming
convention for the files is causing confusion during packing.

WHAT TO DO NOW?
===============

Thank you for reading this far. I hope that this has added context on the
patch series for the --full-name-hash option, but also provided the right
context for when I come back in a week or so with a review-ready version of
the --path-walk option.

These patches are rough and I want to make sure everyone knows that.
Reviewing them for style or even basic organization may lead to some wasted
time. I know that at least one of the patches got mangled and its diff does
not match its description (I'm thinking specifically about "pack-objects:
extract should_attempt_deltas()" but this could apply elsewhere).

But I think that interested parties could take my branch, build it, and give
these features a try on their favorite repos. I'd love to hear feedback on
the usability and effectiveness, especially if someone finds a case where
the --path-walk option is less effective at packing data.

There are two things going on since I started working on this series that
could cause issues with applying this series on top of current 'master' or
with topics in 'seen':

 * The most obvious one is the collision with the --full-name-hash feature
   under review. The in-review series takes precedent.

 * The unused parameter warnings are now errors! I'm glad that got in, but I
   wasn't careful about it in these RFC patches.

 * John Cai is proposing to change the builtin interface to have builtin
   methods take a repository pointer. I approve of this effort, but it
   collides with the two new builtins introduced here. As discussed above,
   the 'git survey' and 'git backfill' builtins are the lowest priority
   items for getting merged, in my opinion, so they can wait until that
   change has been made.

I look forward to any and all comments.

Thanks! -Stolee

Derrick Stolee (27):
  path-walk: introduce an object walk by path
  backfill: add builtin boilerplate
  backfill: basic functionality and tests
  backfill: add --batch-size=<n> option
  backfill: add --sparse option
  backfill: assume --sparse when sparse-checkout is enabled
  path-walk: allow consumer to specify object types
  path-walk: allow visiting tags
  survey: start pretty printing data in table form
  survey: add object count summary
  survey: summarize total sizes by object type
  survey: show progress during object walk
  survey: add ability to track prioritized lists
  survey: add report of "largest" paths
  revision: create mark_trees_uninteresting_dense()
  path-walk: add prune_all_uninteresting option
  pack-objects: add --path-walk option
  pack-objects: extract should_attempt_deltas()
  pack-objects: introduce GIT_TEST_PACK_PATH_WALK
  p5313: add size comparison test
  repack: add --path-walk option
  pack-objects: enable --path-walk via config
  scalar: enable path-walk during push via config
  pack-objects: add --full-name-hash option
  test-name-hash: add helper to compute name-hash functions
  p5314: add a size test for name-hash collisions
  pack-objects: output debug info about deltas

Jeff Hostetler (3):
  survey: stub in new experimental `git-survey` command
  survey: add command line opts to select references
  survey: collect the set of requested refs

 .gitignore                     |   2 +
 Documentation/config/pack.txt  |   8 +
 Documentation/git-backfill.txt |  60 ++
 Documentation/git-survey.txt   |  70 +++
 Makefile                       |   4 +
 builtin.h                      |   2 +
 builtin/backfill.c             | 141 +++++
 builtin/pack-objects.c         | 298 ++++++++--
 builtin/repack.c               |  10 +
 builtin/survey.c               | 965 +++++++++++++++++++++++++++++++++
 command-list.txt               |   2 +
 git.c                          |   2 +
 pack-objects.h                 |  20 +
 path-walk.c                    | 401 ++++++++++++++
 path-walk.h                    |  73 +++
 repo-settings.c                |   3 +
 repository.h                   |   1 +
 revision.c                     |  15 +
 revision.h                     |   1 +
 scalar.c                       |   1 +
 t/README                       |   4 +
 t/helper/test-name-hash.c      |  23 +
 t/helper/test-tool.c           |   1 +
 t/helper/test-tool.h           |   1 +
 t/perf/p5313-pack-objects.sh   | 101 ++++
 t/perf/p5314-name-hash.sh      |  41 ++
 t/t5620-backfill.sh            | 181 +++++++
 t/t8100-git-survey.sh          |  76 +++
 28 files changed, 2469 insertions(+), 38 deletions(-)
 create mode 100644 Documentation/git-backfill.txt
 create mode 100644 Documentation/git-survey.txt
 create mode 100644 builtin/backfill.c
 create mode 100644 builtin/survey.c
 create mode 100644 path-walk.c
 create mode 100644 path-walk.h
 create mode 100644 t/helper/test-name-hash.c
 create mode 100755 t/perf/p5313-pack-objects.sh
 create mode 100755 t/perf/p5314-name-hash.sh
 create mode 100755 t/t5620-backfill.sh
 create mode 100755 t/t8100-git-survey.sh

base-commit: 17d4b10aea6bda2027047a0e3548a6f8ad667dde
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1786%2Fderrickstolee%2Fpath-walk-rfc-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1786/derrickstolee/path-walk-rfc-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1786
-- 
gitgitgadget