Re: [PATCH 00/30] [RFC] Path-walk API and applications

Christian Couder <christian.couder@xxxxxxxxx> · Tue, 17 Sep 2024 12:41:22 +0200

On Tue, Sep 10, 2024 at 4:29 AM Derrick Stolee via GitGitGadget
<gitgitgadget@xxxxxxxxx> wrote:
>
> This RFC is ultimately about introducing a new way to walk objects, called
> the "path-walk API" in the new path-walk.[ch] files. Before digging into the
> details of the API, let's discuss the applications which will hint at the
> API's design.
>
>
> APPLICATIONS OF THE PATH-WALK API
> =================================
>
> The applications of this API were discovered in the following order, though
> I recommend reversing the order for the priority of their actual
> implementation in future patch series:
>
>  * git backfill: a builtin to download missing blobs in a blobless partial
>    clone, done in batches and grouped by the path they appear in to maximize
>    delta compression in each batch. Allows focusing on the paths of the
>    sparse-checkout to only get the blobs necessary for history queries in
>    the current focus.

It's not very clear if this would be useful when doing a `git
sparse-checkout add`, or a `git blame` on a file path not covered by
the current sparse-checkout, or both. I think it would be clearer if
there were a few examples.

>  * git survey: Jeff Hostetler built this feature [1] as a way to get
>    functionality similar to git-sizer [2], but using the internals of Git to
>    do it faster. It also displays information not available to git-sizer,
>    like the on-disk size of objects. This RFC presents a simplified version
>    of the builtin focused on listing the paths that contribute the most to
>    the on-disk size of trees and blobs.

Not sure how `git survey` works, but `git sizer` works on a whole
repo, so, if they work in the same way, I am not sure I see what a new
path oriented way to walk repos would bring to the tool.

>  * git pack-objects --path-walk: In order to find a way to compute deltas
>    among objects of the same path, I applied the path-walk API to 'git
>    pack-objects' behind an optional flag. There are overlaps with the
>    '--sparse' option [3], [4] that can be used here. This provides perfect
>    partitioning by path name, without any possible collisions from the
>    name-hash algorithm. It also allows using the name-hash values to find
>    cross-path delta chains in a second pass.

Do you mean that the new way to walk repos allows pack-object to
perform better in the general case or maybe only in the case where
partial clone without blobs and sparse-checkout are used as described
in the git backfill related point above?

>  * git repack --full-name-hash: If we are worried about name-hsah

s/name-hsah/name-hash/

>    collisions, an easier thing to implement is a different name-hash
>    algorithm that is less likely to have collisions.

Maybe it could help to remind people a bit (perhaps in a note or using
a link) what name-hash collisions are and why we could be worried
about them.

Actually there is a "DISCUSSION OF NAME HASH" section below with
explanations, so maybe a note could just tell about that section.

> This feature was
>    already sent to the mailing list as a fully-reviewable series [5]. It is
>    included here because this series allows testing the --path-walk option
>    against the --full-name-hash.

Do you mean that the new way to walk repos can be used along with `git
repack --full-name-hash` (maybe with `git repack --full-name-hash
--path-walk` or a config option or otherwise?) and that it brings some
performance or other kind (better packs or which ones?) of
improvements?

> [1] https://github.com/microsoft/git/pull/667 [2]
> https://github.com/github/git-sizer [3]
> https://github.com/git/git/compare/5d826e972970a784bd7a7bdf587512510097b8c7...99dbbfa8ddbba2b620965d026d4ec199b8837a6f
> [4]
> https://devblogs.microsoft.com/devops/exploring-new-frontiers-for-git-push-performance/
> [5]
> https://lore.kernel.org/git/pull.1785.git.1725890210.gitgitgadget@xxxxxxxxx

It looks like adding line breaks could help make the above link list
more readable.

> TIMELINE FOR CREATING THESE APPLICATIONS IN THIS ORDER
> ======================================================
>
> Here's the story about how these applications came about: I was tasked with
> understanding why certain internal repositories were growing larger than
> expected. (Feel free to skip. Otherwise, thank you for indulging me.)
>
> I first prototyped 'git backfill' as a way to download some of the largest
> repositories without being blocked on a full clone. This batched download
> mechanism allowed me to essentially have a retryable clone, since the client
> could restart the process from scratch and skip any objects that were
> already on disk. It was natural to batch based on the path of the blobs in
> order to theoretically save time and network bandwidth due to better delta
> calculations.
>
> While investigating these repositories, I had some data hinting at the total
> size of the objects by type.

By type (blob, tree, commit, tag?) or by path? Why would the size of
the objects by type be interesting? Could trees delta poorly?

> But I was most interested in learning what
> exactly was causing this growth. I did have a hint that the "release"
> branches were taking up much more space than the default branch, since
> cloning with --single-branch resulted in ~55GB of data but then fetching the
> release branches led to an additional ~125GB of data. Using "git diff" it
> was clear that these branches stored some CHANGELOG.json and CHANGELOG.md
> files, but the diffs were relatively small (but multiple such changes were
> batched together into a single new commit). I needed to see why exactly
> these paths were taking up so much space.
>
> To this, I turned to Jeff Hostetler's new "git survey" command. This told me
> information about the total size of trees and blobs and told me the path
> name for the individual blobs that were the largest in the repository.

Nice.

> These
> paths were typically binary files that appeared only once or twice and did
> not account for the scale issues.

You mean they couldn't account for a 55GB to 125GB change in data size?

> I modified 'git survey' to use the same
> path-batching logic as in 'git backfill' to then consider each batch of
> objects and their size. Thus, the "path-walk API" was born. (This RFC
> contains a version that looks like it was created before 'git backfill'.)
>
> The repository I was looking at had a clear pattern in its top 100 file
> paths by on-disk size: 99 of them were CHANGELOG.json and CHANGELOG.md
> files. The .md files surprised me, since they were always simple appends of
> the previous .md file at the same path. The .json files were capped in how
> many versions were being listed (at least in recent versions) but the data
> it stored was harder to compress. So I went looking into how 'git push' was
> calculating these delta bases. Adding some debug information to 'git
> pack-objects' demonstrated that the previous file versions were not being
> matched as delta bases. Instead, other new blobs in the push were being used
> as delta bases.

Interesting.

> This meant that what should have been a trivial set of deltas bloated to
> 20-60 MB. (We will see later that it is possible for these to be 100-500
> KB.)
>
> Here is where I went on a little bit of a detour. (Come with me, it's
> important.) I knew that 'git pack-objects' used a name-hash to group objects
> by path, so I assumed that the reason these delta bases were not found was
> because the UNINTERESTING objects were not being added to the packing list.
> I have since discovered that this is incorrect, but I might have gotten
> stuck if I didn't think this.
>
> This seemed like a natural reason to extend the path-walk API to allow
> walking commits and tags as part of 'git pack-objects' behind a new
> '--path-walk' option. The idea here is to compute deltas among the objects
> that share a common path and then later go through the (type, name-hash,
> size) sorting system to find other delta bases across path boundaries. After
> a lot of testing, failing, and testing again, the implementation in this RFC
> finally works to achieve the goal. It's not pretty (especially with how it
> handles tags) but it gets the job done.

Yeah, if it can significantly improve the generated packfiles (without
drawbacks), it looks like a great improvement.

> In hindsight, I realized that the UNINTERESTING objects were being
> considered, but due to collisions in the name-hash algorithm these objects
> were being sorted outside of the delta computation window. For this reason,
> I thought to create a new name-hash algorithm. Thus, the --full-name-hash
> option for 'git pack-objects' and 'git repack' was born. This feature was
> split out and sent to the mailing list independently from this RFC.

So the question is "Is the problem fully solved by the new name-hash
algorithm or are there still benefits to using a new way to walk
repos?"

> RFC GOALS
> =========
>
> The goals of this RFC are:
>
>  1. To demonstrate potential applications of the path-walk API to motivate
>     its generality

Yeah, the "Path-walk API and applications" title summarizes this well.

> as these features are sent in full-quality patch series,
>     but in a different order.

I wonder if the patch series could have been separated and not all
sent in a 30 patch long series.

>  2. To communicate the discoveries found during the --path-walk and
>     --full-name-hash features in 'git pack-objects' and 'git repack'. This
>     includes comparing and contrasting the effectiveness of these features.

Thanks for communicating that.

>  3. To demonstrate the value of the path-based batching in the 'git survey'
>     feature, and to inspire others to think about what other statistics
>     would be valuable in that feature. (I anticipate that once a base is
>     established, multiple contributors will help expand its functionality
>     long into the future.)

Yeah, a better git sizer would be valuable for GitLab too and probably
everyone hosting a significant number of repos.

> RFC OUTLINE
> ===========
>
> The patches are grouped roughly by the application, in order of discovery:
>
>
> PART I: 'git backfill'
> ======================
>
> These patches introduce the 'git backfill' builtin including its
> '--batch-size' and '--sparse' options. While this is the first and simplest
> application, it is also the lowest priority in terms of user need.
>
>  * path-walk: introduce an object walk by path
>  * backfill: add builtin boilerplate
>  * backfill: basic functionality and tests
>  * backfill: add --batch-size= option
>  * backfill: add --sparse option
>  * backfill: assume --sparse when sparse-checkout is enabled

I would prefer a first patch series with all the above and the first
patch creating a technical doc called maybe
Documentation/technical/path-walk.txt which could contain a lot of
information from this RFC and perhaps technical details of how the
path-walk works and how it is different from a regular walk.

> DISCUSSION OF NAME HASH
> =======================
>
> One thing to talk about before digging into --path-walk and --full-name-hash
> features is the existing name-hash algorithm. This hash algorithm creates a
> uint32_t based on the final 16 characters of the path name, weighing the
> last characters more. There are multiple benefits to this:
>
>  1. Files of common types (.c, .txt, ...) may be grouped together.
>
>  2. Files that are renamed across directories may be grouped together.
>
> (Thanks, Junio, for making this second benefit clear.)
>
> The issue here is that some common patterns arise in repositories that use
> common path names across directories, and those files are creating name-hash
> collisions and making the sort less effective. One thing that can counteract
> these collisions is to increase the --window setting, but this significantly
> slows the delta computations.
>
> Thus, the --path-walk and --full-name-hash features both attempt to combat
> these name-hash collisions in very different ways. The --path-walk mechanism
> uses the path-walk API to consider batches of objects that all share the
> same path. This avoids storing every possible path names in memory while
> doing the object walk, but still gives nice boundaries for delta compression
> possibilities. After the path-walk is complete, the full packing list is
> still sorted via name-hash and this allows for cross-path deltas. This is
> critical!
>
> The --full-name-hash feature does the simpler choice of replacing the
> name-hash method with one that has fewer collisions, but loses the benefits
> of "nearby" paths having close hash values.

The benefits of "nearby" paths are only available with --path-walk,
not in the current way object packing works.

> This naturally leads to these two main differences in the two approaches:
>
>  1. The --full-name-hash feature is not good for 'git push' or similarly
>     small pack-files. Since it limits the delta chains to objects with the
>     same full path and loses the benefit of "nearby" paths, this feature
>     should be used for larger repacks. In my testing, 'git push' simulations
>     almost always have poor packing but 'git repack -adf' simulations have
>     packing rivaling the --path-walk option.

Interesting.

>  2. The --path-walk option changes the object ordering significantly,
>     meaning it may not ever be appropriate to combine with advanced
>     repacking features such as delta islands or even reachability bitmaps.
>     While my testing has shown that the --path-walk repacks are the most
>     efficient of all options, this limitation makes me hesitate to recommend
>     it wider than client repositories.

Might still be interesting to have it for client repos.

> One natural question to consider is to think about storing both the
> name-hash and the full-name-hash and doing two delta passes, each one
> sorting the objects by a different hash function. The first issue is that we
> don't want to store two hash values per object, as that will significantly
> increase the memory pressure during repacking.

Is one more uint32_t per object really increasing the memory pressure
so much? Isn't there a way to pack bits together to avoid that?

> This could be side-stepped by
> storing the full-name-hash in the packing list and then a second mapping
> from full-name-hash to name-hash. However, that still leads to the two
> passes taking extra time.

Isn't there a way to sort using both hash functions in a single pass?
(That is to perform a single pass and when considering an object sort
it using both hash functions before considering a different object.)

> The --path-walk approach is faster than even a
> single pass.

Is there a reason for that?

> And in the right scenarios, the --full-name-hash option is very
> close to the --path-walk results.
>
>
> ANALYSIS OF PACKING STRATEGIES
> ==============================
>
> Since I was focused on an internal monorepo that stored a large collection
> of Javascript packages, it should be no surprise that I eventually found
> other Javascript repositories that used similar tooling and thus had similar
> issues with unexplained scale problems. I'll use these four repositories as
> examples repeatedly, but one is actually public: microsoft/fluentui [6].
> I'll use Repo B, C, and D for the others, in increasing size.
>
> [6] https://github.com/microsoft/fluentui
>
> In each of these repositories, doing a full repack ('git repack -adf') right
> after cloning presents a sizeable reduction in space. This is expected for
> servers not being optimized for exactly the reachable set I'm cloning. So,
> I'll focus on how much the default repacking parameters compare to using the
> --full-name-hash or --path-walk options:
>
> | Repo     | Standard Repack | With --full-name-hash | With --path-walk |
> |----------|-----------------|-----------------------|------------------|
> | fluentui |         438 MB  |               168 MB  |          148 MB  |
> | Repo B   |       6,255 MB  |               829 MB  |          778 MB  |
> | Repo C   |      37,737 MB  |             7,125 MB  |        6,158 MB  |
> | Repo D   |     130,049 MB  |             6,190 MB  |        4,432 MB  |
>
>
> Hopefully these reductions show how much these name-hash collisions are
> causing issues in these repositories.

Yeah, right. It might be interesting to see the size reduction in % too.

> For the fluentui repo, I'm also able to share highlights from the 'git
> survey' output immediately after cloning and after repacking with the
> --path-walk option.
>
> First, we can consider the total reachable objects in each scenario:
>
> TOTAL OBJECT SIZES BY TYPE
> ================================================
> Object Type |  Count | Disk Size | Inflated Size
> ------------+--------+-----------+--------------
>     Commits |  20579 |  10443669 |      15092790
>       Trees | 276503 |  40212070 |     244429615
>       Blobs | 294500 | 635365661 |   10791187920
>
> TOTAL OBJECT SIZES BY TYPE
> ================================================
> Object Type |  Count | Disk Size | Inflated Size
> ------------+--------+-----------+--------------
>     Commits |  20579 |  10450605 |      15092790
>       Trees | 276503 |  31136263 |     244429615
>       Blobs | 294500 |  94442401 |   10791187920

Using % of size reduction or increase might help make it a bit clearer here too.

> Now, we can consider the top 10 file paths before and after repacking with
> the --path-walk feature:
>
> TOP FILES BY DISK SIZE
> ===================================================================
>                            Path | Count | Disk Size | Inflated Size
> --------------------------------+-------+-----------+--------------
>                       yarn.lock |   802 |  58060531 |     889120488
>  ...-experiments/CHANGELOG.json |   505 |  28439452 |     252723999
>  ...fabric-react/CHANGELOG.json |  1270 |  25556510 |     902756623
>  ...t-components/CHANGELOG.json |   176 |  20366365 |     244936649
>  ...act-charting/CHANGELOG.json |   590 |  20106422 |     208224460
>  ...e-components/CHANGELOG.json |   559 |  15761271 |     189061764
>  ...act-examples/CHANGELOG.json |   577 |  13615569 |     234949961
>  ...react-charting/CHANGELOG.md |   564 |  11205840 |     104337986
>  .../experiments/CHANGELOG.json |   569 |  10596377 |     123662770
>  ...i-fabric-react/CHANGELOG.md |  1263 |   8154248 |     261494258
>
>
> TOP FILES BY DISK SIZE
> ===================================================================
>                            Path | Count | Disk Size | Inflated Size
> --------------------------------+-------+-----------+--------------
>                       yarn.lock |   802 |   9909326 |     889120488
>  ...iceBrandGuide_16Sep2016.pdf |     1 |   2106334 |       2186005
>  ...out/src/images/download.jpg |     1 |   1845249 |       1846117
>  ...fluent-ui-logo-inverted.png |     3 |   1370372 |       1447493
>           .yarn/releases/cli.js |     1 |   1335657 |       6741614
>  ...c/images/fluent-ui-logo.png |     3 |   1272902 |       1341139
>  ...ages/fluent-ui-logo-dev.png |     3 |   1130989 |       1186897
>  ...nents/public/SegoeUI-VF.ttf |     1 |   1074046 |       1844524
>  ...ig/rush/npm-shrinkwrap.json |   138 |   1058531 |      89326567
>  ...Accessibility_29Sep2016.pdf |     1 |    856621 |        927268
>
>
> As we can see from this example, before the repack we are mainly seeing the
> disk space be dominated by objects that appear at paths with CHANGELOG.json
> or CHANGELOG.md, which are frequently hitting collisions with the default
> name-hash. After the repack with the --path-walk feature, we see the largest
> paths are what we expect: checked in binaries, frequently-edited yarn files,
> and other hard-to- compress data.
>
> The nice thing about this result is that we can point to files that are
> taking up space because there is no other way around it, not that the naming
> convention for the files is causing confusion during packing.

Yeah, nice result. I guess the result is the same or very similar when
the improved name-hash algorithm is used.

> WHAT TO DO NOW?
> ===============
>
> Thank you for reading this far. I hope that this has added context on the
> patch series for the --full-name-hash option, but also provided the right
> context for when I come back in a week or so with a review-ready version of
> the --path-walk option.
>
> These patches are rough and I want to make sure everyone knows that.
> Reviewing them for style or even basic organization may lead to some wasted
> time. I know that at least one of the patches got mangled and its diff does
> not match its description (I'm thinking specifically about "pack-objects:
> extract should_attempt_deltas()" but this could apply elsewhere).

Ok, I will wait for the next iteration before reviewing the patches then.

> But I think that interested parties could take my branch, build it, and give
> these features a try on their favorite repos. I'd love to hear feedback on
> the usability and effectiveness, especially if someone finds a case where
> the --path-walk option is less effective at packing data.

I might do that when reviewing the patch series later. I think it can
still be valuable for you to get some first feedback from just reading
this RFC.

Thanks.