On Tue, Sep 10, 2024 at 4:29 AM Derrick Stolee via GitGitGadget <gitgitgadget@xxxxxxxxx> wrote: > > This RFC is ultimately about introducing a new way to walk objects, called > the "path-walk API" in the new path-walk.[ch] files. Before digging into the > details of the API, let's discuss the applications which will hint at the > API's design. > > > APPLICATIONS OF THE PATH-WALK API > ================================= > > The applications of this API were discovered in the following order, though > I recommend reversing the order for the priority of their actual > implementation in future patch series: > > * git backfill: a builtin to download missing blobs in a blobless partial > clone, done in batches and grouped by the path they appear in to maximize > delta compression in each batch. Allows focusing on the paths of the > sparse-checkout to only get the blobs necessary for history queries in > the current focus. It's not very clear if this would be useful when doing a `git sparse-checkout add`, or a `git blame` on a file path not covered by the current sparse-checkout, or both. I think it would be clearer if there were a few examples. > * git survey: Jeff Hostetler built this feature [1] as a way to get > functionality similar to git-sizer [2], but using the internals of Git to > do it faster. It also displays information not available to git-sizer, > like the on-disk size of objects. This RFC presents a simplified version > of the builtin focused on listing the paths that contribute the most to > the on-disk size of trees and blobs. Not sure how `git survey` works, but `git sizer` works on a whole repo, so, if they work in the same way, I am not sure I see what a new path oriented way to walk repos would bring to the tool. > * git pack-objects --path-walk: In order to find a way to compute deltas > among objects of the same path, I applied the path-walk API to 'git > pack-objects' behind an optional flag. There are overlaps with the > '--sparse' option [3], [4] that can be used here. This provides perfect > partitioning by path name, without any possible collisions from the > name-hash algorithm. It also allows using the name-hash values to find > cross-path delta chains in a second pass. Do you mean that the new way to walk repos allows pack-object to perform better in the general case or maybe only in the case where partial clone without blobs and sparse-checkout are used as described in the git backfill related point above? > * git repack --full-name-hash: If we are worried about name-hsah s/name-hsah/name-hash/ > collisions, an easier thing to implement is a different name-hash > algorithm that is less likely to have collisions. Maybe it could help to remind people a bit (perhaps in a note or using a link) what name-hash collisions are and why we could be worried about them. Actually there is a "DISCUSSION OF NAME HASH" section below with explanations, so maybe a note could just tell about that section. > This feature was > already sent to the mailing list as a fully-reviewable series [5]. It is > included here because this series allows testing the --path-walk option > against the --full-name-hash. Do you mean that the new way to walk repos can be used along with `git repack --full-name-hash` (maybe with `git repack --full-name-hash --path-walk` or a config option or otherwise?) and that it brings some performance or other kind (better packs or which ones?) of improvements? > [1] https://github.com/microsoft/git/pull/667 [2] > https://github.com/github/git-sizer [3] > https://github.com/git/git/compare/5d826e972970a784bd7a7bdf587512510097b8c7...99dbbfa8ddbba2b620965d026d4ec199b8837a6f > [4] > https://devblogs.microsoft.com/devops/exploring-new-frontiers-for-git-push-performance/ > [5] > https://lore.kernel.org/git/pull.1785.git.1725890210.gitgitgadget@xxxxxxxxx It looks like adding line breaks could help make the above link list more readable. > TIMELINE FOR CREATING THESE APPLICATIONS IN THIS ORDER > ====================================================== > > Here's the story about how these applications came about: I was tasked with > understanding why certain internal repositories were growing larger than > expected. (Feel free to skip. Otherwise, thank you for indulging me.) > > I first prototyped 'git backfill' as a way to download some of the largest > repositories without being blocked on a full clone. This batched download > mechanism allowed me to essentially have a retryable clone, since the client > could restart the process from scratch and skip any objects that were > already on disk. It was natural to batch based on the path of the blobs in > order to theoretically save time and network bandwidth due to better delta > calculations. > > While investigating these repositories, I had some data hinting at the total > size of the objects by type. By type (blob, tree, commit, tag?) or by path? Why would the size of the objects by type be interesting? Could trees delta poorly? > But I was most interested in learning what > exactly was causing this growth. I did have a hint that the "release" > branches were taking up much more space than the default branch, since > cloning with --single-branch resulted in ~55GB of data but then fetching the > release branches led to an additional ~125GB of data. Using "git diff" it > was clear that these branches stored some CHANGELOG.json and CHANGELOG.md > files, but the diffs were relatively small (but multiple such changes were > batched together into a single new commit). I needed to see why exactly > these paths were taking up so much space. > > To this, I turned to Jeff Hostetler's new "git survey" command. This told me > information about the total size of trees and blobs and told me the path > name for the individual blobs that were the largest in the repository. Nice. > These > paths were typically binary files that appeared only once or twice and did > not account for the scale issues. You mean they couldn't account for a 55GB to 125GB change in data size? > I modified 'git survey' to use the same > path-batching logic as in 'git backfill' to then consider each batch of > objects and their size. Thus, the "path-walk API" was born. (This RFC > contains a version that looks like it was created before 'git backfill'.) > > The repository I was looking at had a clear pattern in its top 100 file > paths by on-disk size: 99 of them were CHANGELOG.json and CHANGELOG.md > files. The .md files surprised me, since they were always simple appends of > the previous .md file at the same path. The .json files were capped in how > many versions were being listed (at least in recent versions) but the data > it stored was harder to compress. So I went looking into how 'git push' was > calculating these delta bases. Adding some debug information to 'git > pack-objects' demonstrated that the previous file versions were not being > matched as delta bases. Instead, other new blobs in the push were being used > as delta bases. Interesting. > This meant that what should have been a trivial set of deltas bloated to > 20-60 MB. (We will see later that it is possible for these to be 100-500 > KB.) > > Here is where I went on a little bit of a detour. (Come with me, it's > important.) I knew that 'git pack-objects' used a name-hash to group objects > by path, so I assumed that the reason these delta bases were not found was > because the UNINTERESTING objects were not being added to the packing list. > I have since discovered that this is incorrect, but I might have gotten > stuck if I didn't think this. > > This seemed like a natural reason to extend the path-walk API to allow > walking commits and tags as part of 'git pack-objects' behind a new > '--path-walk' option. The idea here is to compute deltas among the objects > that share a common path and then later go through the (type, name-hash, > size) sorting system to find other delta bases across path boundaries. After > a lot of testing, failing, and testing again, the implementation in this RFC > finally works to achieve the goal. It's not pretty (especially with how it > handles tags) but it gets the job done. Yeah, if it can significantly improve the generated packfiles (without drawbacks), it looks like a great improvement. > In hindsight, I realized that the UNINTERESTING objects were being > considered, but due to collisions in the name-hash algorithm these objects > were being sorted outside of the delta computation window. For this reason, > I thought to create a new name-hash algorithm. Thus, the --full-name-hash > option for 'git pack-objects' and 'git repack' was born. This feature was > split out and sent to the mailing list independently from this RFC. So the question is "Is the problem fully solved by the new name-hash algorithm or are there still benefits to using a new way to walk repos?" > RFC GOALS > ========= > > The goals of this RFC are: > > 1. To demonstrate potential applications of the path-walk API to motivate > its generality Yeah, the "Path-walk API and applications" title summarizes this well. > as these features are sent in full-quality patch series, > but in a different order. I wonder if the patch series could have been separated and not all sent in a 30 patch long series. > 2. To communicate the discoveries found during the --path-walk and > --full-name-hash features in 'git pack-objects' and 'git repack'. This > includes comparing and contrasting the effectiveness of these features. Thanks for communicating that. > 3. To demonstrate the value of the path-based batching in the 'git survey' > feature, and to inspire others to think about what other statistics > would be valuable in that feature. (I anticipate that once a base is > established, multiple contributors will help expand its functionality > long into the future.) Yeah, a better git sizer would be valuable for GitLab too and probably everyone hosting a significant number of repos. > RFC OUTLINE > =========== > > The patches are grouped roughly by the application, in order of discovery: > > > PART I: 'git backfill' > ====================== > > These patches introduce the 'git backfill' builtin including its > '--batch-size' and '--sparse' options. While this is the first and simplest > application, it is also the lowest priority in terms of user need. > > * path-walk: introduce an object walk by path > * backfill: add builtin boilerplate > * backfill: basic functionality and tests > * backfill: add --batch-size= option > * backfill: add --sparse option > * backfill: assume --sparse when sparse-checkout is enabled I would prefer a first patch series with all the above and the first patch creating a technical doc called maybe Documentation/technical/path-walk.txt which could contain a lot of information from this RFC and perhaps technical details of how the path-walk works and how it is different from a regular walk. > DISCUSSION OF NAME HASH > ======================= > > One thing to talk about before digging into --path-walk and --full-name-hash > features is the existing name-hash algorithm. This hash algorithm creates a > uint32_t based on the final 16 characters of the path name, weighing the > last characters more. There are multiple benefits to this: > > 1. Files of common types (.c, .txt, ...) may be grouped together. > > 2. Files that are renamed across directories may be grouped together. > > (Thanks, Junio, for making this second benefit clear.) > > The issue here is that some common patterns arise in repositories that use > common path names across directories, and those files are creating name-hash > collisions and making the sort less effective. One thing that can counteract > these collisions is to increase the --window setting, but this significantly > slows the delta computations. > > Thus, the --path-walk and --full-name-hash features both attempt to combat > these name-hash collisions in very different ways. The --path-walk mechanism > uses the path-walk API to consider batches of objects that all share the > same path. This avoids storing every possible path names in memory while > doing the object walk, but still gives nice boundaries for delta compression > possibilities. After the path-walk is complete, the full packing list is > still sorted via name-hash and this allows for cross-path deltas. This is > critical! > > The --full-name-hash feature does the simpler choice of replacing the > name-hash method with one that has fewer collisions, but loses the benefits > of "nearby" paths having close hash values. The benefits of "nearby" paths are only available with --path-walk, not in the current way object packing works. > This naturally leads to these two main differences in the two approaches: > > 1. The --full-name-hash feature is not good for 'git push' or similarly > small pack-files. Since it limits the delta chains to objects with the > same full path and loses the benefit of "nearby" paths, this feature > should be used for larger repacks. In my testing, 'git push' simulations > almost always have poor packing but 'git repack -adf' simulations have > packing rivaling the --path-walk option. Interesting. > 2. The --path-walk option changes the object ordering significantly, > meaning it may not ever be appropriate to combine with advanced > repacking features such as delta islands or even reachability bitmaps. > While my testing has shown that the --path-walk repacks are the most > efficient of all options, this limitation makes me hesitate to recommend > it wider than client repositories. Might still be interesting to have it for client repos. > One natural question to consider is to think about storing both the > name-hash and the full-name-hash and doing two delta passes, each one > sorting the objects by a different hash function. The first issue is that we > don't want to store two hash values per object, as that will significantly > increase the memory pressure during repacking. Is one more uint32_t per object really increasing the memory pressure so much? Isn't there a way to pack bits together to avoid that? > This could be side-stepped by > storing the full-name-hash in the packing list and then a second mapping > from full-name-hash to name-hash. However, that still leads to the two > passes taking extra time. Isn't there a way to sort using both hash functions in a single pass? (That is to perform a single pass and when considering an object sort it using both hash functions before considering a different object.) > The --path-walk approach is faster than even a > single pass. Is there a reason for that? > And in the right scenarios, the --full-name-hash option is very > close to the --path-walk results. > > > ANALYSIS OF PACKING STRATEGIES > ============================== > > Since I was focused on an internal monorepo that stored a large collection > of Javascript packages, it should be no surprise that I eventually found > other Javascript repositories that used similar tooling and thus had similar > issues with unexplained scale problems. I'll use these four repositories as > examples repeatedly, but one is actually public: microsoft/fluentui [6]. > I'll use Repo B, C, and D for the others, in increasing size. > > [6] https://github.com/microsoft/fluentui > > In each of these repositories, doing a full repack ('git repack -adf') right > after cloning presents a sizeable reduction in space. This is expected for > servers not being optimized for exactly the reachable set I'm cloning. So, > I'll focus on how much the default repacking parameters compare to using the > --full-name-hash or --path-walk options: > > | Repo | Standard Repack | With --full-name-hash | With --path-walk | > |----------|-----------------|-----------------------|------------------| > | fluentui | 438 MB | 168 MB | 148 MB | > | Repo B | 6,255 MB | 829 MB | 778 MB | > | Repo C | 37,737 MB | 7,125 MB | 6,158 MB | > | Repo D | 130,049 MB | 6,190 MB | 4,432 MB | > > > Hopefully these reductions show how much these name-hash collisions are > causing issues in these repositories. Yeah, right. It might be interesting to see the size reduction in % too. > For the fluentui repo, I'm also able to share highlights from the 'git > survey' output immediately after cloning and after repacking with the > --path-walk option. > > First, we can consider the total reachable objects in each scenario: > > TOTAL OBJECT SIZES BY TYPE > ================================================ > Object Type | Count | Disk Size | Inflated Size > ------------+--------+-----------+-------------- > Commits | 20579 | 10443669 | 15092790 > Trees | 276503 | 40212070 | 244429615 > Blobs | 294500 | 635365661 | 10791187920 > > TOTAL OBJECT SIZES BY TYPE > ================================================ > Object Type | Count | Disk Size | Inflated Size > ------------+--------+-----------+-------------- > Commits | 20579 | 10450605 | 15092790 > Trees | 276503 | 31136263 | 244429615 > Blobs | 294500 | 94442401 | 10791187920 Using % of size reduction or increase might help make it a bit clearer here too. > Now, we can consider the top 10 file paths before and after repacking with > the --path-walk feature: > > TOP FILES BY DISK SIZE > =================================================================== > Path | Count | Disk Size | Inflated Size > --------------------------------+-------+-----------+-------------- > yarn.lock | 802 | 58060531 | 889120488 > ...-experiments/CHANGELOG.json | 505 | 28439452 | 252723999 > ...fabric-react/CHANGELOG.json | 1270 | 25556510 | 902756623 > ...t-components/CHANGELOG.json | 176 | 20366365 | 244936649 > ...act-charting/CHANGELOG.json | 590 | 20106422 | 208224460 > ...e-components/CHANGELOG.json | 559 | 15761271 | 189061764 > ...act-examples/CHANGELOG.json | 577 | 13615569 | 234949961 > ...react-charting/CHANGELOG.md | 564 | 11205840 | 104337986 > .../experiments/CHANGELOG.json | 569 | 10596377 | 123662770 > ...i-fabric-react/CHANGELOG.md | 1263 | 8154248 | 261494258 > > > TOP FILES BY DISK SIZE > =================================================================== > Path | Count | Disk Size | Inflated Size > --------------------------------+-------+-----------+-------------- > yarn.lock | 802 | 9909326 | 889120488 > ...iceBrandGuide_16Sep2016.pdf | 1 | 2106334 | 2186005 > ...out/src/images/download.jpg | 1 | 1845249 | 1846117 > ...fluent-ui-logo-inverted.png | 3 | 1370372 | 1447493 > .yarn/releases/cli.js | 1 | 1335657 | 6741614 > ...c/images/fluent-ui-logo.png | 3 | 1272902 | 1341139 > ...ages/fluent-ui-logo-dev.png | 3 | 1130989 | 1186897 > ...nents/public/SegoeUI-VF.ttf | 1 | 1074046 | 1844524 > ...ig/rush/npm-shrinkwrap.json | 138 | 1058531 | 89326567 > ...Accessibility_29Sep2016.pdf | 1 | 856621 | 927268 > > > As we can see from this example, before the repack we are mainly seeing the > disk space be dominated by objects that appear at paths with CHANGELOG.json > or CHANGELOG.md, which are frequently hitting collisions with the default > name-hash. After the repack with the --path-walk feature, we see the largest > paths are what we expect: checked in binaries, frequently-edited yarn files, > and other hard-to- compress data. > > The nice thing about this result is that we can point to files that are > taking up space because there is no other way around it, not that the naming > convention for the files is causing confusion during packing. Yeah, nice result. I guess the result is the same or very similar when the improved name-hash algorithm is used. > WHAT TO DO NOW? > =============== > > Thank you for reading this far. I hope that this has added context on the > patch series for the --full-name-hash option, but also provided the right > context for when I come back in a week or so with a review-ready version of > the --path-walk option. > > These patches are rough and I want to make sure everyone knows that. > Reviewing them for style or even basic organization may lead to some wasted > time. I know that at least one of the patches got mangled and its diff does > not match its description (I'm thinking specifically about "pack-objects: > extract should_attempt_deltas()" but this could apply elsewhere). Ok, I will wait for the next iteration before reviewing the patches then. > But I think that interested parties could take my branch, build it, and give > these features a try on their favorite repos. I'd love to hear feedback on > the usability and effectiveness, especially if someone finds a case where > the --path-walk option is less effective at packing data. I might do that when reviewing the patch series later. I think it can still be valuable for you to get some first feedback from just reading this RFC. Thanks.