On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget <gitgitgadget@xxxxxxxxx> wrote: > > This is based on ds/more-index-cleanups also available as GitGitGadget PR > #839. > > The sparse checkout feature allows users to specify a "populated set" that > is smaller than the full list of files at HEAD. Files outside the sparse > checkout definition are not present in the working directory, but are still > present in the index (and marked with the CE_SKIP_WORKTREE bit). > > This means that the working directory has size O(populated), and commands > like "git status" or "git checkout" operate using an O(populated) number of > filesystem operations. However, parsing the index still operates on the > scale of O(HEAD). > > This can be particularly jarring if you are merging a small repository with > a large monorepo for the purpose of simplifying dependency management. Even > if users have nothing more in their working directory than they had before, > they suddenly see a significant increase in their "git status" or "git add" > times. In these cases, simply parsing the index can be a huge portion of the > command time. > > This RFC proposes an update to the index formats to allow "sparse directory > entries". These entries correspond to directories that are completely > excluded from the sparse checkout definition. We can detect that a directory > is excluded when using "cone mode" patterns. > > Since having directory entries is a radical departure from the existing > index format, a new extension "extensions.sparseIndex" is added. Using a > sparse index should cause incompatible tools to fail because they do not > understand this extension. > > The index is a critical data structure, so making such a drastic change must > be handled carefully. This RFC does only enough adjustments to demonstrate > performance improvements for "git status" and "git add." Other commands > should operate identically to before, since the other commands will expand a > sparse index into a full index by parsing trees. > > WARNING: I'm getting a failure on the FreeBSD build with my sparse-checkout > tests. I'm not sure what is causing these failures, but I will explore while > we discuss the possibility of the feature as a whole. > > Here is an outline for this RFC: > > * Patches 1-14: create and test the sparse index format. This is just > enough to start writing the format, but all Git commands become slower > for using it. This is because everything is guarded to expand to a full > index before actually operating on the cache entries. > > * Patch 15: This massive patch is actually a bunch of patches squashed > together. I have a branch that adds "ensure_full_index()" guards in each > necessary file along with some commentary about how the index is being > used. This patch is presented here as one big dump because that > commentary isn't particularly interesting if the RFC leads to a very > different approach. > > * Patches 16-27: These changes make enough code "sparse aware" such that > "git status" and "git add" start operating in time O(populated) instead > of O(HEAD). > > Performance numbers are given in patch 27, but repeated somewhat here. The > test environment I use has ~2.1 million paths at HEAD, but only 68,000 > populated paths given the sparse-checkout I'm using. The sparse index has > about 2,000 sparse directory entries. > > 1. Use the full index. The index size is ~186 MB. > 2. Use the sparse index. The index size is ~5.5 MB. > 3. Use a commit where HEAD matches the populated set. The full index size > is ~5.3MB. > > The third benchmark is included as a theoretical optimum for a repository of > the same object database. > > First, a clean 'git status' improves from 3.1s to 240ms. > > Benchmark #1: full index (git status) Time (mean ± σ): 3.167 s ± 0.036 s > [User: 2.006 s, System: 1.078 s] Range (min … max): 3.100 s … 3.208 s 10 > runs > > Benchmark #2: sparse index (git status) Time (mean ± σ): 239.5 ms ± 8.1 ms > [User: 189.4 ms, System: 226.8 ms] Range (min … max): 226.0 ms … 251.9 ms 13 > runs > > Benchmark #3: small tree (git status) Time (mean ± σ): 195.3 ms ± 4.5 ms > [User: 116.5 ms, System: 84.4 ms] Range (min … max): 188.8 ms … 202.8 ms 15 > runs > > The performance numbers for 'git add .' are much closer to optimal: > > Benchmark #1: full index (git add .) Time (mean ± σ): 3.076 s ± 0.022 s > [User: 2.065 s, System: 0.943 s] Range (min … max): 3.044 s … 3.116 s 10 > runs > > Benchmark #2: sparse index (git add .) Time (mean ± σ): 218.0 ms ± 6.6 ms > [User: 195.7 ms, System: 206.6 ms] Range (min … max): 209.8 ms … 228.2 ms 13 > runs > > Benchmark #3: small tree (git add .) Time (mean ± σ): 217.6 ms ± 5.4 ms > [User: 131.9 ms, System: 86.7 ms] Range (min … max): 212.1 ms … 228.4 ms 14 > runs > > I expect that making a sparse index work optimally through the most common > Git commands will take a year of effort. During this process, I expect to > add a lot of testing infrastructure around the sparse-checkout feature, > especially in corner cases. (This RFC focuses on the happy paths of > operating only within the sparse cone, but that will change in the future.) > > If this general approach is acceptable, then I would follow it with a > sequence of patch submissions that follow this approach: > > 1. Basics of the format. (Patches 1-14) > 2. Add additional guards around index interactions (Patch 15, but split > appropriately.) > 3. Speed up "git status" and "git add" (Patches 16-27) > > After those three items that are represented in this RFC, the work starts to > parallelize a bit. My basic ideas for moving forward from this point are to > do these basic steps: > > * Add new index API abstractions where appropriate, make them sparse-aware. > * Add new tests around sparse-checkout corner cases. Ensure the sparse > index works correctly. > * For a given builtin, add extra testing for sparse-checkouts then it them > sparse-aware. > > Here are some specific questions I'm hoping to answer in this RFC period: > > 1. Are these sparse directory entries an appropriate way to extend the > index format? > 2. Is extensions.sparseIndex a good way to signal that these entries can > exist? > 3. Is git sparse-checkout init --cone --sparse-index an appropriate way to > toggle the format? > 4. Are there specific areas that I should target to harden the index API > before I submit this work? > 5. Does anyone have a good idea how to test a large portion of the test > suite with sparse-index enabled? The problem I see is that most tests > don't use sparse-checkout, so the sparse index is identical to the full > index. Would it be interesting to enable the test setup to add such > "extra" directories during the test setup? > > Thanks, -Stolee Thanks for working on this. It's very exciting seeing this idea come alive. I had lots of little nitpicks and questions and whatnot here and there, but that almost feels like a diversion. Overall, I think you divided up the series in a very logical and easy to follow fashion, and actually achieved quite a bit already. I suspect I have partially answered some of your questions above among all my comments, and left others unanswered or worse, just re-asked the same question(s) myself. Feel free to ping again with the next round and I'll see if I can dodge your questions again...er, I mean, try to think of something helpful to say. :-) > > Derrick Stolee (27): > sparse-index: add guard to ensure full index > sparse-index: implement ensure_full_index() > t1092: compare sparse-checkout to sparse-index > test-read-cache: print cache entries with --table > test-tool: read-cache --table --no-stat > test-tool: don't force full index > unpack-trees: ensure full index > sparse-checkout: hold pattern list in index > sparse-index: convert from full to sparse > submodule: sparse-index should not collapse links > unpack-trees: allow sparse directories > sparse-index: check index conversion happens > sparse-index: create extension for compatibility > sparse-checkout: toggle sparse index from builtin > [RFC-VERSION] *: ensure full index > unpack-trees: make sparse aware > dir.c: accept a directory as part of cone-mode patterns > status: use sparse-index throughout > status: skip sparse-checkout percentage with sparse-index > sparse-index: expand_to_path() trivial implementation > sparse-index: expand_to_path no-op if path exists > add: allow operating on a sparse-only index > submodule: die_path_inside_submodule is sparse aware > dir: use expand_to_path in add_patterns() > fsmonitor: disable if index is sparse > pathspec: stop calling ensure_full_index > cache-tree: integrate with sparse directory entries > > Documentation/config/extensions.txt | 7 + > Documentation/git-sparse-checkout.txt | 14 + > Makefile | 1 + > apply.c | 10 +- > blame.c | 7 +- > builtin/add.c | 3 + > builtin/checkout-index.c | 5 +- > builtin/commit.c | 3 + > builtin/grep.c | 2 + > builtin/ls-files.c | 9 +- > builtin/merge-index.c | 2 + > builtin/mv.c | 2 + > builtin/rm.c | 2 + > builtin/sparse-checkout.c | 35 ++- > builtin/update-index.c | 2 + > cache-tree.c | 21 ++ > cache.h | 15 +- > diff.c | 2 + > dir.c | 19 +- > dir.h | 2 +- > entry.c | 2 + > fsmonitor.c | 18 +- > merge-recursive.c | 22 +- > name-hash.c | 10 + > pathspec.c | 4 +- > pathspec.h | 4 +- > preload-index.c | 2 + > read-cache.c | 51 +++- > repo-settings.c | 15 + > repository.c | 11 +- > repository.h | 3 + > rerere.c | 2 + > resolve-undo.c | 6 + > setup.c | 3 + > sha1-name.c | 3 + > sparse-index.c | 360 +++++++++++++++++++++++ > sparse-index.h | 23 ++ > split-index.c | 2 + > submodule.c | 22 +- > submodule.h | 6 +- > t/helper/test-read-cache.c | 77 ++++- > t/t1092-sparse-checkout-compatibility.sh | 130 +++++++- > tree.c | 2 + > unpack-trees.c | 40 ++- > wt-status.c | 21 +- > wt-status.h | 1 + > 46 files changed, 942 insertions(+), 61 deletions(-) > create mode 100644 sparse-index.c > create mode 100644 sparse-index.h > > > base-commit: 2271fe7848aa11b30e5313d95d9caebc2937fce5 > Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-847%2Fderrickstolee%2Fsparse-index%2Frfc-v1 > Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-847/derrickstolee/sparse-index/rfc-v1 > Pull-Request: https://github.com/gitgitgadget/git/pull/847 > -- > gitgitgadget