Re: [PATCH 00/27] [RFC] Sparse Index

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@xxxxxxxxx> wrote:
>
> This is based on ds/more-index-cleanups also available as GitGitGadget PR
> #839.
>
> The sparse checkout feature allows users to specify a "populated set" that
> is smaller than the full list of files at HEAD. Files outside the sparse
> checkout definition are not present in the working directory, but are still
> present in the index (and marked with the CE_SKIP_WORKTREE bit).
>
> This means that the working directory has size O(populated), and commands
> like "git status" or "git checkout" operate using an O(populated) number of
> filesystem operations. However, parsing the index still operates on the
> scale of O(HEAD).
>
> This can be particularly jarring if you are merging a small repository with
> a large monorepo for the purpose of simplifying dependency management. Even
> if users have nothing more in their working directory than they had before,
> they suddenly see a significant increase in their "git status" or "git add"
> times. In these cases, simply parsing the index can be a huge portion of the
> command time.
>
> This RFC proposes an update to the index formats to allow "sparse directory
> entries". These entries correspond to directories that are completely
> excluded from the sparse checkout definition. We can detect that a directory
> is excluded when using "cone mode" patterns.
>
> Since having directory entries is a radical departure from the existing
> index format, a new extension "extensions.sparseIndex" is added. Using a
> sparse index should cause incompatible tools to fail because they do not
> understand this extension.
>
> The index is a critical data structure, so making such a drastic change must
> be handled carefully. This RFC does only enough adjustments to demonstrate
> performance improvements for "git status" and "git add." Other commands
> should operate identically to before, since the other commands will expand a
> sparse index into a full index by parsing trees.
>
> WARNING: I'm getting a failure on the FreeBSD build with my sparse-checkout
> tests. I'm not sure what is causing these failures, but I will explore while
> we discuss the possibility of the feature as a whole.
>
> Here is an outline for this RFC:
>
>  * Patches 1-14: create and test the sparse index format. This is just
>    enough to start writing the format, but all Git commands become slower
>    for using it. This is because everything is guarded to expand to a full
>    index before actually operating on the cache entries.
>
>  * Patch 15: This massive patch is actually a bunch of patches squashed
>    together. I have a branch that adds "ensure_full_index()" guards in each
>    necessary file along with some commentary about how the index is being
>    used. This patch is presented here as one big dump because that
>    commentary isn't particularly interesting if the RFC leads to a very
>    different approach.
>
>  * Patches 16-27: These changes make enough code "sparse aware" such that
>    "git status" and "git add" start operating in time O(populated) instead
>    of O(HEAD).
>
> Performance numbers are given in patch 27, but repeated somewhat here. The
> test environment I use has ~2.1 million paths at HEAD, but only 68,000
> populated paths given the sparse-checkout I'm using. The sparse index has
> about 2,000 sparse directory entries.
>
>  1. Use the full index. The index size is ~186 MB.
>  2. Use the sparse index. The index size is ~5.5 MB.
>  3. Use a commit where HEAD matches the populated set. The full index size
>     is ~5.3MB.
>
> The third benchmark is included as a theoretical optimum for a repository of
> the same object database.
>
> First, a clean 'git status' improves from 3.1s to 240ms.
>
> Benchmark #1: full index (git status) Time (mean ± σ): 3.167 s ± 0.036 s
> [User: 2.006 s, System: 1.078 s] Range (min … max): 3.100 s … 3.208 s 10
> runs
>
> Benchmark #2: sparse index (git status) Time (mean ± σ): 239.5 ms ± 8.1 ms
> [User: 189.4 ms, System: 226.8 ms] Range (min … max): 226.0 ms … 251.9 ms 13
> runs
>
> Benchmark #3: small tree (git status) Time (mean ± σ): 195.3 ms ± 4.5 ms
> [User: 116.5 ms, System: 84.4 ms] Range (min … max): 188.8 ms … 202.8 ms 15
> runs
>
> The performance numbers for 'git add .' are much closer to optimal:
>
> Benchmark #1: full index (git add .) Time (mean ± σ): 3.076 s ± 0.022 s
> [User: 2.065 s, System: 0.943 s] Range (min … max): 3.044 s … 3.116 s 10
> runs
>
> Benchmark #2: sparse index (git add .) Time (mean ± σ): 218.0 ms ± 6.6 ms
> [User: 195.7 ms, System: 206.6 ms] Range (min … max): 209.8 ms … 228.2 ms 13
> runs
>
> Benchmark #3: small tree (git add .) Time (mean ± σ): 217.6 ms ± 5.4 ms
> [User: 131.9 ms, System: 86.7 ms] Range (min … max): 212.1 ms … 228.4 ms 14
> runs
>
> I expect that making a sparse index work optimally through the most common
> Git commands will take a year of effort. During this process, I expect to
> add a lot of testing infrastructure around the sparse-checkout feature,
> especially in corner cases. (This RFC focuses on the happy paths of
> operating only within the sparse cone, but that will change in the future.)
>
> If this general approach is acceptable, then I would follow it with a
> sequence of patch submissions that follow this approach:
>
>  1. Basics of the format. (Patches 1-14)
>  2. Add additional guards around index interactions (Patch 15, but split
>     appropriately.)
>  3. Speed up "git status" and "git add" (Patches 16-27)
>
> After those three items that are represented in this RFC, the work starts to
> parallelize a bit. My basic ideas for moving forward from this point are to
> do these basic steps:
>
>  * Add new index API abstractions where appropriate, make them sparse-aware.
>  * Add new tests around sparse-checkout corner cases. Ensure the sparse
>    index works correctly.
>  * For a given builtin, add extra testing for sparse-checkouts then it them
>    sparse-aware.
>
> Here are some specific questions I'm hoping to answer in this RFC period:
>
>  1. Are these sparse directory entries an appropriate way to extend the
>     index format?
>  2. Is extensions.sparseIndex a good way to signal that these entries can
>     exist?
>  3. Is git sparse-checkout init --cone --sparse-index an appropriate way to
>     toggle the format?
>  4. Are there specific areas that I should target to harden the index API
>     before I submit this work?
>  5. Does anyone have a good idea how to test a large portion of the test
>     suite with sparse-index enabled? The problem I see is that most tests
>     don't use sparse-checkout, so the sparse index is identical to the full
>     index. Would it be interesting to enable the test setup to add such
>     "extra" directories during the test setup?
>
> Thanks, -Stolee

Thanks for working on this.  It's very exciting seeing this idea come
alive.  I had lots of little nitpicks and questions and whatnot here
and there, but that almost feels like a diversion.  Overall, I think
you divided up the series in a very logical and easy to follow
fashion, and actually achieved quite a bit already.

I suspect I have partially answered some of your questions above among
all my comments, and left others unanswered or worse, just re-asked
the same question(s) myself.  Feel free to ping again with the next
round and I'll see if I can dodge your questions again...er, I mean,
try to think of something helpful to say.  :-)

>
> Derrick Stolee (27):
>   sparse-index: add guard to ensure full index
>   sparse-index: implement ensure_full_index()
>   t1092: compare sparse-checkout to sparse-index
>   test-read-cache: print cache entries with --table
>   test-tool: read-cache --table --no-stat
>   test-tool: don't force full index
>   unpack-trees: ensure full index
>   sparse-checkout: hold pattern list in index
>   sparse-index: convert from full to sparse
>   submodule: sparse-index should not collapse links
>   unpack-trees: allow sparse directories
>   sparse-index: check index conversion happens
>   sparse-index: create extension for compatibility
>   sparse-checkout: toggle sparse index from builtin
>   [RFC-VERSION] *: ensure full index
>   unpack-trees: make sparse aware
>   dir.c: accept a directory as part of cone-mode patterns
>   status: use sparse-index throughout
>   status: skip sparse-checkout percentage with sparse-index
>   sparse-index: expand_to_path() trivial implementation
>   sparse-index: expand_to_path no-op if path exists
>   add: allow operating on a sparse-only index
>   submodule: die_path_inside_submodule is sparse aware
>   dir: use expand_to_path in add_patterns()
>   fsmonitor: disable if index is sparse
>   pathspec: stop calling ensure_full_index
>   cache-tree: integrate with sparse directory entries
>
>  Documentation/config/extensions.txt      |   7 +
>  Documentation/git-sparse-checkout.txt    |  14 +
>  Makefile                                 |   1 +
>  apply.c                                  |  10 +-
>  blame.c                                  |   7 +-
>  builtin/add.c                            |   3 +
>  builtin/checkout-index.c                 |   5 +-
>  builtin/commit.c                         |   3 +
>  builtin/grep.c                           |   2 +
>  builtin/ls-files.c                       |   9 +-
>  builtin/merge-index.c                    |   2 +
>  builtin/mv.c                             |   2 +
>  builtin/rm.c                             |   2 +
>  builtin/sparse-checkout.c                |  35 ++-
>  builtin/update-index.c                   |   2 +
>  cache-tree.c                             |  21 ++
>  cache.h                                  |  15 +-
>  diff.c                                   |   2 +
>  dir.c                                    |  19 +-
>  dir.h                                    |   2 +-
>  entry.c                                  |   2 +
>  fsmonitor.c                              |  18 +-
>  merge-recursive.c                        |  22 +-
>  name-hash.c                              |  10 +
>  pathspec.c                               |   4 +-
>  pathspec.h                               |   4 +-
>  preload-index.c                          |   2 +
>  read-cache.c                             |  51 +++-
>  repo-settings.c                          |  15 +
>  repository.c                             |  11 +-
>  repository.h                             |   3 +
>  rerere.c                                 |   2 +
>  resolve-undo.c                           |   6 +
>  setup.c                                  |   3 +
>  sha1-name.c                              |   3 +
>  sparse-index.c                           | 360 +++++++++++++++++++++++
>  sparse-index.h                           |  23 ++
>  split-index.c                            |   2 +
>  submodule.c                              |  22 +-
>  submodule.h                              |   6 +-
>  t/helper/test-read-cache.c               |  77 ++++-
>  t/t1092-sparse-checkout-compatibility.sh | 130 +++++++-
>  tree.c                                   |   2 +
>  unpack-trees.c                           |  40 ++-
>  wt-status.c                              |  21 +-
>  wt-status.h                              |   1 +
>  46 files changed, 942 insertions(+), 61 deletions(-)
>  create mode 100644 sparse-index.c
>  create mode 100644 sparse-index.h
>
>
> base-commit: 2271fe7848aa11b30e5313d95d9caebc2937fce5
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-847%2Fderrickstolee%2Fsparse-index%2Frfc-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-847/derrickstolee/sparse-index/rfc-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/847
> --
> gitgitgadget




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux