Here is the first full patch series submission coming out of the sparse-index RFC [1]. [1] https://lore.kernel.org/git/pull.847.git.1611596533.gitgitgadget@xxxxxxxxx/ I won't waste too much space here, because PATCH 1 includes a sizeable design document that describes the feature, the reasoning behind it, and my plan for getting this implemented widely throughout the codebase. There are some new things here that were not in the RFC: * Design doc and format updates. (Patch 1) * Performance test script. (Patches 2 and 20) Notably missing in this series from the RFC: * The mega-patch inserting ensure_full_index() throughout the codebase. That will be a follow-up series to this one. * The integrations with git status and git add to demonstrate the improved performance. Those will also appear in their own series later. I plan to keep my latest work in this area in my 'sparse-index/wip' branch [2]. It includes all of the work from the RFC right now, updated with the work from this series. [2] https://github.com/derrickstolee/git/tree/sparse-index/wip Updates in V4 ============= * Rebased onto the latest copy of ab/read-tree. * Updated the design document as per Junio's comments. * Updated the submodule handling in the performance test. * Followed up on some other review from Ævar, mostly style or commit message things. Updates in V3 ============= For this version, I took Ævar's latest patches and applied them to v2.31.0 and rebased this series on top. It uses his new "read_tree_at()" helper and the associated changes to the function pointer type. * Fixed more typos. Thanks Martin and Elijah! * Updated the test_sparse_match() macro to use "$@" instead of $* * Added a test that git sparse-checkout init --no-sparse-index rewrites the index to be full. Updates in V2 ============= * Various typos and awkward grammar is fixed. * Cleaned up unnecessary commands in p2000-sparse-operations.sh * Added a comment to the sparse_index member of struct index_state. * Used tree_type, commit_type, and blob_type in test-read-cache.c. Thanks, -Stolee Derrick Stolee (20): sparse-index: design doc and format update t/perf: add performance test for sparse operations t1092: clean up script quoting sparse-index: add guard to ensure full index sparse-index: implement ensure_full_index() t1092: compare sparse-checkout to sparse-index test-read-cache: print cache entries with --table test-tool: don't force full index unpack-trees: ensure full index sparse-checkout: hold pattern list in index sparse-index: convert from full to sparse submodule: sparse-index should not collapse links unpack-trees: allow sparse directories sparse-index: check index conversion happens sparse-index: create extension for compatibility sparse-checkout: toggle sparse index from builtin sparse-checkout: disable sparse-index cache-tree: integrate with sparse directory entries sparse-index: loose integration with cache_tree_verify() p2000: add sparse-index repos Documentation/config/extensions.txt | 8 + Documentation/git-sparse-checkout.txt | 14 ++ Documentation/technical/index-format.txt | 7 + Documentation/technical/sparse-index.txt | 174 ++++++++++++++ Makefile | 1 + builtin/sparse-checkout.c | 44 +++- cache-tree.c | 40 ++++ cache.h | 18 +- read-cache.c | 35 ++- repo-settings.c | 15 ++ repository.c | 11 +- repository.h | 3 + setup.c | 3 + sparse-index.c | 293 +++++++++++++++++++++++ sparse-index.h | 11 + t/README | 3 + t/helper/test-read-cache.c | 66 ++++- t/perf/p2000-sparse-operations.sh | 101 ++++++++ t/t1091-sparse-checkout-builtin.sh | 13 + t/t1092-sparse-checkout-compatibility.sh | 143 +++++++++-- unpack-trees.c | 17 +- 21 files changed, 980 insertions(+), 40 deletions(-) create mode 100644 Documentation/technical/sparse-index.txt create mode 100644 sparse-index.c create mode 100644 sparse-index.h create mode 100755 t/perf/p2000-sparse-operations.sh base-commit: 47957485b3b731a7860e0554d2bd12c0dce1c75a Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-883%2Fderrickstolee%2Fsparse-index%2Fformat-v4 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-883/derrickstolee/sparse-index/format-v4 Pull-Request: https://github.com/gitgitgadget/git/pull/883 Range-diff vs v3: 1: 62ac13945bec ! 1: 6426a5c60e53 sparse-index: design doc and format update @@ Documentation/technical/index-format.txt: Git index format + is enabled in cone mode (`core.sparseCheckoutCone` is enabled) and the + `extensions.sparseIndex` extension is enabled, then the index may + contain entries for directories outside of the sparse-checkout definition. -+ These entries have mode `0040000`, include the `SKIP_WORKTREE` bit, and ++ These entries have mode `040000`, include the `SKIP_WORKTREE` bit, and + the path ends in a directory separator. + 32-bit ctime seconds, the last time a file's metadata changed @@ Documentation/technical/sparse-index.txt (new) +`core.sparseCheckoutCone`, allow for very fast pattern matching to +discover which files at HEAD belong in the sparse-checkout cone. + -+Three important scale dimensions for a Git worktree are: ++Three important scale dimensions for a Git working directory are: + +* `HEAD`: How many files are present at `HEAD`? + @@ Documentation/technical/sparse-index.txt (new) + +These dimensions are ordered by their magnitude: users (typically) modify +fewer files than are populated, and we can only populate files at `HEAD`. -+These dimensions are also ordered by how expensive they are per item: it -+is expensive to detect a modified file than it is to write one that we -+know must be populated; changing `HEAD` only really requires updating the -+index. + +Problems occur if there is an extreme imbalance in these dimensions. For +example, if `HEAD` contains millions of paths but the populated set has @@ Documentation/technical/sparse-index.txt (new) +At time of writing, sparse-directory entries violate expectations about the +index format and its in-memory data structure. There are many consumers in +the codebase that expect to iterate through all of the index entries and -+see only files. In addition, they expect to see all files at `HEAD`. One -+way to handle this is to parse trees to replace a sparse-directory entry -+with all of the files within that tree as the index is loaded. However, -+parsing trees is slower than parsing the index format, so that is a slower -+operation than if we left the index alone. ++see only files. In fact, these loops expect to see a reference to every ++staged file. One way to handle this is to parse trees to replace a ++sparse-directory entry with all of the files within that tree as the index ++is loaded. However, parsing trees is slower than parsing the index format, ++so that is a slower operation than if we left the index alone. The plan is ++to make all of these integrations "sparse aware" so this expansion through ++tree parsing is unnecessary and they use fewer resources than when using a ++full index. + +The implementation plan below follows four phases to slowly integrate with +the sparse-index. The intention is to incrementally update Git commands to @@ Documentation/technical/sparse-index.txt (new) +data structure can operate with its current assumption of every file at +`HEAD`. + -+At first, every index parse will expand the sparse-directory entries into -+the full list of paths at `HEAD`. This will be slower in all cases. The -+only noticable change in behavior will be that the serialized index file -+contains sparse-directory entries. ++At first, every index parse will call a helper method, ++`ensure_full_index()`, which scans the index for sparse-directory entries ++(pointing to trees) and replaces them with the full list of paths (with ++blob contents) by parsing tree objects. This will be slower in all cases. ++The only noticeable change in behavior will be that the serialized index ++file contains sparse-directory entries. + +To start, we use a new repository extension, `extensions.sparseIndex`, to +allow inserting sparse-directory entries into indexes with file format 2: d2197e895e4d ! 2: 7eabc1d0586c t/perf: add performance test for sparse operations @@ t/perf/p2000-sparse-operations.sh (new) + +test_expect_success 'setup repo and indexes' ' + git reset --hard HEAD && ++ + # Remove submodules from the example repo, because our -+ # duplication of the entire repo creates an unlikly data shape. -+ git config --file .gitmodules --get-regexp "submodule.*.path" >modules && -+ git rm -f .gitmodules && -+ for module in $(awk "{print \$2}" modules) -+ do -+ git rm $module || return 1 -+ done && -+ git commit -m "remove submodules" && ++ # duplication of the entire repo creates an unlikely data shape. ++ if git config --file .gitmodules --get-regexp "submodule.*.path" >modules ++ then ++ git rm $(awk "{print \$2}" modules) && ++ git commit -m "remove submodules" || return 1 ++ fi && + + echo bogus >a && + cp a b && 3: d3cfd34b8418 ! 3: c9e21d78ecba t1092: clean up script quoting @@ Commit message t1092: clean up script quoting This test was introduced in 19a0acc83e4 (t1092: test interesting - sparse-checkout scenarios, 2021-01-23), but these issues with quoting - were not noticed until starting this follow-up series. The old mechanism - would drop quoting such as in + sparse-checkout scenarios, 2021-01-23), but it contains issues with quoting + that were not noticed until starting this follow-up series. The old + mechanism would drop quoting such as in test_all_match git commit -m "touch README.md" 4: 4472118cf903 = 4: 03cdde756563 sparse-index: add guard to ensure full index 5: 99292cdbaae4 = 5: 6b3b6d86385d sparse-index: implement ensure_full_index() 6: fae5663a17bb = 6: 7f67adba0498 t1092: compare sparse-checkout to sparse-index 7: dffe8821fde2 ! 7: 7ebd9570b1ad test-read-cache: print cache entries with --table @@ Commit message a sparse-index. Further, 'git ls-tree' does not use a trailing directory separator for its tree rows. - This does not print the stat() information for the blobs. That could be + This does not print the stat() information for the blobs. That will be added in a future change with another option. The tests that are added in the next few changes care only about the object types and IDs. + However, this future need for full index information justifies the need + for this test helper over extending a user-facing feature, such as 'git + ls-files'. To make the option parsing slightly more robust, wrap the string comparisons in a loop adapted from test-dir-iterator.c. 8: f4ad081f25bb = 8: db7bbd06dbcc test-tool: don't force full index 9: 4780076a50df = 9: 3ddd5e794b5e unpack-trees: ensure full index 10: 33fdba2b8cfd = 10: 7308c87697f1 sparse-checkout: hold pattern list in index 11: e41b14e03ebb ! 11: 7c10d653ca6b sparse-index: convert from full to sparse @@ cache.h: static inline unsigned int create_ce_mode(unsigned int mode) { if (S_ISLNK(mode)) return S_IFLNK; -+ if (mode == S_IFDIR) ++ if (S_ISSPARSEDIR(mode)) + return S_IFDIR; if (S_ISDIR(mode) || S_ISGITLINK(mode)) return S_IFGITLINK; 12: b77cd6b02265 = 12: 6db36f33e960 submodule: sparse-index should not collapse links 13: 4000c5cdd4cf ! 13: d24bd3348d98 unpack-trees: allow sparse directories @@ unpack-trees.c: static int index_pos_by_traverse_info(struct name_entry *names, + if (!o->src_index->sparse_index || + !(o->src_index->cache[pos]->ce_flags & CE_SKIP_WORKTREE)) + BUG("This is a directory and should not exist in index"); -+ } else ++ } else { + pos = -pos - 1; ++ } if (pos >= o->src_index->cache_nr || !starts_with(o->src_index->cache[pos]->name, name.buf) || (pos > 0 && starts_with(o->src_index->cache[pos-1]->name, name.buf))) 14: 1a2be38b2ca7 = 14: 08d9f5f3c0d1 sparse-index: check index conversion happens 15: f89891b0ae4e = 15: 6f38cef196b0 sparse-index: create extension for compatibility 16: bd703c76c859 = 16: 923081e7e079 sparse-checkout: toggle sparse index from builtin 17: 598557f90a2a = 17: 6f1ad72c390d sparse-checkout: disable sparse-index 18: c2d0c17db31a = 18: bd94e6b7d089 cache-tree: integrate with sparse directory entries 19: 6fdd9323c14e = 19: e7190376b806 sparse-index: loose integration with cache_tree_verify() 20: 3db06ac46dd5 = 20: bcf0a58eb38c p2000: add sparse-index repos -- gitgitgadget