I've taken a professional interest in the index lately, and I've been trying mostly to learn about it and measure different operations. Along the way, I've seen some possible improvements in documentation, behavior, and tracing. This series collects most of my thoughts so far, including: 1. Adding extra trace2 regions and statistics (similar to [1]) (Patches 1-5). 2. Update documentation about the cached tree extension (Patches 6-7). 3. Improved the performance of verify_cache() (Patches 8-9). Thanks, -Stolee [1] https://lore.kernel.org/git/pull.828.git.1609302714183.gitgitgadget@xxxxxxxxx/ UPDATES IN V3 ============= * Added a patch that fixes previous uses of "cached tree" * Updated preamble patch with typos and semantic corrections. Derrick Stolee (9): tree-walk: report recursion counts unpack-trees: add trace2 regions cache-tree: use trace2 in cache_tree_update() cache-tree: trace regions for I/O cache-tree: trace regions for prime_cache_tree index-format: use 'cache tree' over 'cached tree' index-format: update preamble to cache tree extension index-format: discuss recursion of cached-tree better cache-tree: speed up consecutive path comparisons René Scharfe (1): cache-tree: use ce_namelen() instead of strlen() Documentation/technical/index-format.txt | 42 ++++++++++++++++++------ cache-tree.c | 30 +++++++++++++---- t/t7104-reset-hard.sh | 2 +- tree-walk.c | 33 +++++++++++++++++++ unpack-trees.c | 5 +++ 5 files changed, 94 insertions(+), 18 deletions(-) base-commit: 71ca53e8125e36efbda17293c50027d31681a41f Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-829%2Fderrickstolee%2Fcache-tree%2Fbasics-v3 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-829/derrickstolee/cache-tree/basics-v3 Pull-Request: https://github.com/gitgitgadget/git/pull/829 Range-diff vs v2: 1: 0e500c86f39 = 1: 0e500c86f39 tree-walk: report recursion counts 2: 4157b91acf8 = 2: 4157b91acf8 unpack-trees: add trace2 regions 3: 8959d57abdd = 3: 8959d57abdd cache-tree: use trace2 in cache_tree_update() 4: 1d8a797ee26 = 4: 1d8a797ee26 cache-tree: trace regions for I/O 5: 2b2e70bb77c = 5: 2b2e70bb77c cache-tree: trace regions for prime_cache_tree -: ----------- > 6: 2d7b18c2e0b index-format: use 'cache tree' over 'cached tree' 6: 75b51483d3c ! 7: c5cffb5956e index-format: update preamble to cached tree extension @@ Metadata Author: Derrick Stolee <dstolee@xxxxxxxxxxxxx> ## Commit message ## - index-format: update preamble to cached tree extension + index-format: update preamble to cache tree extension - I had difficulty in my efforts to learn about the cached tree extension + I had difficulty in my efforts to learn about the cache tree extension based on the documentation and code because I had an incorrect assumption about how it behaved. This might be due to some ambiguity in the documentation, so this change modifies the beginning of the cached @@ Commit message 3. When exactly are "new" trees created? + Helped-by: Junio C Hamano <gitster@xxxxxxxxx> Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx> ## Documentation/technical/index-format.txt ## @@ Documentation/technical/index-format.txt: Git index format - === Cached tree + === Cache tree -- Cached tree extension contains pre-computed hashes for trees that can +- Cache tree extension contains pre-computed hashes for trees that can - be derived from the index. It helps speed up tree object generation - from index for a new commit. - @@ Documentation/technical/index-format.txt: Git index format + Since the index does not record entries for directories, the cache + entries cannot describe tree objects that already exist in the object + database for regions of the index that are unchanged from an existing -+ commit. The cached tree extension stores a recursive tree structure that ++ commit. The cache tree extension stores a recursive tree structure that + describes the trees that already exist and completely match sections of + the cache entries. This speeds up tree object generation from the index + for a new commit by only computing the trees that are "new" to that -+ commit. ++ commit. It also assists when comparing the index to another tree, such ++ as `HEAD^{tree}`, since sections of the index can be skipped when a tree ++ comparison demonstrates equality. + + The recursive tree structure uses nodes that store a number of cache + entries, a list of subnodes, and an object ID (OID). The OID references -+ the exising tree for that node, if it is known to exist. The subnodes -+ correspond to subdirectories that themselves have cached tree nodes. The ++ the existing tree for that node, if it is known to exist. The subnodes ++ correspond to subdirectories that themselves have cache tree nodes. The + number of cache entries corresponds to the number of cache entries in + the index that describe paths within that tree's directory. + -+ Note that the path for a given tree is part of the parent node in-memory -+ but is part of the child in the file format. The root tree has an empty -+ string for its name and its name does not exist in-memory. ++ The extension tracks the full directory structure in the cache tree ++ extension, but this is generally smaller than the full cache entry list. + + When a path is updated in index, Git invalidates all nodes of the -+ recurisive cached tree corresponding to the parent directories of that ++ recursive cache tree corresponding to the parent directories of that + path. We store these tree nodes as being "invalid" by using "-1" as the -+ number of cache entries. To create trees corresponding to the current -+ index, Git only walks the invalid tree nodes and uses the cached OIDs -+ for the valid trees to construct new trees. In this way, Git only -+ constructs trees on the order of the number of changed paths (and their -+ depth in the working directory). This comes at a cost of tracking the -+ full directory structure in the cached tree extension, but this is -+ generally smaller than the full cache entry list in the index. ++ number of cache entries. Invalid nodes still store a span of index ++ entries, allowing Git to focus its efforts when reconstructing a full ++ cache tree. The signature for this extension is { 'T', 'R', 'E', 'E' }. 7: b2bb141a254 = 8: 97c06c80a85 index-format: discuss recursion of cached-tree better 8: 5298694786e = 9: 2532f5cc189 cache-tree: use ce_namelen() instead of strlen() 9: 72edd7bb427 = 10: 7c1c206a0bc cache-tree: speed up consecutive path comparisons -- gitgitgadget