On 1/6/2021 9:10 PM, Junio C Hamano wrote: > "Derrick Stolee via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes: > >> Subject: Re: [PATCH v2 6/9] index-format: update preamble to cached tree extension > > By the way, the name of the extension is "cache tree". > > git grep -i 'cached[- ]tree' ':!po/' > > reveals there are a handful of mistakes already present but their > number is dwarfed when we check: > > git grep -i 'cache tree' ':!po/' I will fix my own additions and add a patch that fixes these mistakes. >> + Since the index does not record entries for directories, the cache >> + entries cannot describe tree objects that already exist in the object >> + database for regions of the index that are unchanged from an existing >> + commit. The cached tree extension stores a recursive tree structure that >> + describes the trees that already exist and completely match sections of >> + the cache entries. This speeds up tree object generation from the index >> + for a new commit by only computing the trees that are "new" to that >> + commit. > > The original motivation was the above one. A cache of tree objects > that correspond to unmodified part of the directory structure helps > writing out a new tree out of modified index. > > We later found out that we rather often compare the index against > the tree of HEAD (think: "git status"), and diff-lib.c::diff_cache() > does take advantage of the fact that an entire directory can be > skipped if the tree object taken from the HEAD side exactly matches > the tree recorded for the subdirectory in the cache tree extension. I need to read more about this. traverse_by_cache_tree() seems to be a good place to start. Thanks. >> + The recursive tree structure uses nodes that store a number of cache >> + entries, a list of subnodes, and an object ID (OID). The OID references >> + the exising tree for that node, if it is known to exist. The subnodes >> + correspond to subdirectories that themselves have cached tree nodes. The >> + number of cache entries corresponds to the number of cache entries in >> + the index that describe paths within that tree's directory. s/exising/existing/ > > OK. > >> + Note that the path for a given tree is part of the parent node in-memory > > Sorry, I am not sure if I follow. The top-level in-core cache_tree > object records the number of entries, tree object name for the > entire tree (if valid), and cache_tree_sub structures, one for each > subdirectory. Each of the cache_tree_sub structure describes the > "child" directory, including the path to it. > >> + but is part of the child in the file format. The root tree has an empty >> + string for its name and its name does not exist in-memory. > > It's more like we could have consistently used cache_tree_sub > instances to represent each and every level (i.e. I consider that > cache_tree_sub is what represents a directory, with cache_tree being > a record of just one aspect of it) including the root of the > hierarchy, but because there wasn't much point in giving a name to > the root level, I cheated and avoided wasting a cache_tree_sub for > it. So from that point of view, the path belongs to the node in > each level in both in-core and on-disk representations. That's a good point. I'll retract my statement here. >> + When a path is updated in index, Git invalidates all nodes of the >> + recurisive cached tree corresponding to the parent directories of that >> + path. We store these tree nodes as being "invalid" by using "-1" as the >> + number of cache entries. > > Correct. Making note of my s/recurisive/recursive/ typo here. >> + To create trees corresponding to the current >> + index, Git only walks the invalid tree nodes and uses the cached OIDs >> + for the valid trees to construct new trees. > > I wonder if the above is sufficiently clear, or "Git only has to > walk the spans of index entries that corresponds to the invalid > trees, while reusing the ..." is too long and detailed. I will try to simplify. Thanks, -Stolee