Re: [PATCH v4 2/4] sparse-index: avoid unnecessary cache tree clearing

Derrick Stolee <stolee@xxxxxxxxx> · Fri, 29 Oct 2021 15:00:01 -0400

On 10/29/2021 2:45 PM, Junio C Hamano wrote:
> "Victoria Dye via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:
> 
>> From: Victoria Dye <vdye@xxxxxxxxxx>
>>
>> When converting a full index to sparse, clear and recreate the cache tree
>> only if the cache tree is not fully valid. The convert_to_sparse operation
>> should exit silently if a cache tree update cannot be successfully completed
>> (e.g., due to a conflicted entry state). However, because this failure
>> scenario only occurs when at least a portion of the cache tree is invalid,
>> we can save ourselves the cost of clearing and recreating the cache tree by
>> skipping the check when the cache tree is fully valid.
> 
> I see in cache-tree.c::update_one() this snippet of code.
> 
> 	/*
> 	 * If the first entry of this region is a sparse directory
> 	 * entry corresponding exactly to 'base', then this cache_tree
> 	 * struct is a "leaf" in the data structure, pointing to the
> 	 * tree OID specified in the entry.
> 	 */
> 	if (entries > 0) {
> 		const struct cache_entry *ce = cache[0];
> 
> 		if (S_ISSPARSEDIR(ce->ce_mode) &&
> 		    ce->ce_namelen == baselen &&
> 		    !strncmp(ce->name, base, baselen)) {
> 			it->entry_count = 1;
> 			oidcpy(&it->oid, &ce->oid);
> 			return 1;
> 		}
> 	}
> 
> Sorry for not noticing it earlier, but does this mean that the
> content of a cache-tree changes shape when sparse-ness of the index
> changes?  Is a cache-tree that knows about all of the
> subdirectories, even the ones that are descendants of a directory
> that is represented as a tree-ish entry in the main index array,
> still valid in a sparse index?
> 
> If not, then I do not think of a quick and sure way to ensure that
> the cache-tree is valid when the sparse-ness changes.
> 
> The earlier suggestion was based on my assumption that even when the
> main index array becomes sparse, the cache tree is still populated
> and valid, so that after writing a tree and writing an on-disk index,
> and then reading the on-disk index back (possibly in another process),
> would not have to incur the recomputation cost of the full tree when
> the reading codepath needs to flip the sparseness.
>
> But the above code snippet makes me worried a lot.  A cache-tree
> that used to be valid when the corresponding in-core index array was
> not sparse will become invalid immediately when we decide to make it
> sparse, right?

I think I would argue that the cache-tree is valid in this case. Each
node represents the tree that would be created from that region of the
index if we called 'git commit'. We can reconstruct the contained
subtrees by walking trees if necessary, but that isn't necessary at
this point.

I am writing a blog post about the sparse index, and I go into detail
about how the cache tree is modified. I use the
derrickstolee/trace-flamechart repo [1] as an example, so I will use
it as a concrete discussion of what is happening here.

Here is the cache-tree without sparse-index:

 [ root (left: 0, size: 16) ]
  |
  +---- [ bin/ (left: 2, size: 1) ] 
  |
  +---- [ examples/ (left: 3, size: 11) ]
        |
        +---- [ fetch/ (left: 3, size: 8) ]
        |
        +---- [ maintenance/ (left: 11, size: 3) ]

Then, I run

	git sparse-checkout init --cone --sparse-index
	git sparse-checkout set bin

the cache-tree looks like this:

 [ root (left: 0, size: 6) ]
  |
  +---- [ bin/ (left: 2, size: 1) ] 
  |
  +---- [ examples/ (left: 3, size: 1) ]

The tree OIDs at each of "root", "bin" and "examples" match the trees
for the nodes in the non-sparse cache tree.

By contrast, including any subtrees of "examples" would cause a few
things to happen:

1. The size of the cache-tree would increase (significantly in large
   repos).

2. The contained cache-tree nodes would have size _zero_, since there
   are no cache entries defined within their directories.

I think in this pruned form, the cache-tree is valid and serves all of
the roles it was designed for. We can quickly compute the root tree in
'git commit'. We can navigate the cache-tree in traverse_by_cache_tree()
to accurately describe the trees in the index. These things are tested
in t1092 and the p2000 performance test.

Does this satisfy your concern, or is there a larger expectation of the
cache-tree extension that I am not taking into account?

Thanks,
-Stolee