Junio C Hamano wrote: > "Victoria Dye via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes: > >> From: Victoria Dye <vdye@xxxxxxxxxx> >> >> If a sparse directory does not exist in the index, unpack it at the >> directory level rather than recursing into it an unpacking its contents >> file-by-file. This helps keep the sparse index as collapsed as possible in >> cases such as 'git reset --hard' restoring a sparse directory. > > My reading hiccuped at "does not exist in". After reading the above > twice, I think the situation is that there is a directory A/B in > which a file C sits, and A/ is marked as sparse and the paragraph is > talking about directory A/B. Because the index has A/ as a tree in > index, A/B does not exist in the index. > > When we _somehow_ need to ensure that A/B exists in the index for > _unknown_ reason, we could flatten the index fully and have A/B/C as > a blob (without A or A/B in the index proper, even though they may > appear in cache-tree), but if we stop at removing A and adding A/B > still as a tree without showing A/B/C in the index, we may gain > efficiency, under the assumption that we do not have to access A/B/C > and its siblings. > > Did I read the intention correctly? I suspect future readers of the > commit that would result from this patch would also wonder what the > "somehow" and "unknown" above are. > If I'm reading this correctly, it's not quite what I meant - the situation this patch addresses is when _nothing_ in the tree rooted at 'A/' (files, subdirectories) exists in the index, but 'unpack_trees()' is merging tree(s) where 'A/' *does* exist. I originally wanted to write "If a sparse directory *is deleted from* the index", but that doesn't cover the case where 'A/' never existed in the index and we're merging in tree(s) that introduce it. Maybe it would be clearer to describe it with a different perspective: "If 'unpack_callback()' is merging a new tree into a sparse index, merge the tree as a sparse directory rather than traversing its contents" or something like that? I'll try to come up with a better way of explaining this and update in V2. >> A directory is determined to be truly non-existent in the index (rather than >> the parent of existing index entries), if 1) its path is outside the sparse >> cone and 2) there are no children of the directory in the index. This check >> is performed by 'missing_dir_is_sparse()' in 'unpack_single_entry()'. If the >> directory is a missing sparse dir, 'unpack_single_entry()' will proceed >> with unpacking it. This determination is also propagated back up to >> 'unpack_callback()' via 'is_missing_sparse_dir' to prevent further tree >> traversal into the unpacked directory. > > Makes sense. > >> diff --git a/unpack-trees.c b/unpack-trees.c >> index 8a454e03bff..aa62cef20fe 100644 >> --- a/unpack-trees.c >> +++ b/unpack-trees.c >> @@ -1069,6 +1069,53 @@ static struct cache_entry *create_ce_entry(const struct traverse_info *info, >> return ce; >> } >> >> +/* >> + * Determine whether the path specified corresponds to a sparse directory >> + * completely missing from the index. This function is assumed to only be >> + * called when the named path isn't already in the index. >> + */ >> +static int missing_dir_is_sparse(const struct traverse_info *info, >> + const struct name_entry *p) >> +{ >> + int res, pos; >> + struct strbuf dirpath = STRBUF_INIT; >> + struct unpack_trees_options *o = info->data; >> + >> + /* >> + * First, check whether the path is in the sparse cone. If it is, >> + * then this directory shouldn't be sparse. >> + */ >> + strbuf_add(&dirpath, info->traverse_path, info->pathlen); >> + strbuf_add(&dirpath, p->path, p->pathlen); >> + strbuf_addch(&dirpath, '/'); >> + if (path_in_cone_mode_sparse_checkout(dirpath.buf, o->src_index)) { >> + res = 0; >> + goto cleanup; >> + } > > OK. > >> + /* >> + * Given that the directory is not inside the sparse cone, it could be >> + * (partially) expanded in the index. If child entries exist, the path >> + * is not a missing sparse directory. >> + */ >> + pos = index_name_pos_sparse(o->src_index, dirpath.buf, dirpath.len); >> + if (pos >= 0) >> + BUG("cache entry '%s%s' shouldn't exist in the index", >> + info->traverse_path, p->path); > > So, we fed 'p' to this function, and we didn't expect it to exist in > the index if it is outside the sparse cone. > Correct; if an index entry with the name 'p' existed, 'src[0]' would not be NULL in 'unpack_callback()'/'unpack_single_entry()' and this function wouldn't have been called. >> + pos = -pos - 1; > > This is the location that a cache entry for the dirpath.buf would > sit in the index if it were a sparse entry. > >> + if (pos >= o->src_index->cache_nr) { >> + res = 1; >> + goto cleanup; >> + } >> + res = strncmp(o->src_index->cache[pos]->name, dirpath.buf, dirpath.len); > > So, we found where dirpath.buf would be inserted. If we (can) look > at the cache entry that is currently sitting at that position, and > find that it is a path inside it, then res becomes zero. IOW, we > found that the sparse directory is missing in the index, but there > is a path that is in the directory recorded in the index. > > The current entry, on the other hand, is outside the dirpath.buf, > then res becomes non-zero. We are asked to yield true (i.e. > nonzero) if and only if the given directory and all paths in that > directory are missing from the index, so that is what happens here. > Sounds OK. > > And the "out of bounds" check just means that the entries that sit > near the end of the index sort strictly before our dirpath.buf, so > they cannot be inside our directory, which is also the case where we > are expected to yield "true". > > OK. > >> + >> +cleanup: >> + strbuf_release(&dirpath); >> + return res; >> +} >> + >> /* >> * Note that traverse_by_cache_tree() duplicates some logic in this function >> * without actually calling it. If you change the logic here you may need to >> @@ -1078,21 +1125,40 @@ static int unpack_single_entry(int n, unsigned long mask, >> unsigned long dirmask, >> struct cache_entry **src, >> const struct name_entry *names, >> - const struct traverse_info *info) >> + const struct traverse_info *info, >> + int *is_missing_sparse_dir) >> { >> int i; >> struct unpack_trees_options *o = info->data; >> unsigned long conflicts = info->df_conflicts | dirmask; >> + const struct name_entry *p = names; >> >> - if (mask == dirmask && !src[0]) >> - return 0; >> + *is_missing_sparse_dir = 0; >> + if (mask == dirmask && !src[0]) { >> + /* >> + * If the directory is completely missing from the index but >> + * would otherwise be a sparse directory, we should unpack it. >> + * If not, we'll return and continue recursively traversing the >> + * tree. >> + */ >> + if (!o->src_index->sparse_index) >> + return 0; >> + >> + /* Find first entry with a real name (we could use "mask" too) */ >> + while (!p->mode) >> + p++; >> + >> + *is_missing_sparse_dir = missing_dir_is_sparse(info, p); >> + if (!*is_missing_sparse_dir) >> + return 0; > > Interesting. Nobody checked if the name in 'p' is a directory up to > this point, but missing_dir_is_sparse() does not care? The only > thing we know is that !src[0], so the name represented by 'p' does > not exist in the index. IIRC, the function only checked if p names > a path inside or outside the sparse cone and if there are or are not > paths that would be inside the path if p _were_ a directory but I > didn't read it carefully what it did when the caller fed a path to a > non-directory to the function. As long as it will say "no" to such > a situation, I won't complain, but I have to wonder how the logic > around here tells apart a case where a sparse directory is > completely hidden (which missing_dir_is_sparse() gives us) from a > case where there is absolutely nothing, not a directory and not a > blob, at the path in the index. Or perhaps it works correctly > without having to tell them apart? I dunno. > I wrote 'missing_dir_is_sparse()' in an attempt keep some complex logic out of the already-complicated 'unpack_single_entry()', but as part of that it relies on information already established by its caller. We know 'p' is a directory because 'missing_dir_is_sparse()' is called inside a 'mask == dirmask' condition. 'mask' is a representation of which trees being traversed have an entry with the given name and 'dirmask' is a representation of which of those entries are directories, so the only way 'mask == dirmask' and 'p' is *not* a directory is if the currently-traversed entries in all of the trees do not exist. *That* won't happen because 'unpack_callback()' won't be invoked at all by 'traverse_trees()' if 'mask' is 0. Given that it requires jumping through multiple function invocations and callbacks to figure that out, I can add some assertions or 'return 0' conditions at the beginning of 'missing_dir_is_sparse()' to codify its assumptions. Even if the assertions are slightly redundant now, they'll make the code clearer and make the function safer for reuse. > By the way, there is a comment before the function that reminds us > that traverse_by_cache_tree() may need a matching change whenever a > change is made to this function. Does this require such a matching > change? In this case, I *think* a parallel change there is unnecessary. 'traverse_by_cache_tree()' iterates on the contents of the index, so its equivalent of 'src[0]' always exists when it merges individual entries. And, to trigger that function, the cache tree needs to identically match the tree(s) being merged; if a subtree doesn't exist in either the index or any of the merging trees, there's nothing to merge. In any case, thanks for the reminder - I didn't notice the comment while implementing this, so it could have just as easily needed fixing.