While working on sparse index integration for 'git rm' [1], Shaoxuan found
that removed sparse directories, when reset, would no longer be sparse. This
was due to how 'unpack_trees()' determined whether a traversed directory was
a sparse directory or not; it would only unpack an entry as a sparse
directory if it existed in the index. However, if the sparse directory was
removed, it would be treated like a non-sparse directory and its contents
would be individually unpacked.

To avoid this unnecessary traversal and keep the results of 'reset' as
sparse as possible, the decision logic for whether a directory is sparse is
changed to:

 * If the directory is a sparse directory in the index, unpack it.
 * If not, is the directory inside the sparse cone? If so, do not unpack it.
 * If the directory is outside the sparse cone, does it have any child
   entries in the index? If so, do not unpack it.
 * Otherwise, unpack the entry as a sparse directory.

In the process of updating 'reset', a separate issue was found in 'checkout'
where collapsed sparse directories did not have modified contents reported
file-by-file. A similar bug was found with 'status' in 2c521b0e49 (status:
fix nested sparse directory diff in sparse index, 2022-03-01), and
'checkout' was corrected the same way (setting the diff flag 'recursive' to
1).


Changes since V1
================

 * Reverted the removal of 'index_entry_exists()' to avoid breaking other
   in-flight series.
 * Renamed 'is_missing_sparse_dir()' to 'is_new_sparse_dir()'; revised
   comments and commit messages to clarify what that function is doing and
   why.
 * Handled "unexpected" inputs to 'is_new_sparse_dir()' more gently,
   returning 0 if 'p' is not a directory or the directory already exists in
   the index (rather than exiting with 'BUG()'). This is intended to make
   'is_new_sparse_dir()' less reliant on information about the index
   established by 'unpack_callback()' & 'unpack_single_entry()', resulting
   in easier-to-read and more reusable code.

Thanks!
 * Victoria

[1] https://lore.kernel.org/git/20220803045118.1243087-1-shaoxuan.yuan02@xxxxxxxxx/

Victoria Dye (4):
  checkout: fix nested sparse directory diff in sparse index
  oneway_diff: handle removed sparse directories
  cache.h: create 'index_name_pos_sparse()'
  unpack-trees: unpack new trees as sparse directories

 builtin/checkout.c                       |   1 +
 cache.h                                  |   9 ++
 diff-lib.c                               |   5 ++
 read-cache.c                             |   5 ++
 t/t1092-sparse-checkout-compatibility.sh |  25 ++++++
 unpack-trees.c                           | 106 ++++++++++++++++++++---
 6 files changed, 141 insertions(+), 10 deletions(-)


base-commit: 4af7188bc97f70277d0f10d56d5373022b1fa385
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1312%2Fvdye%2Freset%2Fhandle-missing-dirs-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1312/vdye/reset/handle-missing-dirs-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/1312

Range-diff vs v1:

 1:  255318f4dc6 = 1:  255318f4dc6 checkout: fix nested sparse directory diff in sparse index
 2:  55c77ba4b29 = 2:  55c77ba4b29 oneway_diff: handle removed sparse directories
 3:  f7978d223fe ! 3:  d0bdec63286 cache.h: replace 'index_entry_exists()' with 'index_name_pos_sparse()'
     @@ Metadata
      Author: Victoria Dye <vdye@xxxxxxxxxx>

       ## Commit message ##
     -    cache.h: replace 'index_entry_exists()' with 'index_name_pos_sparse()'
     +    cache.h: create 'index_name_pos_sparse()'

     -    Replace 'index_entry_exists()' (which returns a binary '1' or '0' depending
     -    on whether a specified entry exists in the index) with
     -    'index_name_pos_sparse()' (which behaves the same as 'index_name_pos()',
     +    Add 'index_name_pos_sparse()', which behaves the same as 'index_name_pos()',
          except that it does not expand a sparse index to search for an entry inside
     -    a sparse directory).
     +    a sparse directory.

     -    'index_entry_exists()' was original implemented in 20ec2d034c (reset: make
     -    sparse-aware (except --mixed), 2021-11-29) to allow callers to search for an
     -    index entry without expanding a sparse index. That particular case only
     -    required knowing whether the requested entry existed. This patch expands the
     -    amount of information returned by indicating both 1) whether the entry
     -    exists, and 2) its position (or potential position) in the index.
     +    'index_entry_exists()' was originally implemented in 20ec2d034c (reset: make
     +    sparse-aware (except --mixed), 2021-11-29) as an alternative to
     +    'index_name_pos()' to allow callers to search for an index entry without
     +    expanding a sparse index. However, that particular use case only required
     +    knowing whether the requested entry existed, so 'index_entry_exists()' does
     +    not return the index positioning information provided by 'index_name_pos()'.

     -    Signed-off-by: Victoria Dye <vdye@xxxxxxxxxx>
     +    This patch implements 'index_name_pos_sparse()' to accommodate callers that
     +    need the positioning information of 'index_name_pos()', but do not want to
     +    expand the index.

     - ## cache-tree.c ##
     -@@ cache-tree.c: static void prime_cache_tree_rec(struct repository *r,
     - 	 * as normal.
     - 	 */
     - 	if (r->index->sparse_index &&
     --	    index_entry_exists(r->index, tree_path->buf, tree_path->len))
     -+	    index_name_pos_sparse(r->index, tree_path->buf, tree_path->len) >= 0)
     - 		prime_cache_tree_sparse_dir(sub->cache_tree, subtree);
     - 	else
     - 		prime_cache_tree_rec(r, sub->cache_tree, subtree, tree_path);
     +    Signed-off-by: Victoria Dye <vdye@xxxxxxxxxx>

       ## cache.h ##
      @@ cache.h: struct cache_entry *index_file_exists(struct index_state *istate, const char *na
     +  */
       int index_name_pos(struct index_state *, const char *name, int namelen);

     - /*
     -- * Determines whether an entry with the given name exists within the
     -- * given index. The return value is 1 if an exact match is found, otherwise
     -- * it is 0. Note that, unlike index_name_pos, this function does not expand
     -- * the index if it is sparse. If an item exists within the full index but it
     -- * is contained within a sparse directory (and not in the sparse index), 0 is
     -- * returned.
     -- */
     --int index_entry_exists(struct index_state *, const char *name, int namelen);
     ++/*
      + * Like index_name_pos, returns the position of an entry of the given name in
      + * the index if one exists, otherwise returns a negative value where the negated
      + * value minus 1 is the position where the index entry would be inserted. Unlike
     @@ cache.h: struct cache_entry *index_file_exists(struct index_state *istate, const
      + * inside a sparse directory.
      + */
      +int index_name_pos_sparse(struct index_state *, const char *name, int namelen);
     -
     ++
       /*
     -  * Some functions return the negative complement of an insert position when a
     +  * Determines whether an entry with the given name exists within the
     +  * given index. The return value is 1 if an exact match is found, otherwise

       ## read-cache.c ##
      @@ read-cache.c: int index_name_pos(struct index_state *istate, const char *name, int namelen)
       	return index_name_stage_pos(istate, name, namelen, 0, EXPAND_SPARSE);
       }

     --int index_entry_exists(struct index_state *istate, const char *name, int namelen)
      +int index_name_pos_sparse(struct index_state *istate, const char *name, int namelen)
     - {
     --	return index_name_stage_pos(istate, name, namelen, 0, NO_EXPAND_SPARSE) >= 0;
     ++{
      +	return index_name_stage_pos(istate, name, namelen, 0, NO_EXPAND_SPARSE);
     - }
     -
     - int remove_index_entry_at(struct index_state *istate, int pos)
     ++}
     ++
     + int index_entry_exists(struct index_state *istate, const char *name, int namelen)
     + {
     + 	return index_name_stage_pos(istate, name, namelen, 0, NO_EXPAND_SPARSE) >= 0;
 4:  016971a6711 ! 4:  97ca668102c unpack-trees: handle missing sparse directories
     @@ Metadata
      Author: Victoria Dye <vdye@xxxxxxxxxx>

       ## Commit message ##
     -    unpack-trees: handle missing sparse directories
     +    unpack-trees: unpack new trees as sparse directories

     -    If a sparse directory does not exist in the index, unpack it at the
     -    directory level rather than recursing into it an unpacking its contents
     -    file-by-file. This helps keep the sparse index as collapsed as possible in
     -    cases such as 'git reset --hard' restoring a sparse directory.
     +    If 'unpack_single_entry()' is unpacking a new directory tree (that is, one
     +    not already present in the index) into a sparse index, unpack the tree as a
     +    sparse directory rather than traversing its contents and unpacking each file
     +    individually. This helps keep the sparse index as collapsed as possible in
     +    cases such as 'git reset --hard' restoring a outside-of-cone directory
     +    removed with 'git rm -r --sparse'.

     -    A directory is determined to be truly non-existent in the index (rather than
     -    the parent of existing index entries), if 1) its path is outside the sparse
     -    cone and 2) there are no children of the directory in the index. This check
     -    is performed by 'missing_dir_is_sparse()' in 'unpack_single_entry()'. If the
     -    directory is a missing sparse dir, 'unpack_single_entry()' will proceed
     -    with unpacking it. This determination is also propagated back up to
     -    'unpack_callback()' via 'is_missing_sparse_dir' to prevent further tree
     -    traversal into the unpacked directory.
     +    Without this patch, 'unpack_single_entry()' will only unpack a directory
     +    into the index as a sparse directory (rather than traversing into it and
     +    unpacking its files one-by-one) if an entry with the same name already
     +    exists in the index. This patch allows sparse directory unpacking without a
     +    matching index entry when the following conditions are met:
     +
     +    1. the directory's path is outside the sparse cone, and
     +    2. there are no children of the directory in the index
     +
     +    If a directory meets these requirements (as determined by
     +    'is_new_sparse_dir()'), 'unpack_single_entry()' unpacks the sparse directory
     +    index entry and propagates the decision back up to 'unpack_callback()' to
     +    prevent unnecessary tree traversal into the unpacked directory.

          Reported-by: Shaoxuan Yuan <shaoxuan.yuan02@xxxxxxxxx>
          Signed-off-by: Victoria Dye <vdye@xxxxxxxxxx>
     @@ unpack-trees.c: static struct cache_entry *create_ce_entry(const struct traverse
       }

      +/*
     -+ * Determine whether the path specified corresponds to a sparse directory
     -+ * completely missing from the index. This function is assumed to only be
     -+ * called when the named path isn't already in the index.
     ++ * Determine whether the path specified by 'p' should be unpacked as a new
     ++ * sparse directory in a sparse index. A new sparse directory 'A/':
     ++ * - must be outside the sparse cone.
     ++ * - must not already be in the index (i.e., no index entry with name 'A/'
     ++ *   exists).
     ++ * - must not have any child entries in the index (i.e., no index entry
     ++ *   'A/<something>' exists).
     ++ * If 'p' meets the above requirements, return 1; otherwise, return 0.
      + */
     -+static int missing_dir_is_sparse(const struct traverse_info *info,
     -+				 const struct name_entry *p)
     ++static int entry_is_new_sparse_dir(const struct traverse_info *info,
     ++				   const struct name_entry *p)
      +{
      +	int res, pos;
      +	struct strbuf dirpath = STRBUF_INIT;
      +	struct unpack_trees_options *o = info->data;
      +
     ++	if (!S_ISDIR(p->mode))
     ++		return 0;
     ++
      +	/*
     -+	 * First, check whether the path is in the sparse cone. If it is,
     -+	 * then this directory shouldn't be sparse.
     ++	 * If the path is inside the sparse cone, it can't be a sparse directory.
      +	 */
      +	strbuf_add(&dirpath, info->traverse_path, info->pathlen);
      +	strbuf_add(&dirpath, p->path, p->pathlen);
     @@ unpack-trees.c: static struct cache_entry *create_ce_entry(const struct traverse
      +		goto cleanup;
      +	}
      +
     -+	/*
     -+	 * Given that the directory is not inside the sparse cone, it could be
     -+	 * (partially) expanded in the index. If child entries exist, the path
     -+	 * is not a missing sparse directory.
     -+	 */
      +	pos = index_name_pos_sparse(o->src_index, dirpath.buf, dirpath.len);
     -+	if (pos >= 0)
     -+		BUG("cache entry '%s%s' shouldn't exist in the index",
     -+		    info->traverse_path, p->path);
     ++	if (pos >= 0) {
     ++		/* Path is already in the index, not a new sparse dir */
     ++		res = 0;
     ++		goto cleanup;
     ++	}
      +
     ++	/* Where would this sparse dir be inserted into the index? */
      +	pos = -pos - 1;
      +	if (pos >= o->src_index->cache_nr) {
     ++		/*
     ++		 * Sparse dir would be inserted at the end of the index, so we
     ++		 * know it has no child entries.
     ++		 */
      +		res = 1;
      +		goto cleanup;
      +	}
      +
     ++	/*
     ++	 * If the dir has child entries in the index, the first would be at the
     ++	 * position the sparse directory would be inserted. If the entry at this
     ++	 * position is inside the dir, not a new sparse dir.
     ++	 */
      +	res = strncmp(o->src_index->cache[pos]->name, dirpath.buf, dirpath.len);
      +
      +cleanup:
     @@ unpack-trees.c: static int unpack_single_entry(int n, unsigned long mask,
       			       const struct name_entry *names,
      -			       const struct traverse_info *info)
      +			       const struct traverse_info *info,
     -+			       int *is_missing_sparse_dir)
     ++			       int *is_new_sparse_dir)
       {
       	int i;
       	struct unpack_trees_options *o = info->data;
      @@ unpack-trees.c: static int unpack_single_entry(int n, unsigned long mask,

      -	if (mask == dirmask && !src[0])
      -		return 0;
     -+	*is_missing_sparse_dir = 0;
     ++	*is_new_sparse_dir = 0;
      +	if (mask == dirmask && !src[0]) {
      +		/*
     -+		 * If the directory is completely missing from the index but
     -+		 * would otherwise be a sparse directory, we should unpack it.
     -+		 * If not, we'll return and continue recursively traversing the
     -+		 * tree.
     ++		 * If we're not in a sparse index, we can't unpack a directory
     ++		 * without recursing into it, so we return.
      +		 */
      +		if (!o->src_index->sparse_index)
      +			return 0;
     @@ unpack-trees.c: static int unpack_single_entry(int n, unsigned long mask,
      +		while (!p->mode)
      +			p++;
      +
     -+		*is_missing_sparse_dir = missing_dir_is_sparse(info, p);
     -+		if (!*is_missing_sparse_dir)
     ++		/*
     ++		 * If the directory is completely missing from the index but
     ++		 * would otherwise be a sparse directory, we should unpack it.
     ++		 * If not, we'll return and continue recursively traversing the
     ++		 * tree.
     ++		 */
     ++		*is_new_sparse_dir = entry_is_new_sparse_dir(info, p);
     ++		if (!*is_new_sparse_dir)
      +			return 0;
      +	}

      @@ unpack-trees.c: static int unpack_single_entry(int n, unsigned long mask,
      -	if (mask == dirmask && src[0] &&
      -	    S_ISSPARSEDIR(src[0]->ce_mode))
      +	if (mask == dirmask &&
     -+	    (*is_missing_sparse_dir || (src[0] && S_ISSPARSEDIR(src[0]->ce_mode))))
     ++	    (*is_new_sparse_dir || (src[0] && S_ISSPARSEDIR(src[0]->ce_mode))))
       		conflicts = 0;

       	/*
     @@ unpack-trees.c: static int unpack_sparse_callback(int n, unsigned long mask, uns
       	struct cache_entry *src[MAX_UNPACK_TREES + 1] = { NULL, };
       	struct unpack_trees_options *o = info->data;
      -	int ret;
     -+	int ret, is_missing_sparse_dir;
     ++	int ret, is_new_sparse_dir;

       	assert(o->merge);

      @@ unpack-trees.c: static int unpack_sparse_callback(int n, unsigned long mask, uns
       	 * 'dirmask' accordingly.
       	 */
      -	ret = unpack_single_entry(n - 1, mask >> 1, dirmask >> 1, src, names + 1, info);
     -+	ret = unpack_single_entry(n - 1, mask >> 1, dirmask >> 1, src, names + 1, info, &is_missing_sparse_dir);
     ++	ret = unpack_single_entry(n - 1, mask >> 1, dirmask >> 1, src, names + 1, info, &is_new_sparse_dir);

       	if (src[0])
       		discard_cache_entry(src[0]);
     @@ unpack-trees.c: static int unpack_callback(int n, unsigned long mask, unsigned l
       	struct cache_entry *src[MAX_UNPACK_TREES + 1] = { NULL, };
       	struct unpack_trees_options *o = info->data;
       	const struct name_entry *p = names;
     -+	int is_missing_sparse_dir;
     ++	int is_new_sparse_dir;

       	/* Find first entry with a real name (we could use "mask" too) */
       	while (!p->mode)
      @@ unpack-trees.c: static int unpack_callback(int n, unsigned long mask, unsigned l
       	}

      -	if (unpack_single_entry(n, mask, dirmask, src, names, info) < 0)
     -+	if (unpack_single_entry(n, mask, dirmask, src, names, info, &is_missing_sparse_dir))
     ++	if (unpack_single_entry(n, mask, dirmask, src, names, info, &is_new_sparse_dir))
       		return -1;

       	if (o->merge && src[0]) {
      @@ unpack-trees.c: static int unpack_callback(int n, unsigned long mask, unsigned l
       		}

       		if (!is_sparse_directory_entry(src[0], names, info) &&
     -+		    !is_missing_sparse_dir &&
     ++		    !is_new_sparse_dir &&
       		    traverse_trees_recursive(n, dirmask, mask & ~dirmask,
       						    names, info) < 0) {
       			return -1;

-- 
gitgitgadget
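
For readers unfamiliar with the "negative insert position" convention that
'index_name_pos_sparse()' shares with 'index_name_pos()', the child-entry check
in patch 4 can be illustrated outside of Git. The following is only a minimal
sketch under stated assumptions: 'struct toy_index', 'toy_name_pos()' and
'toy_is_new_sparse_dir()' are hypothetical names invented for this
illustration, the index is modelled as a plain sorted array of strings, and
the sparse-cone test is reduced to an 'in_cone' flag passed by the caller. It
is not the code from the series.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for the index: entry names sorted bytewise.
 * Git's real index holds 'struct cache_entry' objects instead.
 */
struct toy_index {
	const char **names;
	size_t nr;
};

/*
 * Binary search using the return convention described for
 * 'index_name_pos_sparse()': the position if 'name' is found,
 * otherwise -(insert position) - 1.
 */
static int toy_name_pos(const struct toy_index *idx, const char *name)
{
	size_t lo = 0, hi = idx->nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		int cmp = strcmp(name, idx->names[mid]);

		if (!cmp)
			return (int)mid;
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return -(int)lo - 1;
}

/*
 * Return 1 if 'dir' (which must end in '/') could be added as a new sparse
 * directory: outside the cone, not already an entry, and with no entry
 * beneath it. 'in_cone' is an assumption of this sketch; the real code
 * derives it from the sparse-checkout patterns.
 */
static int toy_is_new_sparse_dir(const struct toy_index *idx,
				 const char *dir, int in_cone)
{
	size_t len = strlen(dir);
	int pos;

	assert(len > 0 && dir[len - 1] == '/');

	if (in_cone)
		return 0;

	pos = toy_name_pos(idx, dir);
	if (pos >= 0)
		return 0;	/* already present, so not "new" */

	pos = -pos - 1;		/* where 'dir' would be inserted */
	if ((size_t)pos >= idx->nr)
		return 1;	/* would sort last: nothing can be beneath it */

	/* A child 'dir<something>' would sort exactly at 'pos'. */
	return strncmp(idx->names[pos], dir, len) != 0;
}
```

Because 'A/' sorts immediately before every 'A/<something>', a single lookup
plus one prefix comparison is enough; no scan of the index is needed. Note
that the sketch normalizes the prefix comparison to 0 or 1, whereas the patch
stores the raw 'strncmp()' result in 'res'.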