On Wed, Mar 31, 2021 at 6:50 PM Derrick Stolee via GitGitGadget <gitgitgadget@xxxxxxxxx> wrote: > > From: Derrick Stolee <dstolee@xxxxxxxxxxxxx> > > Some users of the index API have a specific path they are looking for, > but choose to use index_file_exists() to rely on the name-hash hashtable > instead of doing binary search with index_name_pos(). These users only > need to know a yes/no answer, not a position within the cache array. > > When the index is sparse, the name-hash hash table does not contain the > full list of paths within sparse directories. It _does_ contain the > directory names for the sparse-directory entries. > > Create a helper function, expand_to_path(), for intended use with the > name-hash hashtable functions. The integration with name-hash.c will > follow in a later change. > > The solution here is to use ensure_full_index() when we determine that > the requested path is within a sparse directory entry. This will > populate the name-hash hashtable as the index is recomputed from > scratch. > > There may be cases where the caller is trying to find an untracked path > that is not in the index but also is not within a sparse directory > entry. We want to minimize the overhead for these requests. If we used > index_name_pos() to find the insertion order of the path, then we could > determine from that position if a sparse-directory exists. (In fact, > just calling index_name_pos() in that case would lead to expanding the > index to a full index.) However, this takes O(log N) time where N is the > number of cache entries. > > To keep the performance of this call based mostly on the input string, > use index_file_exists() to look for the ancestors of the path. Using the > heuristic that a sparse directory is likely to have a small number of > parent directories, we start from the bottom and build up. Use a string > buffer to allow mutating the path name to terminate after each slash for > each hashset test. > > Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx> > --- > sparse-index.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++++ > sparse-index.h | 13 +++++++++ > 2 files changed, 85 insertions(+) > > diff --git a/sparse-index.c b/sparse-index.c > index 95ea17174da3..8a1223041296 100644 > --- a/sparse-index.c > +++ b/sparse-index.c > @@ -283,3 +283,75 @@ void ensure_full_index(struct index_state *istate) > > trace2_region_leave("index", "ensure_full_index", istate->repo); > } > + > +/* > + * This static global helps avoid infinite recursion between > + * expand_to_path() and index_file_exists(). > + */ > +static int in_expand_to_path = 0; > + > +void expand_to_path(struct index_state *istate, > + const char *path, size_t pathlen, int icase) > +{ > + struct strbuf path_mutable = STRBUF_INIT; > + size_t substr_len; > + > + /* prevent extra recursion */ > + if (in_expand_to_path) > + return; > + > + if (!istate || !istate->sparse_index) > + return; > + > + if (!istate->repo) > + istate->repo = the_repository; > + > + in_expand_to_path = 1; > + > + /* > + * We only need to actually expand a region if the > + * following are both true: > + * > + * 1. 'path' is not already in the index. > + * 2. Some parent directory of 'path' is a sparse directory. > + */ > + > + if (index_file_exists(istate, path, pathlen, icase)) > + goto cleanup; > + > + strbuf_add(&path_mutable, path, pathlen); > + strbuf_addch(&path_mutable, '/'); > + > + /* Check the name hash for all parent directories */ > + substr_len = 0; > + while (substr_len < pathlen) { > + char temp; > + char *replace = strchr(path_mutable.buf + substr_len, '/'); > + > + if (!replace) > + break; > + > + /* replace the character _after_ the slash */ > + replace++; > + temp = *replace; > + *replace = '\0'; > + if (index_file_exists(istate, path_mutable.buf, > + path_mutable.len, icase)) { > + /* > + * We found a parent directory in the name-hash > + * hashtable, which means that this entry could > + * exist within a sparse-directory entry. Expand > + * accordingly. "this entry" is a bit unclear; it might be worth referring to the "path" variable instead. I think it'd also help to explain the reasoning for using the '/' character, because while it's clear when looking at the series, just running across it in the code it might be easy to forget or miss. Perhaps: * We found a parent directory in the name-hash * hashtable, because only sparse directory entries * have a trailing '/' character. Since "path" wasn't * in the index, perhaps it exists within this * sparse-directory. Expand accordingly. > + */ > + ensure_full_index(istate); > + break; > + } > + > + *replace = temp; > + substr_len = replace - path_mutable.buf; > + } > + > +cleanup: > + strbuf_release(&path_mutable); > + in_expand_to_path = 0; > +} > diff --git a/sparse-index.h b/sparse-index.h > index 0268f38753c0..1115a0d7dd98 100644 > --- a/sparse-index.h > +++ b/sparse-index.h > @@ -4,6 +4,19 @@ > struct index_state; > int convert_to_sparse(struct index_state *istate); > > +/* > + * Some places in the codebase expect to search for a specific path. > + * This path might be outside of the sparse-checkout definition, in > + * which case a sparse-index may not contain a path for that index. > + * > + * Given an index and a path, check to see if a leading directory for > + * 'path' exists in the index as a sparse directory. In that case, > + * expand that sparse directory to a full range of cache entries and > + * populate the index accordingly. > + */ > +void expand_to_path(struct index_state *istate, > + const char *path, size_t pathlen, int icase); > + > struct repository; > int set_sparse_index_config(struct repository *repo, int enable); > > -- > gitgitgadget