On 4/20/2020 11:08 PM, Elijah Newren wrote:
> On Mon, Apr 20, 2020 at 7:11 PM Matheus Tavares Bernardino
> <matheus.bernardino@xxxxxx> wrote:
>>
>> Hi, Elijah, Stolee and others
>>
>> On Tue, Mar 24, 2020 at 7:55 PM Matheus Tavares Bernardino
>> <matheus.bernardino@xxxxxx> wrote:
>>>
>>> On Tue, Mar 24, 2020 at 4:15 AM Elijah Newren <newren@xxxxxxxxx> wrote:
>>>>
>>>> On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares
>>>> <matheus.bernardino@xxxxxx> wrote:
>>>>>
>>>>> Something I'm not entirely sure about in this patch is how we implement
>>>>> the mechanism to honor sparsity for the `git grep <commit-ish>` case
>>>>> (which is treated in the grep_tree() function). Currently, the patch
>>>>> looks for an index entry that matches the path, and then checks its
>>>>> skip_worktree bit.
>>>>
>>>> As you discuss below, checking the index is both wrong _and_ costly.
>>>> You should use the sparsity patterns; Stolee did a lot of work to make
>>>> those correspond to simple hashes you could check to determine whether
>>>> to even walk into a subdirectory.
>> [...]
>>> OK, makes sense.
>>
>> I've been working on the file skipping mechanism using the sparsity
>> patterns directly. But I'm uncertain about some implementation
>> details. So I wanted to share my current plan with you, to get some
>> feedback before going deeper.
>>
>> The first idea was to load the sparsity patterns a priori and pass
>> them to grep_tree(), which recursively greps the entries of a given
>> tree object. If --recurse-submodules is given, however, we would also
>> need to load each subrepo's sparse-checkout file on the fly (as the
>> subrepos are lazily initialized in grep_tree()'s call chain). That's
>> not a problem on its own. But in the most naive implementation, this
>> means unnecessarily re-loading the sparse-checkout files of the
>> submodules for each tree given to git-grep (as grep_tree() is called
>> separately for each one of them).
>
> Wouldn't loading the sparse-checkout files be fast compared to
> grepping a submodule for matching strings? And not just fast, but
> essentially in the noise and hard to even measure? I have a hard time
> fathoming parsing the sparse-checkout file for a submodule somehow
> appreciably affecting the cost of grepping through that submodule. If
> the submodule has a huge number of sparse-checkout patterns, that'll
> be because it has a ginormous number of files, and grepping through
> them all would take way, way longer. If the submodule only has a few
> files, then the sparse-checkout file is only going to be a few lines
> at most.
>
> Also, from another angle: I think the original intent of submodules
> was an alternate form of sparse-checkout/partial-clone, letting people
> deal with just their piece of the repo. As such, do we really even
> expect people to use sparse-checkouts and submodules together, let
> alone use them very heavily together? Sure, someone will use them,
> but I have a hard time imagining use of both features at a scale
> heavy enough for this to matter, especially since it also requires
> specifying multiple trees to grep (which is slightly unusual) in
> addition to the combination of these other features before your
> optimization here could kick in and be worthwhile.
>
> I'd be very tempted to just implement the most naive implementation
> and maybe leave a TODO note in the code for some future person to come
> along and optimize if it really matters, but I'd like to see numbers
> before we spend the development and maintenance effort on it, because
> I'm having a hard time imagining any scale where it could matter.
>
>> So my next idea was to implement a cache, mapping 'struct repository's
>> to 'struct pattern_list'. Well, not 'struct repository' itself, but
>> repo->gitdir.
>> This way we could load each file once, store the pattern
>> list, and quickly retrieve the one that affects the repository
>> currently being grepped, whether it is a submodule or not. But is
>> gitdir unique per repository? If not, could we use
>> repo_git_path(repo, "info/sparse-checkout") as the key?
>>
>> I already have a prototype implementation of the last idea (using
>> repo_git_path()). But I wanted to make sure: does this seem like a
>> good path? Or should we avoid the work of having this hashmap here and
>> do something else, such as adding a 'struct pattern_list' to 'struct
>> repository' directly?
>
> Honestly, it sounds a bit like premature optimization to me. Sorry if
> that's disappointing since you've apparently already put some effort
> into this, and it sounds like you're on a good track for optimizing
> this if it were necessary, but I'm just having a hard time figuring
> out whether it'd really help and be worth the code complexity.

My initial thought was to use a stack or queue. It depends on how
git-grep treats submodules.

Imagine directories A, B, C, where B is a submodule.

If results from 'B' are output between results from 'A' and 'C', then
use a stack: "push" the latest sparse-checkout patterns as you deepen
into a submodule, then "pop" the patterns as you leave it.

If results from 'B' are output after results from 'C', then you could
possibly use a queue instead. I find this unlikely, and it would behave
strangely for nested submodules.

Since 'struct pattern_list' has most of the information you require,
it should not be challenging to create a list of them.

Hopefully that provides some ideas.

Thanks,
-Stolee