On Tue, Apr 21, 2020 at 12:08 AM Elijah Newren <newren@xxxxxxxxx> wrote:
>
> On Mon, Apr 20, 2020 at 7:11 PM Matheus Tavares Bernardino
> <matheus.bernardino@xxxxxx> wrote:
> >
> > I've been working on the file skipping mechanism using the sparsity
> > patterns directly. But I'm uncertain about some implementation
> > details, so I wanted to share my current plan with you, to get some
> > feedback before going deeper.
> >
> > The first idea was to load the sparsity patterns a priori and pass
> > them to grep_tree(), which recursively greps the entries of a given
> > tree object. If --recurse-submodules is given, however, we would
> > also need to load each subrepo's sparse-checkout file on the fly
> > (as the subrepos are lazily initialized in grep_tree()'s call
> > chain). That's not a problem on its own. But in the most naive
> > implementation, this means unnecessarily re-loading the
> > sparse-checkout files of the submodules for each tree given to
> > git-grep (as grep_tree() is called separately for each one of them).
>
> Wouldn't loading the sparse-checkout files be fast compared to
> grepping a submodule for matching strings? And not just fast, but
> essentially in the noise and hard to even measure? I have a hard
> time fathoming parsing the sparse-checkout file for a submodule
> somehow appreciably affecting the cost of grepping through that
> submodule. If the submodule has a huge number of sparse-checkout
> patterns, that'll be because it has a ginormous number of files, and
> grepping through them all would take way, way longer. If the
> submodule only has a few files, then the sparse-checkout file is
> only going to be a few lines at most.

Yeah, makes sense.

> Also, from another angle: I think the original intent of submodules
> was an alternate form of sparse-checkout/partial-clone, letting
> people deal with just their piece of the repo. As such, do we really
> even expect people to use sparse-checkouts and submodules together,
> let alone use them very heavily together? Sure, someone will use
> them, but I have a hard time imagining both features being used
> together heavily enough for this to matter, especially since it also
> requires specifying multiple trees to grep (which is slightly
> unusual) in addition to the combination of these other features
> before your optimization here could kick in and be worthwhile.
>
> I'd be very tempted to just implement the most naive implementation
> and maybe leave a TODO note in the code for some future person to
> come along and optimize if it really matters, but I'd like to see
> numbers before we spend the development and maintenance effort on it
> because I'm having a hard time imagining any scale where it could
> matter.

You're right. I guess I got a little too excited about the
optimization possibilities and neglected the fact that they might not
even be needed here.

Just to take a look at some numbers, I prototyped the naive
implementation and downloaded a testing repository[1] containing 8
submodules (or 14, counting the nested ones). For each of the
non-nested submodules, I added its .gitignore rules to the
sparse-checkout file (of course, this doesn't make any sense for
real-world usage, but I just wanted to populate the file with a large
quantity of valid rules, to test the parsing time). I also added the
rule '/*'.
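To make that concrete, the loading step of the naive prototype is
roughly the following. This is a from-memory sketch rather than the
exact code, and the dir.h names and signatures (in particular
add_patterns_from_file_to_list()) are from around that time and may
be slightly off:

#include "cache.h"
#include "dir.h"
#include "repository.h"

/*
 * Load the sparse-checkout file of the given repo into a
 * pattern_list. grep_tree() can then consult the result with
 * path_matches_pattern_list() to decide which entries to skip.
 */
static int load_sparse_checkout_patterns(struct repository *repo,
					 struct pattern_list *pl)
{
	char *sparse_file = repo_git_path(repo, "info/sparse-checkout");
	int ret;

	memset(pl, 0, sizeof(*pl));
	/* Simplification: reuse the superproject's cone-mode setting. */
	pl->use_cone_patterns = core_sparse_checkout_cone;

	/* A negative return means the file could not be read. */
	ret = add_patterns_from_file_to_list(sparse_file, "", 0, pl,
					     repo->index);
	free(sparse_file);
	return ret;
}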
Then I ran:

  git-grep --threads=1 --recurse-submodules -E "config_[a-z]+\(" $(cat /tmp/trees)

where /tmp/trees contained about 120 trees in the said repository
(again, probably an unrealistic case, for testing purposes only).
Measuring the time spent only inside the function I created to load a
sparse-checkout file for a given 'struct repository' (sketched above),
I got the following numbers:

  Number of calls: 1531 (makes sense: ~120 trees and 14 submodules)
  Percentage of the total time: 0.015%
  Number of matches: 300897

Using 8 threads, I got the same numbers except for the percentage,
which was a little higher: 0.05%. So, indeed, the overhead of
re-loading the files is insignificant, and my cache idea was a
premature and unnecessary optimization.

> > So my next idea was to implement a cache, mapping 'struct
> > repository's to 'struct pattern_list's. Well, not 'struct
> > repository' itself, but repo->gitdir. This way, we could load each
> > file once, store the pattern list, and quickly retrieve the one
> > that affects the repository currently being grepped, whether it is
> > a submodule or not. But is gitdir unique per repository? If not,
> > could we use repo_git_path(repo, "info/sparse-checkout") as the
> > key?
> >
> > I already have a prototype implementation of the last idea (using
> > repo_git_path()). But I wanted to make sure: does this seem like a
> > good path? Or should we avoid the work of having this hashmap here
> > and do something else, such as adding a 'struct pattern_list' to
> > 'struct repository' directly?
>
> Honestly, it sounds a bit like premature optimization to me. Sorry
> if that's disappointing since you've apparently already put some
> effort into this, and it sounds like you're on a good track for
> optimizing this if it were necessary, but I'm just having a hard
> time figuring out whether it'd really help and be worth the code
> complexity.

No problem! I'm glad to have this feedback now, while I'm still
working on v2 :) Now I can focus on what's really relevant. So thanks
again!

[1]: https://github.com/surevine/Metre
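For reference, the cache prototype mentioned above was shaped roughly
like this. Again a simplified sketch, not the exact code: the real
prototype keyed a hashmap by repo_git_path(repo,
"info/sparse-checkout"), while the sketch below uses a sorted
string_list with a 'util' pointer just to keep the example short, and
it reuses the hypothetical load_sparse_checkout_patterns() helper
from the earlier sketch:

#include "cache.h"
#include "dir.h"
#include "repository.h"
#include "string-list.h"

static struct string_list sparse_cache = STRING_LIST_INIT_DUP;

static struct pattern_list *get_sparse_patterns(struct repository *repo)
{
	char *key = repo_git_path(repo, "info/sparse-checkout");
	struct string_list_item *item;

	/*
	 * NEEDSWORK: not thread-safe; git-grep's worker threads would
	 * need locking around this lookup-or-insert sequence.
	 */
	item = string_list_lookup(&sparse_cache, key);
	if (!item) {
		struct pattern_list *pl = xcalloc(1, sizeof(*pl));

		if (load_sparse_checkout_patterns(repo, pl) < 0) {
			free(pl);
			pl = NULL; /* NULL: no usable sparse-checkout file */
		}
		item = string_list_insert(&sparse_cache, key);
		item->util = pl;
	}
	free(key);
	return item->util;
}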