Re: [PATCH v3 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 9/1/2022 8:28 PM, Victoria Dye wrote:
> Shaoxuan Yuan wrote:
>> Before this patch, whenever --sparse is used, `git-grep` utilizes the
>> ensure_full_index() method to expand the index and search all the
>> entries. Because this method requires walking all the trees and
>> constructing the index, it is the slow part within the whole command.
>>
>> To achieve better performance, this patch uses grep_tree() to search the
>> sparse directory entries and get rid of the ensure_full_index() method.
>>
>> Why grep_tree() is a better choice over ensure_full_index()?
>>
>> 1) grep_tree() is as correct as ensure_full_index(). grep_tree() looks
>>    into every sparse-directory entry (represented by a tree) recursively
>>    when looping over the index, and the result of doing so matches the
>>    result of expanding the index.
>>
>> 2) grep_tree() utilizes pathspecs to limit the scope of searching.
>>    ensure_full_index() always expands the index when --sparse is used,
>>    that means it will always walk all the trees and blobs in the repo
>>    without caring if the user only wants a subset of the content, i.e.
>>    using a pathspec. On the other hand, grep_tree() will only search
>>    the contents that match the pathspec, and thus possibly walking fewer
>>    trees.
>>
>> 3) grep_tree() does not construct and copy back a new index, while
>>    ensure_full_index() does. This also saves some time.
> 
> Would you mind adding some 'ensure_not_expanded' cases to 't1092' to codify
> this (probably in the 'grep is not expanded' test created in patch 2)? If
> I'm understanding this patch correctly, you've updated 'git grep' so that it
> *never* needs to expand the index. In that case, it would be good to
> exercise a bunch of 'git grep' options (pathspecs inside and outside the
> sparse cone, wildcard pathspecs, etc.) to confirm that.

Sure!

>>
>> ----------------
>> Performance test
>>
>> - Summary:
>>
>> p2000 tests demonstrate a ~91% execution time reduction for
>> `git grep --cached --sparse <pattern> -- <pathspec>` using tree-walking
>> logic.
>>
>> Test                                                                          HEAD~   HEAD
>> ---------------------------------------------------------------------------------------------------
>> 2000.78: git grep --cached --sparse bogus -- f2/f1/f1/builtin/* (full-v3)     0.11    0.09 (≈)
>> 2000.79: git grep --cached --sparse bogus -- f2/f1/f1/builtin/* (full-v4)     0.08    0.09 (≈)
>> 2000.80: git grep --cached --sparse bogus -- f2/f1/f1/builtin/* (sparse-v3)   0.44    0.04 (-90.9%)
>> 2000.81: git grep --cached --sparse bogus -- f2/f1/f1/builtin/* (sparse-v4)   0.46    0.04 (-91.3%)
> 
> These are fantastic results!
> 
>>
>> - Command used for testing:
>>
>> 	git grep --cached --sparse bogus -- f2/f1/f1/builtin/*
>>
>> The reason for specifying a pathspec is that, if we don't specify a
>> pathspec, then grep_tree() will walk all the trees and blobs to find the
>> pattern, and the time consumed doing so is not too different from using
>> the original ensure_full_index() method, which also spends most of the
>> time walking trees. However, when a pathspec is specified, this latest
>> logic will only walk the area of trees enclosed by the pathspec, and the
>> time consumed is reasonably a lot less.
>>
>> That is, if we don't specify a pathspec, the performance difference [1]
>> is quite small: both methods walk all the trees and take generally same
>> amount of time (even with the index construction time included for
>> ensure_full_index()).
> 
> This makes sense, thanks for the thorough explanation of the results.
> 
>>
>> [1] Performance test result without pathspec:
>>
>> 	Test                                                    HEAD~  HEAD
>> 	-----------------------------------------------------------------------------
>> 	2000.78: git grep --cached --sparse bogus (full-v3)     6.17   5.19 (≈)
>> 	2000.79: git grep --cached --sparse bogus (full-v4)     6.19   5.46 (≈)
>> 	2000.80: git grep --cached --sparse bogus (sparse-v3)   6.57   6.44 (≈)
>> 	2000.81: git grep --cached --sparse bogus (sparse-v4)   6.65   6.28 (≈)
>>
>> Suggested-by: Derrick Stolee <derrickstolee@xxxxxxxxxx>
>> Helped-by: Derrick Stolee <derrickstolee@xxxxxxxxxx>
>> Helped-by: Victoria Dye <vdye@xxxxxxxxxx>
>> Signed-off-by: Shaoxuan Yuan <shaoxuan.yuan02@xxxxxxxxx>
>> ---
>>  builtin/grep.c                    | 32 ++++++++++++++++++++++++++-----
>>  t/perf/p2000-sparse-operations.sh |  1 +
>>  2 files changed, 28 insertions(+), 5 deletions(-)
>>
>> diff --git a/builtin/grep.c b/builtin/grep.c
>> index a0b4dbc1dc..8c0edccd8e 100644
>> --- a/builtin/grep.c
>> +++ b/builtin/grep.c
>> @@ -522,9 +522,6 @@ static int grep_cache(struct grep_opt *opt,
>>  	if (repo_read_index(repo) < 0)
>>  		die(_("index file corrupt"));
>>  
>> -	if (grep_sparse)
>> -		ensure_full_index(repo->index);
>> -
>>  	for (nr = 0; nr < repo->index->cache_nr; nr++) {
>>  		const struct cache_entry *ce = repo->index->cache[nr];
>>  
>> @@ -537,8 +534,26 @@ static int grep_cache(struct grep_opt *opt,
>>  
>>  		strbuf_setlen(&name, name_base_len);
>>  		strbuf_addstr(&name, ce->name);
>> +		if (S_ISSPARSEDIR(ce->ce_mode)) {
>> +			enum object_type type;
>> +			struct tree_desc tree;
>> +			void *data;
>> +			unsigned long size;
>> +			struct strbuf base = STRBUF_INIT;
>> +
>> +			strbuf_addstr(&base, ce->name);
>> +
>> +			data = read_object_file(&ce->oid, &type, &size);
>> +			init_tree_desc(&tree, data, size);
>>  
>> -		if (S_ISREG(ce->ce_mode) &&
>> +			/*
>> +			 * sneak in the ce_mode using check_attr parameter
>> +			 */
>> +			hit |= grep_tree(opt, pathspec, &tree, &base,
>> +					 base.len, ce->ce_mode);
>> +			strbuf_release(&base);
>> +			free(data);
>> +		} else if (S_ISREG(ce->ce_mode) &&
>>  		    match_pathspec(repo->index, pathspec, name.buf, name.len, 0, NULL,
>>  				   S_ISDIR(ce->ce_mode) ||
>>  				   S_ISGITLINK(ce->ce_mode))) {
>> @@ -598,7 +613,14 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>>  		int te_len = tree_entry_len(&entry);
>>  
>>  		if (match != all_entries_interesting) {
>> -			strbuf_addstr(&name, base->buf + tn_len);
>> +			if (S_ISSPARSEDIR(check_attr)) {
>> +				// object is a sparse directory entry
>> +				strbuf_addbuf(&name, base);
>> +			} else {
>> +				// object is a commit or a root tree
>> +				strbuf_addstr(&name, base->buf + tn_len);
>> +			}
> 
> Hmm, I'm not entirely sure I follow what's going on with 'name'. I'll try to
> talk myself through it.
> 
> Stepping back a bit in the context of 'grep_tree()': the goal of the
> function is, given a tree descriptor 'tree', to recursively scan the tree to
> find any 'grep' matches within items matching 'pathspec'. It is also called
> with a strbuf 'base', a length 'tn_len', and a boolean 'check_attr'; it's
> not immediately clear to me what those args are or what they do. What I can

I was confused for quite a while about the meaning of these args, too.

I think 'base' is the object's ref or SHA, e.g. HEAD, HEAD~, or a <SHA>.
Before this patch, the object was expected to be a root tree or a
commit. I _think_ 'base' can also be "<submodule>/", e.g. "sub/" when
grepping a submodule.

'tn_len' stands for "tree_name_len"?

'check_attr', as you wrote below, is for "commit or not", at lease that
was all its use case before this patch.

> see is that:
> 
> - 'check_attr' is true iff the "tree" being grepped is actually a commit. 

I think this is correct. Though as Derrick Stolee said here [1], this
patch is abusing the 'check_attr' (passing 'ce_mode' through it), and if
that caused any confusions, my apologies.

[1]
https://lore.kernel.org/git/e74b326d-ce4a-31c3-5424-e35858cdb569@xxxxxxxxxx

> - both non-recursive callers ('grep_object()' and 'grep_submodule()') call
>   'grep_tree()' with 'tn_len == base.len'.
> 
> Stepping into 'grep_tree()', we iterate over the entries *inside of* 'tree'.
> We assign the length of the tree entry's path to 'te_len'. Notably, a tree
> entry's path *not* the path from the root of the repo to the entry - it's
> just the filename of the entry (e.g., for entry 'folder1/a', the path is
> 'a').

Yes.

> Next, we skip the first 'tn_len' characters of 'base->buf' and assign that
> value to 'name'. Because 'tn_len == base.len', for this first iteration,
> it's an empty string. Then, we check if the tree entry is interesting with
> path 'name'. But 'name' is an empty string, so 'tree_entry_interesting()'
> thinks the tree entry is at the root of the repository, even if it isn't!

Yes, that is the reason why it kept ignoring sub-root-level trees: the
pathspec can never match a tree that is not at root level if this
root-level assumption exists.

> At this point, I think I've figured out what the deal with 'base' is. Before
> this patch, only 'grep_object()' and 'grep_submodule()'. In the former case,
> it's either "<objectname>:", or empty; in the latter, it's the path to the
> submodule. Both of those are things you'd want to skip to get the correct

Yep, this resonates with my reply above!

> path to the tree entry for 'tree_entry_interesting()', but it isn't true in
> your case; you need the path from the repository root to your tree for
> 'tree_entry_interesting()' to work properly. 

Well said! I think this phrasing is very accurate.

> Based on all of that, I *think* you can drop the 'check_attr' changes to
> 'grep_tree()' and update how you provide 'base' and 'tn_len' so
> 1) 'base' is the path to the tree root, and 2) 'tn_len' is 0 so that full
> path is provided to 'tree_entry_interesting()':
> 
> ----->8----->8----->8----->8----->8----->8----->8----->8----->8----->8-----
> diff --git a/builtin/grep.c b/builtin/grep.c
> index 8c0edccd8e..85c83190f1 100644
> --- a/builtin/grep.c
> +++ b/builtin/grep.c
> @@ -546,11 +546,7 @@ static int grep_cache(struct grep_opt *opt,
>  			data = read_object_file(&ce->oid, &type, &size);
>  			init_tree_desc(&tree, data, size);
>  
> -			/*
> -			 * sneak in the ce_mode using check_attr parameter
> -			 */
> -			hit |= grep_tree(opt, pathspec, &tree, &base,
> -					 base.len, ce->ce_mode);
> +			hit |= grep_tree(opt, pathspec, &tree, &base, 0, 0);
>  			strbuf_release(&base);
>  			free(data);
>  		} else if (S_ISREG(ce->ce_mode) &&
> @@ -613,14 +609,6 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>  		int te_len = tree_entry_len(&entry);
>  
>  		if (match != all_entries_interesting) {
> -			if (S_ISSPARSEDIR(check_attr)) {
> -				// object is a sparse directory entry
> -				strbuf_addbuf(&name, base);
> -			} else {
> -				// object is a commit or a root tree
> -				strbuf_addstr(&name, base->buf + tn_len);
> -			}
> -
>  			match = tree_entry_interesting(repo->index,
>  						       &entry, &name,
>  						       0, pathspec);
> -----8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<----- 

Thank you for this diff! I think this is also what Derrick suggested in
his review. In fact, this approach is more to the root of the problem:
the expected format of the path/base.

> I still find all of this confusing, and it's possible I'm still not properly
> understanding how 'name' and 'tn_len' are supposed to be used. Regardless, I
> *am* fairly certain that finding the right values for those args is the
> going to be the cleanest (and least fragile) way to handle sparse
> directories, rather than using the 'check_attr' arg for something it isn't.

Right.

> It might take some time + lots of debugging/experimenting, but it's really
> important that the implementation you settle on is something you (and,
> ideally, the readers of your patches) confidently and completely understand,
> rather than something that seems to work but doesn't have a clear
> explanation. As always, I'm happy to help if you'd like another set of eyes
> on the problem!

Right. I admit that the approach I was taking is pretty shady. The way
suggested by you and Derrick is more explainable and to-the-point.
Lesson learned!

Thanks,
Shaoxuan




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux