Re: [RFC PATCH 0/3] grep: don't add subrepos to in-memory alternates

Junio C Hamano <gitster@xxxxxxxxx> · Wed, 18 Sep 2019 12:55:12 -0700

Matheus Tavares <matheus.bernardino@xxxxxx> writes:

> Make git-grep --recurse-submodules stop adding subrepos to the in-memory
> alternates list and, instead, pass a reference to the subrepo struct
> down to the threads.

Nice.  This is done by updating all the codepaths used by grep to
use the lower-level helper functions that can take a repository
instance and/or an object store instance that is not the one tied to
the top-level repository?  Quite nice.

> - textconv cache is written to the_repository's object database even for
>   submodules. Should it perhaps be written to submodules' odb instead?

You mention "is written", but that is what happens upon a cache
miss.  Before the code notices a cache miss, it must be checking if
a cached result is available.  In which odb is it done?  Writing
that follow the miss should happen to the same one, or the cache is
not very effective.

Because you would want the cache to be effective, after running "git
grep --recurse-submodules" from the top-level, when you chdir down
to the submodule and say "git grep" to dig further, the answer to
your question is most likely "yes".

> - Considering the following call chain: grep_source_load_driver() >
>   userdiff_find_by_path() > git_check_attr() > collect_some_attrs() >
>   prepare_attr_stack() > bootstrap_attr_stack():
>
>   * The last function tries to read the attributes from the
>     .gitattributes and .git/info/attributes files of the_repository.
>     However, for paths inside the submodule, shouldn't it try to read
>     these files from the submodule?

Yes, I think all of those would have worked correctly if we forked a
separate "git grep" inside submodule repository, but in the rush to
"do everything in process", many things like these are not done
correctly.  As there is only one attribute cache IIUC, invalidating
the whole cache for the top-level and replacing it with the one for
a submodule, every time we cross the module boundary, would probably
have a negative effect on the performance, and I am not sure what
would happen if you run more than one threads working in different
repositories (i.e. top-level and submodules).

>   * This function will also call: read_attr() > read_attr_from_index() >
>     read_blob_data_from_index() which might, in turn, call
>     read_object_file(). Shouldn't we pass the subrepo to it so that it
>     can call repo_read_object_file()? (Again, for paths inside the
>     submodule, read_object_file() won't be able to find the object as
>     we won't be adding to alternates anymore.)

Ditto.

Thanks.