Re: [RFC PATCH 0/3] grep: don't add subrepos to in-memory alternates

Matheus Tavares Bernardino <matheus.bernardino@xxxxxx> · Thu, 19 Sep 2019 02:18:57 -0300

On Wed, Sep 18, 2019 at 4:55 PM Junio C Hamano <gitster@xxxxxxxxx> wrote:
>
> Matheus Tavares <matheus.bernardino@xxxxxx> writes:
>
[...]
> > - textconv cache is written to the_repository's object database even for
> >   submodules. Should it perhaps be written to submodules' odb instead?
>
> You mention "is written", but that is what happens upon a cache
> miss.  Before the code notices a cache miss, it must be checking if
> a cached result is available.  In which odb is it done?  Writing
> that follow the miss should happen to the same one, or the cache is
> not very effective.

Right. Both writing and reading operations on the textconv cache
always happen in the_repository's odb.

> Because you would want the cache to be effective, after running "git
> grep --recurse-submodules" from the top-level, when you chdir down
> to the submodule and say "git grep" to dig further, the answer to
> your question is most likely "yes".

OK, makes sense. To keep this series simple, I think I'll try to work
on this in a following patchset.

> > - Considering the following call chain: grep_source_load_driver() >
> >   userdiff_find_by_path() > git_check_attr() > collect_some_attrs() >
> >   prepare_attr_stack() > bootstrap_attr_stack():
> >
> >   * The last function tries to read the attributes from the
> >     .gitattributes and .git/info/attributes files of the_repository.
> >     However, for paths inside the submodule, shouldn't it try to read
> >     these files from the submodule?
>
> Yes, I think all of those would have worked correctly if we forked a
> separate "git grep" inside submodule repository, but in the rush to
> "do everything in process", many things like these are not done
> correctly.  As there is only one attribute cache IIUC, invalidating
> the whole cache for the top-level and replacing it with the one for
> a submodule, every time we cross the module boundary, would probably
> have a negative effect on the performance, and I am not sure what
> would happen if you run more than one threads working in different
> repositories (i.e. top-level and submodules).

Hmm, I may have gotten a little confused here. Are you talking about
the attributes stack (which contains .gitattributes and
info/attributes)? If so, isn't this stack already rebuild for every
path? I mean, by the previous call chain it seems to me that at least
these two files are reread for every path.

> >   * This function will also call: read_attr() > read_attr_from_index() >
> >     read_blob_data_from_index() which might, in turn, call
> >     read_object_file(). Shouldn't we pass the subrepo to it so that it
> >     can call repo_read_object_file()? (Again, for paths inside the
> >     submodule, read_object_file() won't be able to find the object as
> >     we won't be adding to alternates anymore.)

Just as a side note, I noticed another problem with this: in a
submodule with a non-checked-out .gitattributes, `git grep --textconv`
will successfully retrieve the file from the index. But running `git
grep --recurse-submodules --textconv` from the superproject won't. The
problem is that read_blob_data_from_index() doesn't have the
repository struct and thus cannot strip the subrepo_prefix from the
path (so it won't find it in the subrepo's index). I'll try to fix
this in this series as well.