Re: [PATCH v4 1/4] implement submodule config cache for lookup of submodule names

Junio C Hamano <gitster@xxxxxxxxx> · Tue, 02 Jun 2015 12:57:08 -0700

Heiko Voigt <hvoigt@xxxxxxxxxx> writes:

> This submodule configuration cache allows us to lazily read .gitmodules
> configurations by commit into a runtime cache which can then be used to
> easily look up values from it. Currently only the values for path or name
> are stored but it can be extended for any value needed.
>
> It is expected that .gitmodules files do not change often between
> commits. That's why we look up the .gitmodules sha1 from a commit and then
> either look up an already parsed configuration or parse and cache an
> unknown one for each sha1. The cache is lazily built on demand for each
> requested commit.
>
> This cache can be used for all purposes which need knowledge about
> submodule configurations. Example use cases are:
>
>  * Recursive submodule checkout needs to look up a submodule name from its
>    path when a submodule first appears. This needs to be done before this
>    configuration exists in the worktree.
>
>  * The implementation of submodule support for 'git archive' needs to
>    look up the submodule name to generate the archive when given a
>    revision that is not checked out.
>
>  * 'git fetch' when given the --recurse-submodules=on-demand option (or
>    configuration) needs to look up submodule names by path from the
>    database rather than reading from the worktree. For a new submodule it
>    needs to look up the name from its path to allow cloning new
>    submodules into the .git folder so they can be checked out without
>    any network interaction when the user does a checkout of that
>    revision.

What is unclear to me after reading the above twice is what this
thing is meant to achieve.  Is it efficiency, by doing lazy lookups
and caching to avoid asking the same thing more than once from
either the filesystem or read_sha1_file()?  Is it expected that
reading through this "cache" will be the _only_ way callers would
interact with the .gitmodules data, or is it an opt-in feature that
callers who do not see the benefit can safely ignore and bypass?
(Why they might want to bypass it is totally unclear, because what
the "cache" system wants to achieve is itself unclear.)

Because the above talks about looking up ".gitmodules from a
commit", I am guessing that the "commit" used as one of the lookup
keys throughout the system is a commit in the superproject, not from
submodules, but you may want to state that more explicitly.

What, if anything, should be done for a .gitmodules that is not yet
committed?  Are there cases where callers that usually interact
with .gitmodules via this "cache" system need to use .gitmodules
immediately after adding a new submodule but before committing that
change to the superproject?  Do they code something like this?

	if (cached)
		read .gitmodules from the index and fabricate
		struct submodule;
	else if (worktree)
		read .gitmodules from the working tree and fabricate
		struct submodule;
	else
		call submodule_from_name("HEAD", ...) and receive
		struct submodule;

	use the struct submodule to learn from the module;

Yes, I am wondering if submodule_from_name() should be extended to
cover the former two cases, so that the caller can replace the above
with a single call and then use the resulting "struct submodule"
throughout its code.  I am also hoping that the answer to my
questions above is "This is not just an opt-in 'cache' API, but
we want to make it the unified API for C code to learn about what is
in .gitmodules".
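
To illustrate what I mean by a single call, here is a rough sketch.
Everything in it is made up for illustration and is not part of the
patch: the enum, the submodule_lookup_by_name() function, the two
fields of "struct submodule", and the static tables that stand in
for already parsed .gitmodules data.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical: which version of .gitmodules the caller wants. */
enum gitmodules_source {
	GITMODULES_FROM_INDEX,
	GITMODULES_FROM_WORKTREE,
	GITMODULES_FROM_COMMIT
};

/* Hypothetical layout; the patch only says path and name are stored. */
struct submodule {
	const char *name;
	const char *path;
};

/* Stand-ins for parsed .gitmodules data, one table per source. */
static const struct submodule in_worktree[] = {
	{ "lib", "vendor/lib" },
	{ "new", "vendor/new" },	/* just added, not yet committed */
};
static const struct submodule in_index[] = {
	{ "lib", "vendor/lib" },
};
static const struct submodule in_commit[] = {
	{ "lib", "vendor/lib" },
};

#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* One entry point for all three cases; returns NULL when not found. */
const struct submodule *submodule_lookup_by_name(enum gitmodules_source src,
						 const char *name)
{
	const struct submodule *table;
	size_t nr, i;

	assert(name);
	switch (src) {
	case GITMODULES_FROM_INDEX:
		table = in_index;
		nr = ARRAY_LEN(in_index);
		break;
	case GITMODULES_FROM_WORKTREE:
		table = in_worktree;
		nr = ARRAY_LEN(in_worktree);
		break;
	case GITMODULES_FROM_COMMIT:
		table = in_commit;
		nr = ARRAY_LEN(in_commit);
		break;
	default:
		return NULL;
	}

	for (i = 0; i < nr; i++)
		if (!strcmp(table[i].name, name))
			return &table[i];
	return NULL;
}
```

With something of this shape, the if/else chain in the caller
collapses into choosing the source once, and a submodule that exists
only in the working tree can still be found.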

> diff --git a/Documentation/technical/api-submodule-config.txt b/Documentation/technical/api-submodule-config.txt
> new file mode 100644
> index 0000000..2ff4907
> --- /dev/null
> +++ b/Documentation/technical/api-submodule-config.txt
> @@ -0,0 +1,46 @@
> +submodule config cache API
> +==========================
> +
> +The submodule config cache API allows to read submodule
> +configurations/information from specified revisions. Internally
> +information is lazily read into a cache that is used to avoid
> +unnecessary parsing of the same .gitmodule files. Lookups can be done by
> +submodule path or name.
> +
> +Usage
> +-----
> +
> +The caller can look up information about submodules by using the
> +`submodule_from_path()` or `submodule_from_name()` functions. They return
> +a `struct submodule` which contains the values. The API automatically
> +initializes and allocates the needed infrastructure on-demand.
> +
> +If the internal cache might grow too big or when the caller is done with
> +the API, all internally cached values can be freed with submodule_free().
> +
> +Data Structures
> +---------------
> +
> +`struct submodule`::
> +
> +	This structure is used to return the information about one
> +	submodule for a certain revision. It is returned by the lookup
> +	functions.

Hopefully this will not stay an opaque structure as we read later
patches ;-).

> +Functions
> +---------
> +
> +`void submodule_free()`::
> +
> +	Use these to free the internally cached values.

"These" meaning "this single function", or are there variants of it?

> diff --git a/submodule-config.c b/submodule-config.c
> new file mode 100644
> index 0000000..97f4a04
> --- /dev/null
> +++ b/submodule-config.c
> @@ -0,0 +1,445 @@
> +#include "cache.h"
> +#include "submodule-config.h"
> +#include "submodule.h"
> +#include "strbuf.h"
> +
> +/*
> + * submodule cache lookup structure
> + * There is one shared set of 'struct submodule' entries which can be
> + * looked up by their sha1 blob id of the .gitmodule file and either
> + * using path or name as key.
> + * for_path stores submodule entries with path as key
> + * for_name stores submodule entries with name as key
> + */
> +struct submodule_cache {
> +	struct hashmap for_path;
> +	struct hashmap for_name;
> +};
> +
> +/*
> + * thin wrapper struct needed to insert 'struct submodule' entries to
> + * the hashmap
> + */
> +struct submodule_entry {
> +	struct hashmap_entry ent;
> +	struct submodule *config;
> +};

The above, and the singleton-ness of the "cache", implies that we
can have only one "struct submodule" for a given path (or a name).
Does that mean the subsystem implicitly is tied to a single commit
at the superproject level?

What happens when I call submodule_from_path() for a single
submodule at one commit in the superproject, and then ask about that
same submodule for another commit in the superproject, which may
have a different version of .gitmodules, by calling the same
function again?

	Side note: I think I know the answer to these questions,
	after reading the hash function.  for_path does not store
	submodule entries with path as key.  It uses the commit and
	the path as a combined key, so both HEAD:.gitmodules and
	HEAD^:.gitmodules can be cached and looked up separately if
	their contents are different.  The comment and field names
	of "struct submodule_cache" may want to be improved.
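
	A minimal sketch of such a combined key follows.  It is not
	the code from the patch: the struct, the function names and
	the use of FNV-1a are my assumptions, and I use the object
	id of the .gitmodules blob (as the comment in the patch
	says) as the first half of the key.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ID_LEN 20	/* length of a SHA-1 object id in bytes */

/* Hypothetical key: which .gitmodules, and which path inside it. */
struct cache_key {
	unsigned char gitmodules_id[ID_LEN];
	const char *path;
};

/* Fold LEN bytes from BUF into HASH using 32-bit FNV-1a. */
static unsigned int fnv1a(unsigned int hash, const void *buf, size_t len)
{
	const unsigned char *p = buf;

	while (len--) {
		hash ^= *p++;
		hash *= 16777619u;
	}
	return hash;
}

/* Hash both halves of the key so neither alone decides the bucket. */
unsigned int cache_key_hash(const struct cache_key *key)
{
	unsigned int hash = 2166136261u;	/* FNV offset basis */

	assert(key && key->path);
	hash = fnv1a(hash, key->gitmodules_id, ID_LEN);
	return fnv1a(hash, key->path, strlen(key->path));
}

/* Two keys match only if both the object id and the path match. */
int cache_key_equal(const struct cache_key *a, const struct cache_key *b)
{
	return !memcmp(a->gitmodules_id, b->gitmodules_id, ID_LEN) &&
	       !strcmp(a->path, b->path);
}
```

	With a key like this, the same path under two different
	versions of .gitmodules compares unequal, so both entries
	can live in one hashmap.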

When do we evict the cache?  I am wondering what would happen when
you do "git log --recursive" at the superproject level, which may
grow the cache in an unbounded way without some eviction policy.
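
The crudest policy I can think of is "flush everything once the
cache holds more than N entries", which is roughly what calling
submodule_free() from time to time would amount to.  A sketch, with
the struct, the limit and the function names all hypothetical:

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LIMIT 4	/* arbitrary bound, for illustration only */

/* Hypothetical bookkeeping; a real cache would also own the entries. */
struct bounded_cache {
	size_t nr;	/* entries currently held */
	size_t flushes;	/* how many times everything was dropped */
};

/* Drop every entry, like a whole-cache free would. */
static void cache_flush(struct bounded_cache *c)
{
	c->nr = 0;
	c->flushes++;
}

/* Account for one new entry, flushing first if the cache is full. */
void cache_note_insert(struct bounded_cache *c)
{
	assert(c);
	if (c->nr >= CACHE_LIMIT)
		cache_flush(c);
	c->nr++;
}
```

This never holds more than CACHE_LIMIT entries, at the cost of
reparsing after each flush; something smarter (e.g. LRU) may or may
not be worth the complexity.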