Heiko Voigt <hvoigt@xxxxxxxxxx> writes: > This submodule configuration cache allows us to lazily read .gitmodules > configurations by commit into a runtime cache which can then be used to > easily lookup values from it. Currently only the values for path or name > are stored but it can be extended for any value needed. > > It is expected that .gitmodules files do not change often between > commits. Thats why we lookup the .gitmodules sha1 from a commit and then > either lookup an already parsed configuration or parse and cache an > unknown one for each sha1. The cache is lazily build on demand for each > requested commit. > > This cache can be used for all purposes which need knowledge about > submodule configurations. Example use cases are: > > * Recursive submodule checkout needs lookup a submodule name from its > path when a submodule first appears. This needs be done before this > configuration exists in the worktree. > > * The implementation of submodule support for 'git archive' needs to > lookup the submodule name to generate the archive when given a > revision that is not checked out. > > * 'git fetch' when given the --recurse-submodules=on-demand option (or > configuration) needs to lookup submodule names by path from the > database rather than reading from the worktree. For new submodule it > needs to lookup the name from its path to allow cloning new > submodules into the .git folder so they can be checked out without > any network interaction when the user does a checkout of that > revision. What is unclear to me after reading the above twice is what this thing is meant to achieve. Is it efficiency by doing lazy lookups and caching to avoid asking the same thing more than once from either the filesystem or read_sha1_file()? Is it expected that reading through this "cache" will be the _only_ way callers would interact with the .gitmodules data, or is it an opt-in feature that some callers that do not see the benefit (why they may want to ignore is totally unclear, because what the "cache" system wants to achieve is) can safely ignore and bypass? Because the above talks about looking up ".gitmodules from a commit", I am guessing that the "commit" used as one of the lookup keys throughout the system is a commit in the superproject, not from submodules, but you may want to state that more explicitly. What, if anything, should be done for .gitmodules that are not yet committed? Are there cases that the callers that usually interact with .gitmodules via this "cache" system need to use .gitmodules immediately after adding a new submodule but before committing that change to the superproject? Do they code something like this? if (cached) read .gitmodules from the index and fabricate struct submodule; else if (worktree) read .gitmodules from the working tree and fabricate struct submodule; else call submodule_from_name("HEAD", ...) and receive struct submodule; use the struct submodule to learn from the module; Yes, I am wondering if submodule_from_name() should be extended to allow the former two cases, so that the caller can make a single call above and then use resulting "struct submodule" throughout its code after doing so. And I also am hoping that the answer to my questions above to be "This is not just an opt-in 'cache' API, but we want to make it the unified API for C code to learn about what is in .gitmodule". > diff --git a/Documentation/technical/api-submodule-config.txt b/Documentation/technical/api-submodule-config.txt > new file mode 100644 > index 0000000..2ff4907 > --- /dev/null > +++ b/Documentation/technical/api-submodule-config.txt > @@ -0,0 +1,46 @@ > +submodule config cache API > +========================== > + > +The submodule config cache API allows to read submodule > +configurations/information from specified revisions. Internally > +information is lazily read into a cache that is used to avoid > +unnecessary parsing of the same .gitmodule files. Lookups can be done by > +submodule path or name. > + > +Usage > +----- > + > +The caller can look up information about submodules by using the > +`submodule_from_path()` or `submodule_from_name()` functions. They return > +a `struct submodule` which contains the values. The API automatically > +initializes and allocates the needed infrastructure on-demand. > + > +If the internal cache might grow too big or when the caller is done with > +the API, all internally cached values can be freed with submodule_free(). > + > +Data Structures > +--------------- > + > +`struct submodule`:: > + > + This structure is used to return the information about one > + submodule for a certain revision. It is returned by the lookup > + functions. Hopefully this will not stay an opaque structure as we read later patches ;-). > +Functions > +--------- > + > +`void submodule_free()`:: > + > + Use these to free the internally cached values. "These" meaning "this single function", or are there variants of it? > diff --git a/submodule-config.c b/submodule-config.c > new file mode 100644 > index 0000000..97f4a04 > --- /dev/null > +++ b/submodule-config.c > @@ -0,0 +1,445 @@ > +#include "cache.h" > +#include "submodule-config.h" > +#include "submodule.h" > +#include "strbuf.h" > + > +/* > + * submodule cache lookup structure > + * There is one shared set of 'struct submodule' entries which can be > + * looked up by their sha1 blob id of the .gitmodule file and either > + * using path or name as key. > + * for_path stores submodule entries with path as key > + * for_name stores submodule entries with name as key > + */ > +struct submodule_cache { > + struct hashmap for_path; > + struct hashmap for_name; > +}; > + > +/* > + * thin wrapper struct needed to insert 'struct submodule' entries to > + * the hashmap > + */ > +struct submodule_entry { > + struct hashmap_entry ent; > + struct submodule *config; > +}; The above, and the singleton-ness of the "cache", implies that we can have only one "struct submodule" for a given path (or a name). Does that mean the subsystem implicitly is tied to a single commit at the superproject level? What happens when I call submodule_from_path() for a single submodule at one commit in the superproject, and then ask about that same submodule for another commit in the superproject, which may have a different version of .gitmodules, by calling the same function again? Side note: I think I know the answer to these questions, after reading the hash function. for_path does not store submodule entries with path as key. It uses the commit and the path as a combined key, so both HEAD:.gitmodules and HEAD^:.gitmodules can be cached and looked up separatedly if their contents are different. The comment and field names of "struct submodule_cache" may want to be improved. When do we evict the cache? I am wondering what would happen when you do "git log --recursive" at the superproject level, which may grow the cache in an unbounded way without some eviction policy. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html