Re: [PATCH v3 2/3] config: add hashtable for config parsing & retrieval

Karsten Blees <karsten.blees@xxxxxxxxx> · Wed, 25 Jun 2014 22:23:06 +0200

On 25.06.2014 20:13, Junio C Hamano wrote:
> Ramsay Jones <ramsay@xxxxxxxxxxxxxxxxxxx> writes:
> 
>> On 24/06/14 00:25, Junio C Hamano wrote:
>> ...
>>> Yup, that is a very good point.  There needs to be an infrastructure to
>>> tie a set of files (i.e. the standard one being the chain of
>>> system-global /etc/gitconfig to repo-specific .git/config, and any
>>> custom one that can be specified by the caller like submodule code)
>>> to a separate hashmap; a future built-in submodule code would use
>>> two hashmaps aka "config-caches", one to manage the usual
>>> "configuration" and the other to manage the contents of the
>>> .gitmodules file.
>>>
>>
>> I had expected to see one hash table per file/blob, with the three
>> standard config hash tables linked together to implement the scope/
>> priority rules. (Well, these could be merged into one, as the current
>> code does, since that makes handling "multi" keys slightly easier).
> 
> Again, good point.  I think a rough outline of a design that takes
> both
> 
>  (1) we may have to read two or more separate sets of "config like
>      things" (e.g. the contents from the normal config system and
>      the contents from the .gitmodules file) and
> 
>  (2) we may have to read two or more files that make up a logically
>      single set of "config-like things" (e.g. the "normal config
>      system" reads from three separate files)
> 
> into account may look like this:
> 
>  * Each instance of in-core "config-like things" is expressed as a
>    struct "config-set".
> 
>  * A "config-set" points at an ordered set of struct "config-file",
>    each of which represents what was read and cached in-core from a
>    file.

Is this additional complexity really necessary?

How would you handle included files? Split the including file into before/after parts? I.e.

  repo-config-file[include-to-end]
  included-file
  repo-config-file[top-to-include]
  user-config-file
  ...

Looking up a single-valued key would then be O(n) (where n is the number of struct config_file's in the config_set) rather than O(1).
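
To make the cost concrete, here is a minimal sketch of such a chained lookup. All names (chained_config_set, chained_get, the probes counter) are made up for illustration and are not part of the patch, and file_get() is a linear scan standing in for a per-file hash lookup:

```c
#include <stddef.h>
#include <string.h>

struct config_entry {
	const char *key;
	const char *value;
};

/* Hypothetical: what was read from one file (or one part of a split file). */
struct config_file {
	const struct config_entry *entries;
	size_t nr;
};

/* Hypothetical: ordered list of files, lowest priority first. */
struct chained_config_set {
	const struct config_file *files;
	size_t nr;
};

/* Stand-in for a per-file hash lookup; the last occurrence in a file wins. */
static const char *file_get(const struct config_file *f, const char *key)
{
	const char *found = NULL;
	size_t i;

	for (i = 0; i < f->nr; i++)
		if (!strcmp(f->entries[i].key, key))
			found = f->entries[i].value;
	return found;
}

/*
 * Walk the files from highest to lowest priority and return the first hit.
 * *probes is incremented once per file visited, so a miss costs n probes.
 */
static const char *chained_get(const struct chained_config_set *cs,
			       const char *key, size_t *probes)
{
	size_t i = cs->nr;

	while (i--) {
		const char *v;

		(*probes)++;
		v = file_get(&cs->files[i], key);
		if (v)
			return v;
	}
	return NULL;
}
```

A key that is only set in the system-wide file, or not set at all, has to visit every config_file in the set before the lookup can return.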

Looking up a multi-valued key would involve joining values from all files, every time the value is looked up (dynamically allocating lists on the heap etc.).

The configuration is typically loaded once, followed by lots of lookups. So from a performance perspective, doing the merging at load time is clearly the better choice.
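
A rough sketch of what merging at load time could look like, assuming a toy fixed-size open-addressing table instead of the real hashmap API (merged_config, merged_put and merged_get are hypothetical names, and only single-valued keys are handled):

```c
#include <assert.h>
#include <string.h>

#define SLOTS 64 /* toy fixed-size table; a real cache would grow */

struct merged_entry {
	const char *key;   /* borrowed pointers, not copied */
	const char *value;
};

struct merged_config {
	struct merged_entry slots[SLOTS];
	int nr;
};

static unsigned hash_key(const char *s)
{
	unsigned h = 5381;

	while (*s)
		h = h * 33 + (unsigned char)*s++;
	return h;
}

/*
 * Insert or override. The parser feeds entries in the order it reads them
 * (system, user, repo, with included files expanded in place), so the last
 * writer wins and no per-file bookkeeping is needed afterwards.
 */
static void merged_put(struct merged_config *mc, const char *key,
		       const char *value)
{
	unsigned i = hash_key(key) % SLOTS;

	while (mc->slots[i].key && strcmp(mc->slots[i].key, key))
		i = (i + 1) % SLOTS;
	if (!mc->slots[i].key) {
		assert(mc->nr < SLOTS - 1); /* keep one slot free for lookups */
		mc->slots[i].key = key;
		mc->nr++;
	}
	mc->slots[i].value = value;
}

/* One hash probe sequence, independent of how many files were read. */
static const char *merged_get(const struct merged_config *mc, const char *key)
{
	unsigned i = hash_key(key) % SLOTS;

	while (mc->slots[i].key) {
		if (!strcmp(mc->slots[i].key, key))
			return mc->slots[i].value;
		i = (i + 1) % SLOTS;
	}
	return NULL;
}
```

With this layout an include directive needs no special treatment: the included file's entries are simply put into the table at the point where the parser encounters them.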

> 
>  * When we know or notice that a single file on the filesystem was
>    modified, we do not have to invalidate the whole "config-set"
>    that depends on the file; the "config-file" that corresponds to
>    the file on the filesystem is invalidated instead.
> 

What's the use case for this? Do you expect e.g. 'git gc' to detect changed depth/window size at run time and adjust the algorithm accordingly? Or do you just intend to cache parsed config data (the latter could be done by recording all involved file names and stats in the config-set and reloading the whole thing if any of the files change)?
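
For the caching variant, something along these lines would probably be enough. This is only a sketch under my own assumptions: the names (config_sources, sources_record, sources_changed) are invented, it uses plain POSIX stat(), and comparing mtime, size and inode is just one plausible choice of change indicators:

```c
#include <string.h>
#include <sys/stat.h>

#define MAX_SOURCES 16 /* arbitrary limit for the sketch */

struct config_source {
	const char *path; /* borrowed pointer */
	int exists;       /* did the file exist when it was read? */
	struct stat st;   /* valid only if exists is nonzero */
};

struct config_sources {
	struct config_source src[MAX_SOURCES];
	int nr;
};

/* Remember a file that contributed to the config-set; -1 if the list is full. */
static int sources_record(struct config_sources *cs, const char *path)
{
	struct config_source *s;

	if (cs->nr >= MAX_SOURCES)
		return -1;
	s = &cs->src[cs->nr++];
	s->path = path;
	s->exists = !stat(path, &s->st);
	return 0;
}

/*
 * Return 1 if any recorded file appeared, vanished, or looks modified,
 * in which case the caller would drop and reload the whole config-set.
 */
static int sources_changed(const struct config_sources *cs)
{
	int i;

	for (i = 0; i < cs->nr; i++) {
		const struct config_source *s = &cs->src[i];
		struct stat now;
		int exists = !stat(s->path, &now);

		if (exists != s->exists)
			return 1;
		if (exists && (now.st_mtime != s->st.st_mtime ||
			       now.st_size != s->st.st_size ||
			       now.st_ino != s->st.st_ino))
			return 1;
	}
	return 0;
}
```

The loader would call sources_record() for every file it opens, includes and missing files alike, and a long-running process would check sources_changed() before trusting the cache.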
