Hi, I am one of the three selected GSoC(Google summer of Code) students by Git this year. I am a third year computer science student from India. First, I want to thank you all for guiding me through the patch review process and being patient with me. I learned a lot from you all and this helped me polishing my proposal, so I want to thank you all for giving me this opportunity. My mentors for the project are Matthieu Moy and Ramkumar Ramachandra. My proposal topic is "Git configuration API improvements" and I am appending the whole proposal below ad verbatim for you all to review. I had posted it earlier but it may have gotten lost in the noise. I am looking forward to having a great summer working on git. ##GSoC Proposal : Git configuration API improvements Abstract: ----------- The current method for reading values from the config file (summarized from reading config.c, bulitin/config.c and api-config.txt)in the disk is parsing and reading the file each time a function calls for a config value. git_config() feeds every variable and value to the callback function which decides what to do with them. Due to this approach, it is not uncommon for the configuration to be parsed several times during the run of a Git program, with different callbacks picking out different variables useful to themselves. I propose that instead of reading and parsing the config file multiple times from the disk during a process run, we will construct a in-memory representation of the config file the first time a config value is queried for. Also, the callbacks will be rewired to use the new API functions provided for efficient lookups from the internal cache. After this implementation is validated, new features can be added such as unsetting values from config file. #Proposed Improvements * Fix "git config --unset" to clean up detritus from sections that are left empty and some more aesthetic concerns illustrated below . * Read the configuration from files once and cache the results in an appropriate data structure in memory. * Change git_config() to iterate through the pre-read values in memory rather than re-reading the configuration files. * Add new API calls that allow the cache to be inquired easily and efficiently. * Rewrite callers to use the new API wherever possible. #Possible Problems with this approach, * How to invalidate the cache correctly in the case that the configuration is changed while `git` is executing. * What to do with the comments, which are generally found in config files, and can be places anywhere(above a section header , below a section header etc.)? #Future Improvements * Allow configuration values to be unset via a config file -------------------------------------------------------------------------- ##Changing the git_config api to retrieve values from memory -------------------------------------------------------------------------- #Current Implementation of the git-config subsystem: ---------------------------------------------------- Config files are parsed linearly, and each variable found is passed to a caller-provided callback function. The callback function is responsible for any actions to be taken on the config option, and is free to ignore some options. It is not uncommon for the configuration to be parsed several times during the run of a Git program, with different callbacks picking out different variables useful to themselves. A config callback function takes three parameters: - the name of the parsed variable. This is in canonical "flat" form: the section, subsection, and variable segments will be separated by dots, and the section and variable segments will be all lowercase. E.g., `core.ignorecase`, `diff.SomeType.textconv`. - the value of the found variable, as a string. If the variable had no value specified, the value will be NULL (typically this means it should be interpreted as boolean true). - a void pointer passed in by the caller of the config API; this can contain callback-specific data A config callback should return 0 for success, or -1 if the variable could not be parsed properly. #Current Config Querying: ----------------------- Most code in git,calls `git_config` to look up variables in all config files that Git knows about, using the normal precedence rules. To do this, it calls with a callback function and void data pointer. `git_config` will read all config sources in order of increasing priority. The logic flow is attached as a image[1] . Thus a callback should typically overwrite previously-seenentries with new ones (e.g., if both the user-wide `~/.gitconfig` and repo-specific `.git/config` contains `color.ui`, the config machinery will first feed the user-wide one to the callback, and then the repo-specific one; by overwriting, the higher-priority repo-specific value is left at the end). The current precedence order rates the repo-config file the highest. The `git_config_with_options` function lets the caller examine config while adjusting some of the default behavior of `git_config`. It takes two extra parameters: `filename`:: If this parameter is non-NULL, it specifies the name of a file to parse for configuration, rather than looking in the usual files. Regular `git_config` defaults to `NULL`. `respect_includes`:: Specify whether include directives should be followed in parsed files. Regular `git_config` defaults to `1`. There is a special version of `git_config` called `git_config_early`. This version takes an additional parameter to specify the repository config, instead of having it looked up via `git_path`. This is useful early in a Git program before the repository has been found. I have made a logic diagram for better understanding the query config subsystem, [1]http://imgur.com/CToJXTr #Writing config files to disk: ------------------------------ Git provides two internal functions for setting and unsetting variables in config files, *git_config_set(const char *key, const char *value) *git_config_set_in_file(const char *config_filename,const char *key, const char *value) who call git_config_set_multivar and git_config_set_multivar_in_file respectively. In the end git_config_set_multivar_in_file does all the work. For setting and unsetting values it checks the argument value. If value==NULL, unset in (remove from) config, if value_regex!=NULL, disregard key/value pairs where value does not match. if multi_replace==0, nothing, or only one matching key/value is replaced, else all matching key/values (regardless how many) are removed, before the new pair is written. This function does this: - it locks the config file by creating ".git/config.lock" - it then parses the config using store_aux() as validator to find the position on the key/value pair to replace. If it is to be unset, it must be found exactly once. - the config file is mmap()ed and the part before the match (if any) is written to the lock file, then the changed part and the rest. - the config file is removed and the lock file renamed to it. #New Approach and implementation details: ---------------------------------------- We parse the config file once, storing the raw values to records in memory. After the whole config has been read, iterate through the records, feeding the surviving values into the callback in the order they were originally read (minus deletions). Path to follow for the api conversion, 1. Convert the parser to read into an in-memory representation, but leave git_config() as a wrapper which iterates over it. 2. Add query functions like config_string_get() which will inquire cache for values efficiently. 3. Convert callbacks to query functions one by one. #Choosing a data structure for internal cache: ---------------------------------------------- Currently, the callbacks receives key,value pairs after parsing in a flat form like,"diff.sometype.textconv" and value "true" .These key-value pairs easily translate to a hashmap or a dictionary structure. This data-structure reflects the configuration variables themselves, not the way they appear in the config file. A map (hashtable) associating to each config variable the corresponding value (which maybe a scalar value,boolean or a list, depending on the variable) would be most appropriate in my opinion.They will be saved as "string":"string list" pairs as there are already many helper functions to help converting the string values into their appropriate data type. Advantages of using a Hashmap:- ------------------------------ >>Constant time lookups(O(1)) and deletions.After the initial parsing ,the system would use the hashmaps for lookups. >>Read-only accesses happen far more often than updates, so the data structure is optimised for lookups. >>Since the key,value pairs are string like for example, section.subsection.variable = value ,they are easily hashable and are appropriate for this type of data structure. >>As for the implementation details, there is already a generic hashmap implementation built-in .(described in api-hashmap.txt).I could roll out a new one if you want. >>Unsetting previously set value would become as easy as deleting the key-value pair ,or setting a invalid flag for that pair during file parsing stage. #Complications in using a hashmap:- ----------------------------------- >>To report errors in the config files, a node remembering where the key-value pair came from and from which file, will have to be maintained. It would also be useful in writing key-value pairs back to the file in order of reading. >>In the current code, when somebody does git_config_set() and then later uses git_config() to grab the value of the variable set with the first call, we will read the value written to the file with the first call. With the proposed change, if we parse from the file upfront, callers to git_config_set() will need to somehow invalidate that stale copy in memory. We could do two things about it, 1>Easy approach- Discard and rebuilt the cache every time git_config queries for a value. This is currently, what we are doing, reading and parsing the file every time, a value is called for using git_config. 2>Harder approach- Update the changed part. This seems doable, but I will have to experiment with it to see if it works or not as I am not too much familiar with the code base. But take this initial proposal, git_config_set creates a lock file , updates it, deletes the original config file and renames the lockfile to a new config file. We could take this approach and apply it to the in memory-cache and lock it until git_config_set works on it. >>Precedence order of variables read from different sources. Currently the config is read from multiple sources (repo config file `.git/config`,user config file `$HOME/.gitconfig` and system wide config file `/etc/gitconfig`) with the repo config file having the highest preceding order. It is read last in the current parsing order. In my implementation the value of variables will be implemented as list of strings. So it will be a case of "last one wins", where the value with the highest precedence would be inserted last in the list. So a caller asking for a value would just read the last value in the list without worrying about no allocation or ownership complexity. #Changing the parsing code -------------------------- Currently git_parse_source parses the file and feeds each value to the callback function. git_config uses it. To be backward compatible, I would change the git_parse_source to have a switch for the default parsing behaviour and updated parsing parsing behaviour. In the updated one, the file will be read only once to construct the hash map and after that it would iterate over the values querying the cache each time. The callback would be fed these read values only. #Add new API Calls ------------------ A new set of helpers for retrieving values in a non-callback way will be added. Possible implementation of the API, >>extern int get_config_string_multi(const char * variable, int * num_values, const char **values) Argument list: key: used for retrieving the value list for the respective key num_values: returns no of values in the list vales: list containing the values as string >> A singleton wrapper would look like char *git_get_config_string(const char *name) return the last value for the key "name"(thus implementing the last one wins semantic) Possible implementation (copied directly from the mailing list discussion ) const char *git_get_config_string(const char *name) { const char **values, *result; int num_values; if (git_get_config_string_multi("sample.str", &num_values, &values)) return NULL; result = num_values ? values[num_values - 1] : NULL; free(values); return result; } As there is already a set of helpers for validating the values and converting them to appropriate data types in config.c there won't be any need for specialized functions for each data type. #Rewrite callers: ---------------- After implementing the above ,this would be just busywork and pretty much doable.The callbacks would have to be reworked to use the new api functions. ---------------------------------------------------------------------- ##Tidy configuration files ---------------------------------------------------------------------- #First case: When a configuration file is repeatedly modified, often garbage is left behind. For example, after git config pull.rebase true git config --unset pull.rebase git config pull.rebase true git config --unset pull.rebase the bottom of the configuration file is left with the useless lines [pull] [pull] #Case 2 Also,setting a config value, appends the key-value pair at the end of file without checking for matching empty main section headers. It works fine if the main section header has an already present sub-key. for example:- git config pull.rebase true git config --unset pull.rebase git config pull.rebase true git config pull.option true gives [pull] [pull] rebase = true option = true #Case 3 Also, a possible detriment is presence of comments, For Example:- [my] # This section is for my own private settings Another example you need to consider is # Comment regarding the entire [my] section [my] # Comment on the item foo foo = true # Comment on the item bar bar = false What should "git config --unset my.foo && git config -unset my.bar" do? What if these are done in a different order? #Expected output: 1. When we delete the last key in a section, it automatically delete the section header.(This is debatable) 2. When we add a key into a section, we should be able to reuse the same section header, even if that section did not have any keys in it already(empty section header). #Suggested Approaches :- -------------------------------------- Case first Leave the empty section header as it was and when a new value is set, reuse the header instead of appending at the end of the config file. This can be implemented by changing git_config_set a little, even without using the internal cache. If we use the internal cache,to rewrite the file after setting or unsetting the value, this problem will vanish automatically. Case second This is also a git_config_set problem and can be solved by either using the cache to rewrite the file every time setting or unsetting the value, or reuse the empty section header when adding a new value pair into it. Case three Its a controversial issue , in which the comments added by the user might signify anything(commenting on section header or a particular variable or the above header).I think the comments should be left as it is , as some users leave empty section headers and comments for future use. So removing them may prove detrimental.But , other suggestions are welcome for this problem .The mailing list itself was conflicted about this option. links:- [1]http://thread.gmane.org/gmane.comp.version-control.git/219505 [2]http://thread.gmane.org/gmane.comp.version-control.git/208113 ------------------------------------------------------------------------------------------------------------------ Cheers, Tanay Abhra. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html