Since the deadline is fast approaching, and I've read that google-melange usually becomes overwhelmed near the deadline, I'll try to iterate on the proposal as much as possible. Below is v2, mostly small changes in response to Matthieu's and Junio's reviews. The changes are as follows: * Make it clear that zero spawning of processes is an ideal -- it doesn't have to be so in practice. * Swap rewrite of git-pull and git-am in timeline. It is better to push the first patch to the mailing list as soon as possible. * Make it clear that as part of refactoring, commonly recurring patterns can be codified and implemented in the internal git API. * Add microproject v5. * Make it clear that Windows is not the only one that has poor IO performance. Poor IO performance can stem from the choice of operating system, filesystem and underlying storage performance. * Cite filesystem cache feature in git for windows. * Remove section-numbering directive, github messes it up. * State what the end-product of the final stage is. The updated version is also in the gist[1]. [1] https://gist.github.com/pyokagan/1b7b0d1f4dab6ba3cef1 -->8-- ======================================== Make `git-pull` and `git-am` builtins ======================================== :Abstract: `git-pull` and `git-am` are frequently used git subcommands. However, they are porcelain commands and implemented as shell scripts, which has some limitations which can cause poor performance, especially in non-POSIX environments like Windows. I propose to rewrite these scripts into low level C code and make them builtins. This will increase git's portability, and may improve the efficiency and performance of these commands. .. section-numbering:: Limitations of shell scripts ============================= `git-pull` is a commonly executed command to check for new changes in the upstream repository and, if there are, fetch and integrate them into the current branch. `git-am` is another commonly executed command for applying a series of patches from a mailbox to the current branch. They are both git porcelain commands -- they have no access to git's low level internal API. Currently, they are implemented by the shell scripts ``git-pull.sh`` and ``git-am.sh`` respectively. These shell scripts require a fully-functioning POSIX shell and utilities. As a result, these commands are difficult to port to non-POSIX environments like Windows. Since porcelain commands do not have access to git's internal API, performing any git-related function, no matter how trivial, requires git to be spawned in a separate process. This limitation leads to these git commands being relatively inefficient, and can cause long run times on certain platforms that do not have copy-on-write ``fork()`` semantics. Spawning processes can be slow ------------------------------- Shell scripting, by itself, is severely limited in what it can do. Performing most operations in shell scripts require external executables to be called. For example, ``git-pull.sh`` spawns the git executable not only to perform complex git operations like `git-fetch` and `git-merge`, but it also spawns the git executable for trivial tasks such as retrieving configuration values with `git-config` and even quoting of command-line arguments with ``git rev-parse --sq-quote``. As a result, these shell scripts usually end up spawning a lot of processes. Process spawning is usually implemented as a ``fork()`` followed by an ``exec()`` by shells. This can be slow on systems that do not support copy-on-write semantics for ``fork()``, and thus needs to duplicate the memory of the parent process for every ``fork()`` call -- an expensive process. Furthermore, starting up processes on Windows is generally expensive as it performs `several extra steps`_ such as such as using an inter-process call to notify the Windows Client/Server Runtime Subsystem(CSRSS) about the process creation and checking for App Compatibility requirements. .. _`several extra steps`: http://www.microsoft.com/mspress/books/sampchap/4354a.aspx The official Windows port of git, Git for Windows, uses MSYS2 [#]_ to emulate ``fork()``. Since Windows does not support forking semantics natively, MSYS2 can only emulate ``fork()`` `without copy-on-write semantics`_. Coupled with Windows heavy process creation, this causes huge slowdowns of git on Windows. .. _`without copy-on-write semantics`: https://www.cygwin.com/faq.html#faq.api.fork A "no-updates" `git-pull`, for example, takes an average of 5.1s [#]_, as compared to Linux which only takes an average of 0.08s. 5 seconds, while seemingly short, would seem like an eternity to a user who just wants to quickly fetch and merge changes from upstream. `git-am`'s implementation reads each patch from the mailbox in a while loop, spawning many processes for each patch. Considering the cost of spawning each process, as well as the fact that runtime grows linearly with the number of patches, git-am takes a long time to process a seemingly small number of patches on Windows as compared to Linux. A quick benchmarks shows that `git-am` takes 7m 20.39s to apply 100 patches on Windows, compared to Linux, which took only 0.08s. Commands which call `git-am` are also affected as well. ``git-rebase--am.sh``, which implements the default, patch-based, non-interactive rebase of the `git-rebase` command, is implemented by first calling `git-format-patch` (a C builtin) to generate the patches of commits to be rebased, and then applying the commits on top of the new head using git-am. On Windows, the execution time of `git-am` currently takes up around 98% of `git-rebase`'s execution time for non-interactive rebases [#]_. If `git-am` was made to perform faster, `git-rebase` would benefit greatly. .. [#] The current Git for Windows is based on MSYS, but shares many of the limitations of MSYS2 as they are both derived from cygwin. This proposal is written with MSYS2 in mind, as Git For Windows will use it in the future. .. [#] The `git-pull` from a local remote with no updates. As a benchmark, `git-pull` was run with this setup 100 times on Windows and Linux boxes with similar specs. `git-pull` was executed from a native process on Windows to avoid MSYS2's forking overhead. .. [#] git-rebase was used to rebase 100 commits in non-interactive mode with ``GIT_TRACE`` on. The timings were taken from the resulting output and matched up with the code in ``git-am.sh`` and ``git-rebase--am.sh``. The total execution time was 8m 0s, of which `git-format-patch` took only 0.5s, but `git-am` took 7m 52s. Scripting languages introduce more dependencies ------------------------------------------------- Scripting languages, while allowing for faster prototyping of commands, introduce dependencies into the git distribution. The problem is greater with shell scripts which, due to their simple nature, need to call other executables to perform non-trivial tasks. `git-pull` and `git-am` combined requires `test`, `printf`, `echo`, `sed`, `tr`, `grep`, `cat`, `mkdir`, `rm`, `mv`, `perl` etc. These may seem like common POSIX utilities, but when porting git to non-POSIX platforms, these commands must all be ported and installed on the system in its entirety. Git for Windows, for example, requires around 250MB of disk space for its installation as it includes the MSYS2 shell and utilities, as well as language runtimes for Perl and Tcl/Tk. Furthermore, should any workarounds be required to support the platform, it must be implemented in all of these dependency libraries and executables. Git, for example, implements a compatibility function for ``unlink()`` on Windows that deals with failures that can occur if another process has an open handle on the file [#]_. However, this only helps git builtins, and will have to be implemented again on all the other dependency shell utilities. Thus, shell scripts do not benefit from this compatibility function, and will have to workaround file deletion failures separately, leading to lots of duplicated code. .. [#] See ``mingw_unlink()`` in ``compat/mingw.c``. Scripts have inefficiencies ---------------------------- As stated above, porcelain do not have access to the git internal API, and thus have to spawn git processes in order to perform any git-related operation, no matter how trivial it is. In every git invocation, git would have to re-read the user's configuration files, repository index, repopulate the filesystem cache [#]_ etc. This leads to a lot of unnecessary IO, which on systems with poor IO performance [#]_ will lead to runtime overhead as well. Git has internal caches for configuration values, the repository index and repository objects. By rewriting the porcelain into C builtins and using the git internal API instead of spawning separate git processes, these caches can be taken advantage of to reduce redundant IO accesses. .. [#] fscache, currently only implemented in Git for Windows. See commit `99943de`_. .. _`99943de`: https://github.com/git-for-windows/git/commit/99943de .. [#] Poor IO performance can stem from the choice of operating system, filesystem as well as the underlying storage medium. Windows, for example, is known to have worse IO performance than Linux, which git was historically primarily written for. Network file systems such as NFS can also cause poor IO performance due to network latency. IO performance of HDDs will be lower than SSDs due to slower random access and higher latency. Rewriting shell scripts to C ============================= In order to eliminate the aforementioned portability and performance limitations, I propose that these scripts be rewritten into C builtins. The C programming language provides a richer set of data types, and also provides access to git's internal API. Thus, the above problems can be solved: * git is a native executable on Windows, and thus does not require MSYS2 POSIX emulation. Spawning processes will not incur the overhead of MSYS2 fork emulation. * Converting shell scripts into builtins will reduce the git distribution's dependence on other software. In the future, if all core git functionality is converted into C, it would greatly ease porting of git to platforms which only require the core subset of git's functionality. * Taking advantage of git's internal API and C data types can remove the inefficiencies inherent in shell scripts. For example, instead of spawning a `git-config` process to get a configuration value, ``git_config_get_value()`` can be used. This function also takes advantage of git's configuration cache, which means that configuration files only need to be read once to access all of the configuration values, thus reducing IO. In addition, `git-pull` and `git-am`'s behavior can also become more consistent with other git builtin through the git internal API. Currently, command line arguments in `git-pull` and `git-am` are parsed manually with a while loop and pattern matching. As a result, their command line interface is inconsistent with other git builtins, which use the ``parse-options`` API. For example, in `git-pull` the option ``--st`` should thrown an ambiguous option error as it could mean ``--stat`` or ``--strategy``, but ``git-pull.sh`` is hardcoded to assume ``--st`` resolves to ``--strategy``. `git-pull` and `git-am` also do not provide usage information with the ``-h`` flag. Rewriting these commands to use ``parse-options`` would make their command line interfaces consistent with other git builtins. In order to fully take advantage of the above benefits of implementing `git-pull` and `git-am` in C, especially with regards to making the resulting C code as efficient and fast as possible, I propose the following requirements for the rewritten code: 1. Ideally [#]_, no spawning of external git processes. This is to support systems with high ``fork()`` or process creation overhead, and to reduce redundant IO by taking advantage of the internal object, index and configuration cache. 2. Avoid needless parsing/string conversions by directly accessing the C data structures. Massaging of strings cause too much code churn, so functions that take the richer C data types as input should be preferred over functions that work by parsing string arguments. Overall, rewriting git-pull and git-am into efficient C code will most likely lead to impressive speedups on platforms most affected by the aforementioned limitations. The `prototype rewrite of git-pull`_, for example, takes only an average of 0.63s on the same "no-updates" benchmark. Compared to the shell script which took an average of 5.1s, this is nearly an 8x improvement in execution time. .. _`prototype rewrite of git-pull`: http://thread.gmane.org/gmane.comp.version-control.git/265628 On the other hand, the rewritten code should not cause regressions in the command as compared to the original scripts as the time wasted on dealing with the bugs may outweigh any advantages of rewriting these scripts, especially on platforms which do not suffer so much from the limitations of scripts. Therefore, I propose the following additional requirements: 3. The resulting builtin should not have wildly different behavior or bugs compared to the shell script. 4. The rewritten code should not be much harder to maintain than the original script. These requirements leads to some potential difficulties in rewriting `git-pull` and `git-am`. .. [#] This requirement needs to be balanced with the fact that significant refactoring of the git codebase may be required to access certain functionality without spawning a separate process. (e.g. ensuring errors are propogated instead of ``die()``-ing). Potential difficulties ======================= Rewriting code may introduce bugs ---------------------------------- A dangerous possibility of rewriting code is that bugs may be introduced into the rewritten version without detection. The test suite can help in catching some bugs, however there is still the possibility that the test suite may not be comprehensive enough. Generally, the ideal method of catching errors and bugs is through careful rewriting and review of the code. However, such reviews would require a lot of time and effort. In order to ensure that the rewrite can be completed during the GSoC period, it would be useful if the test suite could provide some sort of guarantee that the rewritten code *most probably* did not introduce any bugs. For this purpose, code coverage tools can be used to ensure that the test suite for the command tests all major code paths in the rewritten built-in. We can also avoid introducing logical errors by following the logical structure of the original script in the rewritten code. This means largely doing a line-by-line, one-to-one translation of the shell script into C. For instance, this line in ``git-pull.sh`` calls `git-merge-base` to calculate the point in history in which ``$curr_branch``, the current branch, forked from the upstream remote tracking branch, ``$remoteref``:: oldremoteref=$(git merge-base --fork-point "$remoteref" $curr_branch 2>/dev/null) Assuming ``remoteref`` and ``curr_branch`` are strings, the equivalent C code, using the ``run-command`` API, would look like:: struct child_process cmd = CHILD_PROCESS_INIT; struct strbuf oldremoteref = STRBUF_INIT; argv_array_pushl(&cmd.args, "merge-base", "--fork-point", remoteref, curr_branch, NULL); cmd.out = -1; cmd.no_stderr = 1; cmd.git_cmd = 1; if (!start_command(&cmd)) { strbuf_read(&oldremoteref, cmd.out, 0); close(cmd.out); finish_command(&cmd); } Of course, the downside of following this too strictly is that if there were any logical bugs in the original code, or if the original code is unclear, the rewritten code would inherit these problems too. For example, the code in ``git-pull.sh`` assumes in certain code paths that the repository and refspecs to fetch are the only arguments present in ``"$@"``. However, the ``"$@"`` automatic variable would also contain the command-line options to be passed to `git-fetch` as well [#]_. The rewritten `git-pull` should, for the purpose of making code maintenance easy, fix these logical errors. .. [#] An example of such an assumption can be found in ``remoteref="$(get_remote_merge_branch "$@" 2>/dev/null)"``. ``get_remote_merge_branch`` assumes that the first argument is the remote name, and will thus fail (silently) if it ends up being arguments for git-fetch. Rewritten code may become harder to understand ----------------------------------------------- For the purpose of reducing git's dependencies, the rewritten C code should not depend on other libraries or executables other than what is already available to git builtins. However, C is a much more verbose language than shell script, and does not provide much advanced facilities for processing text. As a result, text processing functions in shell scripts can end up as huge multi-line functions in C. For instance, in ``git-pull.sh``, by relying on `sed` and `tr`, all heads valid for merge from ``FETCH_HEAD`` can be retrieved with a single short-and-succinct pipeline:: merge_head=$(sed -e '/ not-for-merge /d' \ -e 's/ .*//' "$GIT_DIR"/FETCH_HEAD | \ tr '\012' ' ') A strict translation of the pipeline to C becomes something like:: struct strbuf merge_head = STRBUF_INIT; FILE *fp = fopen(git_path("FETCH_HEAD"), "r"); for (;;) { char sha1_text[40]; char text[128]; int c, ret; ret = fscanf(fp, "%40c %127[^\n]", sha1_text, text); if (ret == EOF) break; else if (ret != 2) die(_("error parsing %s"), filename); for (;;) { c = fgetc(fp); if (c == '\n' || c == EOF) break; } if (strstr(text, "not-for-merge")) continue; strbuf_addstr(merge_head, sha1_text); strbuf_addstr(merge_head, " "); } fclose(fp); We can see that the C version requires much more lines compared to the shell pipeline, which will contributes to the "bloating" of the C code. It may even seem to be more cryptic. This is a known trade-off of the C programming language, and not depending on external libraries and dependencies. The best solution is to improve the design of the rewritten code. For example, the above can be moved into an appropriately-named function. The function can then be documented with well-defined inputs, outputs, and behavior. This should help the resultant code to be much more understandable. Recurring code patterns, once identified, can be codified and implemented in the internal git API itself. This not only helps to shrink the C code, but it can also benefit the rest of the git code base. Development approach and deliverables ====================================== In the first stage (and throughout the course of the project), test suite code coverage for each command is reviewed using `kcov`_ [#]_. At the minimum, the test suite should cover the major code paths and command-line switches of the script. These tests do not necessarily have to be perfect or comprehensive before the next stage can begin [#]_, however the minimum required tests must be identified and written in order to give a good idea of what needs to be tested, and provides a safety net for the later stages. Tests will be kept in a separate patch queue as they are applicable whether `git-pull` and `git-am` are builtins or not. By the end of the project, the test suite for `git-pull` and `git-am` will be more comprehensive. .. _kcov: http://simonkagstrom.github.io/kcov/ In the second stage, the shell scripts are translated line-by-line into their corresponding C code. The translation should not be overly strict -- where appropriate the corresponding git internal API functionality can be used instead. For example, in `git-pull` and `git-am` the ``parse-options`` API can be used instead of manually implementing command-line argument parsing like how the shell scripts currently do. If the shell script has any obvious bugs or errors (perhaps identified in the first stage), they should be fixed in this stage as well. The resultant C code at the end of this stage may not necessarily be efficient, but it should be functionally complete and error free to serve as a baseline product for the final stage. The final product would most probably look something like the `prototype rewrite of git-pull`_, but with more usage of the run-command APIs and without the bugs. .. _`prototype rewrite of git-pull`: http://thread.gmane.org/gmane.comp.version-control.git/265628 In the final stage, the baseline C code will be optimized for speed and efficiency. For prioritization of efforts, the baseline builtins will be profiled using their respective test suites in order to find hotspots. The builtins will also be benchmarked to compare their performance with the original shell scripts. Since `git-pull` is just a simple wrapper around `git-fetch`, `git-merge` and `git-rebase`, the benchmark will simply concentrate on run time for common use cases. (e.g. the "no updates" use case). For `git-am`, the benchmark can rely on its performance when applying a large number of patches (e.g. a `patchbomb benchmark`_). .. _`patchbomb benchmark`: http://thread.gmane.org/gmane.comp.version-control.git/1681/focus=2369 Should there be no hotspots found, general optimization will be done to reduce overhead. ``run-command`` API calls will be replaced with the relevant corresponding git internal API calls. Strings will be replaced with their corresponding git C data types where possible to avoid needless string conversions. For example, SHA1s will be parsed once and stored in ``unsigned char[20]`` or ``sha1_array`` data types. Some refactoring may also be required. For example, for `git-pull` to access ``git-merge-base --fork-point`` functionality without spawning a separate process, the function ``handle_fork_point()`` in ``builtin/merge-base.c`` must be made available. The function must also be modified to not call ``die()`` on errors. In order to help speed up polishing and code review, this optimization will be done iteratively in small patches on top of the baseline C code produced in stage 2. The final product of this stage, and the entire project, would be a patch series containing the baseline C code as the first patch, and optimizations to it in subsequent patches. .. [#] An example kcov code coverage report for the git-pull tests can be found `here`_ .. _here: https://83dc3588620c5a97164396ec753a2fa480f6a7b0-www.googledrive.com/host/0B4O2AiYulllpfmJlTW4xT050OVVicnNWWS02dm52aTJ2TFIwQ2QwdWh0VHotSkU4eUNNWjg/git-pull/git-pull.360f32c2.html .. [#] In fact, improvements to the test suite will most likely proceed in parallel with the rest of the project as the process of studying and rewriting the shell scripts will provide insights on what needs to be tested. Timeline ========= Unfortunately, my university term begins on the 3rd of August. On the other hand, though, I am currently not attending university so I'm effectively free before the Summer holidays even begin. As such, to compensate, I will begin coding right after accepted student proposals are announced so that the period for coding is still 3 months. The timeline is as follows: * Apr 27: Accepted student proposals announced. * Apr 27: Stage 1 begins: Review of test suite coverage. * Apr 27 onwards: Discussion of any bugs or behavior in `git-am` and `git-pull` that should be fixed or changed. This will continue throughout the project. * May 4: Stage 2 (Baseline rewrite) begins. * May 4 - May 10 (1 week): Full baseline rewrite of ``git-pull.sh``. * May 11: 1st version of ``builtin/pull.c`` published on mailing list for review. * May 11 - May 24 (2 weeks): Full baseline rewrite of ``git-am.sh``. * May 25: 1st version of ``builtin/am.c`` published on mailing list for review. * May 25 - Jun 22 (5 weeks): ``builtin/am.c`` and ``builtin/pull.c`` are iterated on in an alternating fashion to pass code review. * Jun 22: Stage 3 (Optimization) begins * Jun 22 - Jun 24: Benchmarking and profiling is done on the new builtins, especially as compared to their original shell scripts, to check for any performance improvements and to guide optimization. These results are published to mailing list and discussed. * Jun 24 - Jul 19 (3.5 weeks): Stage 3 (Optimization). ``builtin/am.c`` and ``builtin/pull.c`` are iterated on in an alternating fashion to optimize the code in small steps. * Jul 3: Mid-term evaluations deadline * Jul 20 - Aug 2 (2 weeks): Pencils down -- focus on polishing up ``builtin/am.c``, ``builtin/pull.c`` and the tests patch series, and get them to pass code review. * Aug 17: Suggested 'pencils-down' date. * Aug 21: Firm 'pencils-down' date. * Aug 28: Final evaluation deadline About me ========= I've been accepted to the `National University of Singapore`_ for a bachelors in Computer Engineering, and will matriculate on 3rd August 2015. I've been writing Python, C and C++ for around 4 years, and some of my experiments can be found on my `github profile`_. I have also contributed patches successfully to `libuv`_ and `node-contextify`_. My microproject is to implement ``XDG_CONFIG_HOME support`` in git-credential-store. Patch series: `v5`_, `v4`_, `v3`_, `v2`_, `v1`_. .. _`National University of Singapore`: http://www.ceg.nus.edu.sg/ .. _`github profile`: https://github.com/pyokagan .. _`libuv`: https://github.com/joyent/libuv/commit/347000929de775ae6b975a2959f5fabe81d6f1f0 .. _`node-contextify`: https://github.com/brianmcd/contextify/commit/7e3e0a89779cd892b3d8604795b6675d6aecff17 .. _v1: http://thread.gmane.org/gmane.comp.version-control.git/264682 .. _v2: http://thread.gmane.org/gmane.comp.version-control.git/265042 .. _v3: http://thread.gmane.org/gmane.comp.version-control.git/265305 .. _v4: http://thread.gmane.org/gmane.comp.version-control.git/265683 .. _v5: http://thread.gmane.org/gmane.comp.version-control.git/266175 -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html