Re: [GSoC] Applying for conversion scripts to builtins

Johannes Schindelin <johannes.schindelin@xxxxxx> · Tue, 17 Mar 2015 12:56:15 +0100

Hi Paul,

On 2015-03-17 01:22, Paul Tan wrote:

> On Tue, Mar 17, 2015 at 12:49 AM, Yurii Shevtsov <ungetch@xxxxxxxxx> wrote:
>
> Generally, it would be easy to convert any shell script to C by just
> using the run_command* functions (and in less lines of code), but that
> would not be taking advantage of the potential benefits in porting
> shell scripts to C. To summarize the (ideal) requirements:
> 
> * zero spawning of processes so that the internal object/config/index
> cache can be taken advantage of. (and to avoid the process spawning
> overhead which is relative large in e.g. Windows)

Spawning definitely uses up many more resources on Windows.

However, spawning a full-fledged Bash requires MSys (or soon MSys2) to spin up an entire POSIX emulation layer. This costs us dearly. For example, when I run the t3404 test (which exercises scripting heavily, what with `git rebase -i` being implemented as a shell script) on MacOSX, it takes roughly a minute to complete. On a comparable Windows machine, it takes roughly 12 minutes to complete.

Therefore, I would wager a bet that just the mere conversion of a shell script into even a primitive `run_command()`-based builtin would help performance on Windows in a noticeable manner.

Of course, it would be *even nicer* to avoid the spawning altogether.

> * avoid needless parsing since we have direct access to the C data
> structures.

True that. Turning SHA-1s into strings, spawning, and reparsing the same SHA-1 is quite a lot of unnecessary churn.

The biggest benefit of avoiding needless parsing, however, is not performance. It is avoiding quoting issues. This is particularly so on Windows, where Git is sometimes called from outside a shell environment, where we have to deal with inconsistent quoting because it is every Windows program's own job to parse the command-line, including the quoting.

> * use the internal API as much as possible: share code between the
> builtins (e.g. fmt-merge-msg.c, exposed in fmt-merge-msg.h) in order
> to reduce code complexity.

That is definitely something that even the Git maintainer should be interested in (he does not touch Windows, therefore the performance differences do not concern him): by sharing code paths between different subcommands, you ensure that you have to fix problems only once, not twice or more.

Concrete example: on Windows, we have file locking issues because files that are in use cannot be deleted. For that reason, we have Windows-specific code that is "nice" by trying harder to delete files, giving programs a little time to let their locks go. This locking issue happens also when a virus scanner "uses", say, the .git-rewrite/revs file that was written by `git filter-branch`, while said shell script already wants to delete the file because it is obsolete. If `git filter-branch` were a builtin, the bug would already be fixed due to our override of the `unlink()` function in C. Now we have to fix that bug separately because `filter-branch` is a shell script.

> The biggest wins would definitely be portability, but there may be
> performance improvements, though they are theoretical at this point.
> 
> I'm not exactly sure if the above requirements are sane, which is why
> I'm also CC-ing Dscho who knows the problems of git on Windows more
> than I do.

Thanks for bringing this to my attention. I hope I managed to add useful information to the discussion.

Ciao,
Johannes
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html