Re: Git in Outreachy December 2019?

Johannes Schindelin <Johannes.Schindelin@xxxxxx> · Tue, 17 Sep 2019 13:23:18 +0200 (CEST)

Hi Emily,

On Mon, 16 Sep 2019, Emily Shaffer wrote:

> Jonathan Tan, Jonathan Nieder, Josh Steadmon and I met on Friday to
> talk about projects and we came up with a trimmed list; not sure what
> more needs to be done to make them into fully-fledged proposals.

Thank you for doing this!

> For starter microprojects, we came up with:
>
>  - clean up a test script (although we need to identify particularly
>    which ones and what counts as "clean")
>  - moving doc from Documentation/technical/api-* to comments in the
>    appropriate header instead
>  - teach a command which currently handles its own argv how to use
>    parse-options instead
>  - add a user.timezone option which Git can use if present rather than
>    checking system local time

Nice projects, all. There are a couple more ideas on
https://github.com/gitgitgadget/git/issues, they could probably use some
tagging.

> For the longer projects, we came up with a few more:
>
>  - find places where we can pass in the_repository as arg instead of
>    using global the_repository

Good project, if a bit boring ;-) Also, `the_index` is used a lot in
`builtin/*.c`, still.
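
To illustrate the shape of such a change, here is a minimal sketch. The
struct and both function names are made up for this example (Git's real
`struct repository` is much larger); only the pattern matters: the
caller passes the repository in, instead of the callee reaching for a
global.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for Git's much larger repository struct. */
struct repository {
	const char *gitdir;
};

/* Stand-in for the global that the project wants to stop relying on. */
struct repository the_repo_instance = { ".git" };
struct repository *the_repository = &the_repo_instance;

/* Before: the function silently depends on the global. */
size_t gitdir_len_global(void)
{
	return strlen(the_repository->gitdir);
}

/* After: the caller decides which repository is meant. */
size_t gitdir_len(const struct repository *r)
{
	assert(r && r->gitdir);
	return strlen(r->gitdir);
}
```

The second form can be called with any repository (say, a submodule's),
which is the whole point of the exercise.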

>  - convert sh/pl commands to C, including:
>    - git-submodule.sh

I am of two minds there. But mostly, I am of the "friends don't let
friends use submodules" camp, so I would not even want to mentor for
this project: it just makes me shudder too much every time I have to
work with/on submodules.

Of course, if others are interested, I'd hardly object to turning this
into a built-in.

>    - git-bisect.sh

That would be my top recommendation, especially given how much effort
Tanushree put in last winter to make this conversion to C so much more
achievable than before.

>    - rebase --preserve-merges

No. `rebase -p` is already deprecated in favor of `rebase -r` (which
_is_ already built-in).

I already have patches lined up to drop that rebase backend. Let's not
waste effort on converting this script to C.

>    - add -i

Please see PRs #170-#175 on https://github.com/gitgitgadget/git/pulls,
and please do help by adding your review of #170, which was already
submitted as v4:
https://public-inbox.org/git/pull.170.v4.git.gitgitgadget@xxxxxxxxx/

In other words: this project is well under way. In fact, Git for Windows
users enjoy this as an opt-in already.

>    (We were afraid this might be too boring, though.)

Converting shell/Perl scripts into built-in C never looks as much fun as
open-ended projects with lots of playing around, but the advantage of
the former is that they can be easily structured, offer a lot of
opportunity for learning, and they are ultimately more rewarding because
the goals are much better defined than many other projects'.

Another script that would _really_ benefit from being converted to C:
`mergetool`. Especially on Windows, where the over-use of spawned
processes really hurts, it is an awfully slow command.

To complete the list of sh/pl commands:

- git-merge-octopus.sh
- git-merge-one-file.sh
- git-merge-resolve.sh

These seem to be good candidates for conversion to built-ins. Their
functionality is well-exercised in the test suite, their complexity is
quite manageable, and there is no good reason that these should be
scripted.

The only slightly challenging aspect might be that `merge-one-file` is
actually not a merge strategy, but a helper that is passed to
`git merge-index` as its `<merge-program>` argument (as in `git
merge-index -o git-merge-one-file -a`), which makes it slightly awkward
to implement as a built-in. A better approach would therefore be to
special-case this value in `merge-index` and execute the C code
directly, without the detour of spawning a built-in.
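
A rough sketch of what that special-casing could look like follows. All
names here (`merge_fn`, `merge_one_file_in_process()`,
`lookup_builtin_merge_program()`) are hypothetical and not taken from
Git's code base; the helper's real arguments and logic are left out.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical signature for a merge helper: returns 0 on success. */
typedef int (*merge_fn)(const char *path);

/* Placeholder for a C port of git-merge-one-file.sh (logic omitted). */
int merge_one_file_in_process(const char *path)
{
	assert(path);
	return 0;
}

/*
 * Return the in-process implementation for a merge program name, or
 * NULL if the caller must fall back to spawning the program.
 */
merge_fn lookup_builtin_merge_program(const char *program)
{
	assert(program);
	if (!strcmp(program, "git-merge-one-file"))
		return merge_one_file_in_process;
	return NULL;
}
```

`merge-index` would consult such a lookup once and only spawn a process
when it returns NULL, so user-provided merge programs keep working.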

- git-difftool--helper.sh
- git-mergetool--lib.sh

These would be converted as part of making `mergetool` a built-in, I
believe.

- git-filter-branch.sh

This one is in the process of being deprecated in favor of `git
filter-repo` (which is an external tool), so I don't think there would
be much use in wasting energy on trying to convert it to C. Especially
given that it wants to call shell script snippets all over the place,
and those shell script snippets are supposed to run in the same context,
which might actually make it completely impossible to convert this to C
at all.

- git-legacy-stash.sh

This will go away once the built-in stash is considered good enough.

- git-instaweb.sh
- git-request-pull.sh
- git-send-email.perl
- git-web--browse.sh

I don't think that any of these should be converted. They are just too
unimportant from a performance point of view, and obscure enough that
even their portability issues don't matter too much.

As to `send-email` in particular: I would not want anybody to drag in
all the dependencies required to convert `send-email` to a built-in to
begin with.

- git-archimport.perl
- git-cvsexportcommit.perl
- git-cvsimport.perl
- git-cvsserver.perl
- git-quiltimport.sh
- git-svn.perl

These are all connectors of some sort to other version control software.
It also feels like they become less and less important, as Git really
takes over the world.

At some stage, I think, it would make sense to push those scripts out
into their own repositories, looking for new maintainers (and if none
can be found, then there really is not enough need for them to begin
with, and they can be archived).

>  - reduce/eliminate use of fetch_if_missing global
>  - create a better difftool/mergetool for format of choice (this one
>    ends up existing outside of the Git codebase, but still may be pretty
>    adjacent and big impact)
>  - training wheels/intro/tutorial mode? (We thought it may be useful to
>    make available a very basic "I just want to make a single PR and not
>    learn graph theory" mode, toggled by config switch)
>  - "did you mean?" for common use cases, e.g. commit with a dirty
>    working tree and no staged files - either offer a hint or offer a
>    prompt to continue ("Stage changed files and commit? [Y/n]")
>  - new `git partial-clone` command to interactively set a filter,
>    configure other partial clone settings
>  - add progress bars in various situations
>  - add a TUI to deal more easily with the mailing list. Jonathan Tan has
>    a strong idea of what this TUI would do... This one would also end up
>    external but adjacent to the Git codebase.

I don't think that this would be a good project for anybody except
people who are already really, really familiar with our mailing
list-centric workflow.

>  - try and make progress towards running many tests from a single test
>    file in parallel - maybe this is too big, I'm not sure if we know how
>    many of our tests are order-dependent within a file for now...

Another, potentially more rewarding, project would be to modernize our
test suite framework, so that it is not based on Unix shell scripting,
but on C instead.

The fact that it is based on Unix shell scripting not only costs a lot
of speed, especially on Windows, it also limits us quite a bit, and I am
talking about a lot more than just the awkwardness of having to think
about options of BSD vs GNU variants of common command-line tools.

For example, many, many, if not all, test cases spend the majority of
their code on setting up specific scenarios. I don't know about you,
but personally I have to dive into many of them when things fail (and I
_dread_ the numbers 0021, 0025 and 3070, let me tell you) and I really
have to say that most of that code is hard to follow and does not make
it easy to form a mental model of what the code tries to accomplish.

To address this, a while ago Thomas Rast started to use `fast-export`ed
commit histories in test scripts (see e.g. `t/t3206/history.export`). I
still find that this fails to make it easier for occasional readers to
understand the ideas underlying the test cases.

Another approach is to document the ideas heavily first, then use code
to implement them. For example, t3430 starts with this:

	[...]

	Initial setup:

	    -- B --                   (first)
	   /       \
	 A - C - D - E - H            (master)
	   \    \       /
	    \    F - G                (second)
	     \
	      Conflicting-G

	[...]

	test_commit A &&
	git checkout -b first &&
	test_commit B &&
	git checkout master &&
	test_commit C &&
	test_commit D &&
	git merge --no-commit B &&
	test_tick &&
	git commit -m E &&
	git tag -m E E &&
	git checkout -b second C &&
	test_commit F &&
	test_commit G &&
	git checkout master &&
	git merge --no-commit G &&
	test_tick &&
	git commit -m H &&
	git tag -m H H &&
	git checkout A &&
	test_commit conflicting-G G.t

	[...]

While this is _somewhat_ better than having only the code, I am still
unhappy about it: this wall of `test_commit` lines interspersed with
other commands is very hard to follow.

If we were to (slowly) convert our test suite framework to C, we could
change that.

One idea would be to allow recreating commit history from something that
looks like the output of `git log`, or even `git log --graph --oneline`,
much like `git mktree` (which really should have been a test helper
instead of a Git command, but I digress) takes something that looks like
the output of `git ls-tree` and creates a tree object from it.
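
To make this a bit more concrete, here is a sketch of a parser for a
much simpler, made-up format than `git log --graph` output: one commit
per line, written as "<name>" for a root commit or "<name>: <parent>..."
otherwise. The format, the struct and the function name are assumptions
for illustration only; creating the actual commit objects is left out.

```c
#include <assert.h>
#include <string.h>

#define MAX_PARENTS 8
#define MAX_NAME 32

struct commit_spec {
	char name[MAX_NAME];
	char parents[MAX_PARENTS][MAX_NAME];
	int nr_parents;
};

/*
 * Parse one line of the hypothetical format described above.
 * Returns 0 on success, -1 on malformed input.
 */
int parse_commit_spec(const char *line, struct commit_spec *out)
{
	char buf[256], *tok, *colon;

	assert(line && out);
	if (strlen(line) >= sizeof(buf))
		return -1;
	strcpy(buf, line);
	memset(out, 0, sizeof(*out));

	/* Split the line into "name" and "parents" at the colon, if any. */
	colon = strchr(buf, ':');
	if (colon)
		*colon = '\0';

	/* Exactly one token is allowed before the colon: the commit name. */
	tok = strtok(buf, " \t\n");
	if (!tok || strlen(tok) >= MAX_NAME)
		return -1;
	strcpy(out->name, tok);
	if (strtok(NULL, " \t\n"))
		return -1;

	if (!colon)
		return 0; /* root commit */

	/* Every remaining token names one parent. */
	for (tok = strtok(colon + 1, " \t\n"); tok;
	     tok = strtok(NULL, " \t\n")) {
		if (out->nr_parents == MAX_PARENTS || strlen(tok) >= MAX_NAME)
			return -1;
		strcpy(out->parents[out->nr_parents++], tok);
	}
	return 0;
}
```

With this format, the t3430 history above would read "A", "B: A",
"C: A", "D: C", "E: D B", "F: C", "G: F", "H: E G", which is a lot
easier to compare against the diagram than the setup code.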

Another thing that would be much easier if we moved more and more parts
of the test suite framework to C: we could implement more powerful
assertions, a lot more easily. For example, the trace output of a failed
`test_i18ngrep` (or `mingw_test_cmp`!!!) could be made a lot more
focused on what is going wrong than on cluttering the terminal window
with almost useless lines which are tedious to sift through.
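
As a small example of what "more focused" could mean: instead of dumping
both files, an assertion could report only where they start to differ.
The function below is a hypothetical helper, not something that exists
in Git's test suite.

```c
#include <assert.h>

/*
 * Return the 1-based number of the first line at which `expect` and
 * `actual` differ, or 0 if the two texts are identical.
 */
int first_differing_line(const char *expect, const char *actual)
{
	int line = 1;

	assert(expect && actual);
	/* Walk both texts in lockstep while they agree. */
	while (*expect && *expect == *actual) {
		if (*expect == '\n')
			line++;
		expect++;
		actual++;
	}
	/* Either both ended here (identical) or they diverge on `line`. */
	return (*expect == *actual) ? 0 : line;
}
```

A failing comparison could then print just that line number and the
two differing lines, rather than everything.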

Likewise, having a framework in C would make it a lot easier to improve
debugging, e.g. by making test scripts "resumable" (guarded by an
option, it could store a complete state, including a copy of the trash
directory, before executing commands, which would allow "going back in
time" and calling a failing command with a debugger, or with valgrind, or
just seeing whether the command would still fail, i.e. whether the test
case is flaky).

Also, things like the code tracing via `-x` (which relies on Bash
functionality in order to work properly, and which _still_ does not work
as intended if your test case evaluates a lazy prereq that has not been
evaluated before) could be "done right".

In many ways, our current test suite seems to test Git's functionality
as much as (core) contributors' abilities to implement test cases in
Unix shell script, _correctly_, and maybe also contributors' patience.
You could say that it tests for the wrong thing at least half of the
time, by design.

It might look like a somewhat less important project, but given that we
exercise almost 150,000 test cases with every CI build, I think it does
make sense to sharpen our axe for a while, so to speak.

Therefore, it might be a really good project to modernize our test
suite: to take ideas from modern test frameworks such as Jest and try to
bring them to C. That means that new contributors would probably be
better suited to work on this project than Git old-timers!

And the really neat thing about this project is that it could be done
incrementally.

> It might make sense to only focus on scoping the ones we feel most
> interested in. We came up with a pretty big list because we had some
> other programs in mind, so I suppose it's not necessary to develop all
> of them for this program.

I don't find that list particularly big, to be honest ;-)

Ciao,
Dscho