Re: [RFC v3][GSoC] Project proposal: convert interactive rebase to C

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi,
this is the first draft of my proposal.

---
ABSTRACT

git is a modular source control management software, and all of its
subcommands are programs on their own. A lot of them are written in C,
but a couple of them are shell or Perl scripts. This is the case of
=git-rebase--interactive= (or interactive rebase), which is a shell
script. Rewriting it in C would improve its performance, its
portability, and maybe its robustness.


ABOUT `git-rebase` AND `git-rebase--interactive`

git-rebase allows to re-apply changes on top of another branch. For
instance, when a local branch and a remote branch have diverged,
git-rebase can re-unify them, applying each change made on the
local branch on top of the remote branch.

git-rebase--interactive is used to reorganize commits by reordering,
rewording, or squashing them. To achieve this purpose, =git= opens the
list of commits to be modified in a text editor (hence the
interactivity), as well as the actions to be performed for each of
them.


PROJECT GOALS

Back in 2016, Johannes Schindelin discussed[1] about retiring
git-rebase.sh (described here as a “hacky shell script”) in favor of
a builtin. He explained that, as it’s only a command-line parser for
git-rebase--am.sh, git-rebase--interactive.sh, and
git-rebase--merge.sh, these 3 scripts should be rewritten first.

The goal of this project is to rewrite git-rebase--interactive.sh in
C, for multiple reasons :

Performance improvements
Shell scripts are inherently slow. When Johannes Schindelin began to
rewrite some parts of git-rebase--interactive in C, its performance
increased from 3 to 5 times, depending on the platform[2].

That’s because each command is a program by itself. So, for each
command, the shell interpreter has to spawn a new process and to load
a new program (with fork() and exec() syscalls), which is an
expensive process.

Those commands can be other git commands. Sometimes, they are
wrappers to call internal C functions (eg. git-rebase--helper),
something shell scripts can’t do natively. These wrappers basically
parse the parameters, then start the appropriate function, which is
obviously slower than just calling a function from C.

Other commands can be POSIX utilities (eg. sed, cut, etc.). They
have their own problems (speed aside), namely portability.

Portability improvements
Shell scripts often relies on many of those POSIX utilities, which are
not necessarily natively available on all platforms (most notably,
Windows), or may have more or less features depending on the
implementation.

Although C is not perfect portability-wise, it’s still better than
shell scripts. For instance, the resulting binaries will not
necessarily depend on third-party programs or libraries.


RISKS

Of course, rewriting a piece of software takes time, and can lead to
regressions (ie. new bugs). To mitigate that risk, I should understand
well the functions I want to rewrite, run tests on a regular basis and
write new if needed, and of course discuss about my code with the
community during reviews.


APPROXIMATIVE TIMELINE

Normally, I would be able to work 35 to 40 hours a week. When I have
courses or exams at university, I could work between 20 and 25 hours a
week.

Community bonding --- April 23, 2018 -- May 14, 2018
/I’ll still have courses at the university during this period./

During the community bonding, I would like to dive into git’s
codebase, and to understand what git-rebase--interactive does under
the hood. At the same time, I’d communicate with the community and my
mentor, seeking for clarifications, and asking questions about how
things should or should not be done.

Weeks 1 & 2 --- May 14, 2018 -- May 27, 2018
/From May 14 to 18, I have exams at the university, so I won’t be able
to work full time./

I would search for edge cases not covered by current tests and write
some if needed.

Week 3 --- May 28, 2018 -- June 3, 2018
At the same time, I would refactor --preserve-merges in its own
shell script, if it has not been deprecated or moved in the
meantime. Dscho explained that this would be the first step of the
conversion[1]. This operation is not really tricky by itself,
as --preserve-merges is about only 50 lines of code into
git_rebase__interactive(), plus some specific functions
(eg. pick_one()).

Weeks 4 to 7 --- June 4, 2018 -- July 1, 2018
Then, I would start to incrementally rewrite
git-rebase--interactive.sh functions in C, and move them
git-rebase--helper.c. Newly-rewritten C functions are then
associated to command-line parameters to be able to use them from
shell scripts.

Examples of such conversion can be found in commits
0cce4a2756[3] (rebase -i -x: add exec commands via the
rebase--helper) and b903674b35[4] (bisect--helper:
`is_expected_rev` & `check_expected_revs` shell function in C).

There is a lot of functions into git-rebase--interactive.sh to
rewrite. Most of them are small, and some of them are even wrappers
for a single command (eg. do_next()), so they shouldn’t be really
problematic.

A couple of them are quite long (eg. pick_one()), and will probably
be even longer once rewritten in C due to the low-level nature of the
language. They also tend to depend a lot on other smaller functions.

The plan here would be to start rewriting the smaller functions when
applicable (ie. they’re not a simple command wrapper) before
working on the biggest of them.

Of course, rewriting a function should not cause any breakage in the
test suite.

Week 8 --- July 2, 2018 -- July 8, 2018
Then, I plan to polish the new code, in order to improve its
performance or to make it more readable.

Benchmarking will be done using the existing script in t/perf.

Weeks 9 & 10 --- July 9, 2018 -- July 22, 2018
When all majors functions from git-rebase--interactive.sh have been
rewritten in C, I would retire the script in favor of a builtin.

Weeks 11 & 12 --- July 23, 2018 -- August 5, 2018
In the last two weeks, I would improve the code coverage where needed.

I also plan to use this time as a backup if I am late on my planning.


IF TIMES PERMITS

 * Add an option to include the patch in the message of commits to be
   reworded, as proposed by Ævar Arnfjörð Bjarmason[5].


COMMUNICATING ABOUT MY WORK

I will communicate every week about the state of my work on the
mailing list.


ABOUT ME

My name is Alban Gruin. I am an undergraduate at the Paul Sabatier
University in Toulouse, France, where I have been studying Computer
Sciences for a year and a half. My timezone is UTC+02:00.

I have been programming in C for the last 5 years. I learned using
freely available resources online, and by attending class ever since
last year.

I am also quite familiar with shell scripts, and I have been using
git for the last 3 years.

My e-mail address is alban <dot> gruin <at> gmail <dot> com, my IRC
nick is abngrn.

My micro-project was “userdiff: add built-in pattern for
golang”[6][7].
---

You can find the Google Doc version here[8].

Regards,
Alban Gruin

[1]
https://public-inbox.org/git/alpine.DEB.2.20.1609021432070.129229@virtualbox/
[2]
https://public-inbox.org/git/cover.1472457609.git.johannes.schindelin@xxxxxx/
[3] https://git.kernel.org/pub/scm/git/git.git/commit/?id=0cce4a2756
[4] https://git.kernel.org/pub/scm/git/git.git/commit/?id=b903674b35
[5] https://public-inbox.org/git/87in9ucsbb.fsf@xxxxxxxxxxxxxxxxxxx/
[6]
https://public-inbox.org/git/20180228172906.30582-1-alban.gruin@xxxxxxxxx/
[7] https://git.kernel.org/pub/scm/git/git.git/commit/?id=1dbf0c0a
[8]
https://docs.google.com/document/d/1Jx0w867tVAht7QI1_prieiXg_iQ_nTloOyaIIOnm85g/edit?usp=sharing



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux