GSoC 2016 | Proposal | Incremental Rewrite of git bisect

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hey!

I have prepared a proposal for Google Summer of Code 2016. I know this
is a bit late, but please try to give your comments and suggestions.
My proposal could greatly improve from this. Some questions:

1. Should I include more ways in which it can help windows?
2. Should I include the function names I intend to convert?
3. Is my timeline (a bit different) going to affect me in any way?

Here is a Google doc for my proposal.
https://docs.google.com/document/d/1stnDPA5Hs3u0a8sqoWZicTFpCz1wHP9bkifcKY13Ocw/edit?usp=sharing

For the people who prefer the text only version :

-----------------------------------------------------------------------------------

Incremental rewrite of Git bisect

About Me

Basic Information


Name           Pranit Bauva

University     IIT Kharagpur

Major            Mining Engineering

Email            pranit.bauva@xxxxxxxxx

IRC              pungi-man

Blog             http://bauva.in

Timezone     IST (UTC +5:30)

Background

I am a first year undergraduate in the department of Mining
Engineering at Indian Institute of Technology, Kharagpur. I am an open
source enthusiast. I am a part of Kharagpur Linux Users Group which is
basically a group of open-source enthusiasts. I am quite familiar with
C and I have been using shell for some time now and still find new
things about it everyday. I have used SVN when I was on Windows and
then I switched to Git when I moved to linux. Git seems like magic. I
always wanted to involve in the development process and Google Summer
of Code is an a awesome way to achieve it.


Abstract

Git bisect is a frequently used command which helps the developers in
finding the commit which introduced the bug. Some part of it is
written in shell script. I intend to convert it to low level C code
thus making them builtins. This will increase Git’s portability.
Efficiency of git bisect will definitely increase but it would not
really matter much as most of the time is consumed in compiling or
testing when in bisection mode but it will definitely reduce the
overhead IO which can make the heavy process of compiling relatively
lighter.


Problems Shell creates

System Dependencies

Using shell code introduces various dependencies even though they
allowing prototyping of the code quickly. Shell script often use some
POSIX utilities like cat, grep, ls, mkdir, etc which are not included
in non-POSIX systems by default. These scripts do not have access to
the git’s internal low level API. So even trivial tasks have to be
performed by spawning new process every time. So when git is ported to
windows, it has to include all the utilities (namely a shell
interpreter, perl bindings and much more).

Scripts introduce extra overheads

Shell scripts do not have access to Git’s internal API which has
excellent use of cache thus reducing the unnecessary IO of user
configuration files, repository index and filesystem access. By using
a builtin we could exploit the cache system thus reducing the
overhead. As compiling / testing already involves quite a number of
resources, it would be good if we could do our best to make more
resources available for that.

Potential Problems

Rewriting may introduce bugs

Rewriting the shell script to C might introduce some bugs. This
problem will be properly taken care of in my method of approach
(described below). Still this approach will definitely not guarantee
that the functionality of the new will be exactly similar to the old
one, though it will greatly reduce its possibility. The reviews
provided by the seniors in the git community would help a lot in
reducing bugs since they know the common bugs and how to work around
them. The test suite of git is quite nice which has an awesome
coverage.

Rewritten can be hard to understand

Git does not like having many external dependencies, libraries or
executables other than what is provided by git itself and the
rewritten code should follow this. C does not provide with a lot of
other facilities like text processing which shell does whose C
implementation often spans to multiple lines. C is also notorious for
being a bit “cryptic”. This problem can be compensated by having well
written documentation with well defined inputs, outputs and behavior.

A peek into git bisect

How does it help?

Git bisect helps the software developers to find the commit that
introduced a regression. Software developers are interested in knowing
this because a commit changes a small set of code (most time).  It is
much easier to understand and fix a problem when you know only need to
check a very small set of changes, than when you don’t know where to
look at it. It is not that the problem will be exactly in that commit
but it will be related to the behavior introduced in the commit.
Software bugs can be a nightmare when the code base is very large.
There would be a lot of sleepless night in figuring out the part which
causes the error. This is where git bisect helps. This is the one of
the most sought after tool in software development.

Git bisect algorithm (short description)

Git’s commits form a directed acyclic graph (DAG) which makes it a bit
difficult to find the bisection commit. The algorithm it uses is :

Step 1: Keeps only the ancestors of “bad” commit (including “bad” one)
and the commits that are not the ancestors of “good” commit (excluding
“good” one). This helps in reducing a lot of “interesting” commits.

Step 2: Starting from the good ends (or end), associate some kind of
“weight” which depends on its distance from the good commit also
considering some other factors.

Step 3: Find the midpoint of the max weight which will be the bisection commit.

Step 4: Repeat Step 1 but now with one more “good” commit which will
cause rearranging of “weights” thus finishing the algorithm which uses
binary search.

Further details about the algorithm can be found in this article
written by Christian Couder.

Current status of git bisect

The bisect code is currently spread across multiple files namely
git-bisect.sh, bisect.c, bisect.h, and bisect--helper.c . The
algorithmic part is written in bisect.c . git-bisect.sh acts like a
glue connecting parts of bisect.c with git-rev-list (which helps in
listing the commits in various types and formats), git-rev-parse
(which helps in knowing the SHA-1 checksum for the ref passed),
git-symbolic-refs (to read, modify and delete symbolic refs),
git-show-branch (which helps by showing the branches and their
respective commits), git-update-ref (to safely update the SHA-1
checksum stored in a ref safely), git-for-each-ref (to get information
on each ref with given pattern, format, keys and much more) and
git-check-ref-format (to check whether the ref has a proper format and
doesn’t contain some “special” characters which aren’t allowed) though
this is soon going to be replaced by using a sub command
`--check-term-format` of bisect--helper whose patch is in its final
stages. git-bisect.sh is responsible for writing the respective data
concerning the status on bisect to the various files namely
BISECT_LOG, BISECT_NAMES, BISECT_START, BISECT_ANCESTORS_OK,
BISECT_LOG, refs/bisect and those which would be included in the
directory once bisection moves further.

How will I go about with this project?

Understand related areas

In the start, I will focus on understanding the things that
git-bisect.sh depends on. I will try to debug that part of code and
understand it through documentation. I will also learn to use some
more parts of API which I am currently unfamiliar with concerning this
project. I will also gain more familiarity with running gdb sessions
for big projects.

Discussion about the future look of bisect

When converting from shell to C, there needs to be quite some amount
of changes rather than just plainly translating the code. This is the
part where I will discuss with the git community as to what parts they
wish they could change it or which are better changed in C. Since C
and shell are quite different, approaches which seem good in shell are
not that good in C and vice versa. So having a thorough discussion can
really help.

Convert each function individually

Then I will convert each function from shell script to a C builtin in
bisect--helper.c individually. This raises a very valid questions as
to how to test whether the function performs the tasks as intended
since tests suites are written for git-bisect.sh and not
bisect--helper.c . To take care for this problem, I will first convert
each function to its C builtin and then provide utility in
bisect--helper.c so that git-bisect.sh can easily communicate with it.
In this way, I can make sure that git-bisect.sh uses a corresponding C
builtin rather than origin shell function to run the tests which will
help in recognising for any errors (if any) which the patch can have.
To help git-bisect.sh communicate, I will make the functions as sub
commands in bisect--helper.c so I can ensure that bisect--helper can
react in a way which is independant of other sub commands.

Unification of bisect.c and bisect--helper.c

This will unify the algorithmic and non-algorithmic parts of bisect
bringing them under one heading to make the code clean.

Thorough testing of bisect

This will be dedicated for thorough deep testing for any problems
which crept in after the shell to C conversion. I will also aim to
include some more tests to improve coverage of bisect. If time
persists, I will dig further in examining the failing tests and try
and fix the bugs.

Optimizing

I will mainly try to find ways in which I can make the code run better
or in a nicer way.

Documenting

Since C is notorious for being a bit “cryptic”, the documentation can
really help the newcomers to ease into diving the codebase.

Goals

Incremental rewrite of `git-bisect.sh`

Get all the code written merged.

Some important points

My semester starts from 18th July so I will start working early just
to get reviews of the patches thoroughly and
code/test/optimize/document properly just to avoid end time hustle and
so that I can have some buffer time. Also, it would save me from the
extra burden once my academic session starts. If time permits then I
will work on Stephan Beyer’s patches which are related to bisect
algorithm and seem quite interesting to me.

Timeline (Tentative)

1st April - 22nd April

Spend time going to the codebase and get familiar with using gdb with
it. I will mostly concentrate on reading up bisect related areas as
mentioned above.

2nd May - 8th May (1st week)

Start discussion about the design of bisect and note down the points
which some developers want to be changed. Also see the test suite as
to what is the expected behavior of a particular bisect related
command.

9th May - 26th June (2nd-8th week)

I count 23 functions (-1 probably) currently in git-bisect.sh . Each
function should take on an average 2 days including code/test/review
and some functions are quite small. Also by this time I will be quite
familiar with the codebase, review process and conventions so I won’t
make many mistakes and the reviews can go faster. Still I have kept
some buffer time for just in case some bug comes (which is quite a
reasonable possibility).


27th June - 3rd July (9th week)

Testing - I will mostly test the parts of code that I have written.
Also I will explore more on the use of git bisect and try to improve
its coverage.


4th July - 18th July (10th - 11th week)

Bug fixing - I will try and fix up some leftover bugs which were
either introduced because of the conversion or which were already
present.

18th July - 24th July (12th week)

Optimizing the C code I have written. Document the code enough so that
it is easier for a newcomer to dive in.

25th July - 23rd August (left over time)

I will keep this as a buffer period if anything goes wrong with this
timeline. Also I wish to work on Stephan Beyer’s patches concerning
bisect which currently contain 1 leftover bug.


Micro-project attempted

I attempted a micro project which will add configurations for commonly
used command-line options (git commit -v). The link for the patch is
here. This is in its last stages and is soon expected to be complete.

Demo function conversion

As suggested by Johannes Schindelein <Johannes.Schindelein@xxxxxx>, I
have tried converting one function from shell to C, the function being
check_term_format() to have a taste of how to go about with this
project. Here is a link to the patch. This patch is in its final
stage. This exercise was quite helpful!

Plans for Summer

I currently don’t have any plans for this summer. Though my new
semester starts from 18th July. So I will start early.

Possible Mentors

Christian Couder <christian.couder@xxxxxxxxx>

Matthieu Moy <Matthieu.Moy@xxxxxxxxxxxxxxx>
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]