Re: [GSoC] Designing a faster index format

Thomas Gummerer <t.gummerer@xxxxxxxxx> · Sun, 25 Mar 2012 03:28:25 +0200

I have now read through the methods and understand the cache_tree a bit 
better (thanks Thomas for the explanation :) ) and have rewritten my proposal 
draft. Any suggestions are very welcome.

Designing a faster index format

-- Problem --
The current git index is pretty slow when working with large git repositories,
because the whole index has to be rewritten for nearly every operation. For
example for adding a single file with git add, even though logically only one 
single blob sha-1 is changed in the index, git has to recompute the hash over 
the whole index and rewrite the index file.

-- Proposed solution --
The proposed solution is to rework the index, in order to make it possible to
have a complexity of O(log(n)), where n is the number of index entries, for 
simple operations like adding files. More complex operations shall also
benefit from the new index format, although O(log(n)). While it may not be
possible to get to a complexity of O(log(n)), while keeping the index format
still easy enough to parse for other programs that read the index, the amount of
data for which that the hash will be computed shall be log(n).

The problem will be solved in two phases. 

In phase one the pysical index format shall be replaced with the new index 
format, which doesn't require a full recalculation of the sha-1 hash and a full
rewrite to the disk. The new index format will be mapped to the current
in-memory structure, which will only be changed in phase two. For further 
optimization the cache-tree the sha-1 hashes shall be mapped to the index, which
should avoid cache-tree invalidations. In phase the savings will mainly be on
disk I/O.

In phase two the core parts of git would be made aware of the new index format,
changing the current in-memory structure. The in-memory structure will be left
untouched in phase one, to allow for more testing and to make sure there won't
be any problems with the new index. After this phase git takes full advantage
of the features of the new index.

Another way to speed up the hashing would be to exchange the SHA-1 hash for
something faster. That may however not be feasible when keeping the current
in-memory structure.

To make the project feasible for Google Summer of Code, the plan is to
implement phase one and if this goes smoother then expected and there is
still time left, phase two should also be implemented as bonus work.

-- Timeline --
24/04 - 12/05: Getting familiar with the old index
24/04 - 05/05: Decide on the data structure for the new index format and 
document it. Check the feasibility of exchanging SHA-1 with a faster hash
algorithm.
06/05 - 26/05: Map the current internal structure to the new index format.
27/05 - 23/06: Tie the cache-tree hashes to the index hashes.
24/06 - 21/07: Write the index to disk in the new format and make sure only the
changed parts are really written to disk.
/* Development work will be a bit slower from 18/06 to 21/07 because at my
 * University there are exams in this period. I probably will only be able to
 * work half the hours. I'll be back up to full speed after that. */
22/06 - 04/08: Parse the index from the disk to the current in-memory format.
05/08 - 13/08: Optimize the algorithm and profile the gains. 

-- Why git --
I'm using git since about 2-3 years and wanted to contribute to it earlier, but
couldn't find the time to do it. I would also like to continue contributing
once the Summer of Code is over.

-- About me --
I'm Thomas Gummerer (@tgummerer on Twitter, tgummerer on IRC), 21 years old
from Italy. I'm currently a 3rd year Bachelor student in Applied Computer 
Science at the Free University of Bolzano. I started programming in High School 
about 8 years ago with Pascal and then learned C and Java. For some of my
projects you can visit my homepage (http://tgummerer.com/projects), most of them
were for university and some personal projects I did in my free time. My blog is
also on the same homepage, but not really active. Unfortunately I couldn't yet 
participate in any bigger open source project, although I'm interested in
it basically since I started programming. 
On Mar 23, 2012, at 11:10 AM, Thomas Rast wrote:

> Hi Thomas,
> 
> Thomas Gummerer <t.gummerer@xxxxxxxxx> writes:
> 
>> 24/04 - 12/05: Getting familiar with the old index
> 
> I am aware that this falls into the period for "getting up to speed",
> however I think if you spend a day or so *now* reading through
> read_index, write_index and the relevant docs.  We can then give you the
> brief rundown on cache_tree, and you'll have a much clearer picture of
> how things fall together.
> 
> I think that should lead to a lot of enlightenment and a greatly
> improved proposal.
> 
> - Thomas
> 
> -- 
> Thomas Rast
> trast@{inf,student}.ethz.ch

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html