Re: [GSoC] Designing a faster index format

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



As suggested on IRC, here is my first draft proposal. Should the proposal 
include more more technical material already, or just the general sketch of 
the project?

Designing a faster index format

-- Problem --
The current git index is pretty slow when working with large git repositories,
because the whole index has to be rewritten for nearly every operation. For
example for adding a single file with git add, even though only one single blob
sha-1 is changed in the index, the way it's currently implemented git will need
to recompute a new hash over the whole index.

-- Proposed solution --
The proposed solution is to rework the index, in order to make it possible to
have a complexity of O(log(n)), where n is the number of index entries, for 
simple operations like adding files. 

The data format will also be easily parsable in order to make it easier for
programs, that read the index (like jgit or libgit2), to adapt to the new 
format.

The solution would consist of two steps. In the first step the index on the
filesystem would be replaced, but not mmap'd for direct access, which would 
mainly save on I/O operations. In the second step the core parts would also be
made aware of the new index, as it would be mmap'd for direct access.

Another way to speed up the hashing would be to exchange the SHA-1 hash for
something faster. That may however not be feasible when keeping the current
in-memory structure.

To make the project feasible for Google Summer of Code, the plan is to
implement the first part and if this goes smoother then expected and there is
still time left, the second part should also be implemented as extra work.

-- Timeline --
24/04 - 12/05: Getting familiar with the old index
13/05 - 26/05: Come up with a exact plan for the new index and document it.
Check feasibility of exchanging SHA-1 with a faster checksum algorithm.
27/05 - 30/06: Writing the new index format to disk and making it only rewrite 
the parts that really need to be changed (Should be less writes then currently
necessary).
/* Development work will be a bit slower from 18/06 to 21/07 because at my
 * University there are exams in this period. I probably will only be able to
 * work half the hours. I'll be back up to full speed after that. */
31/06 - 28/07: Parse the index from the disk to the current in-memory format.
29/07 - 13/08: Optimize the algorithm and profile the gains. 

-- Why git --
I'm using git since about 2-3 years and wanted to contribute to it earlier, but
couldn't find the time to do it. I would also like to continue contributing
once the Summer of Code is over.--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]