As suggested on IRC, here is my first draft proposal. Should the proposal include more more technical material already, or just the general sketch of the project? Designing a faster index format -- Problem -- The current git index is pretty slow when working with large git repositories, because the whole index has to be rewritten for nearly every operation. For example for adding a single file with git add, even though only one single blob sha-1 is changed in the index, the way it's currently implemented git will need to recompute a new hash over the whole index. -- Proposed solution -- The proposed solution is to rework the index, in order to make it possible to have a complexity of O(log(n)), where n is the number of index entries, for simple operations like adding files. The data format will also be easily parsable in order to make it easier for programs, that read the index (like jgit or libgit2), to adapt to the new format. The solution would consist of two steps. In the first step the index on the filesystem would be replaced, but not mmap'd for direct access, which would mainly save on I/O operations. In the second step the core parts would also be made aware of the new index, as it would be mmap'd for direct access. Another way to speed up the hashing would be to exchange the SHA-1 hash for something faster. That may however not be feasible when keeping the current in-memory structure. To make the project feasible for Google Summer of Code, the plan is to implement the first part and if this goes smoother then expected and there is still time left, the second part should also be implemented as extra work. -- Timeline -- 24/04 - 12/05: Getting familiar with the old index 13/05 - 26/05: Come up with a exact plan for the new index and document it. Check feasibility of exchanging SHA-1 with a faster checksum algorithm. 27/05 - 30/06: Writing the new index format to disk and making it only rewrite the parts that really need to be changed (Should be less writes then currently necessary). /* Development work will be a bit slower from 18/06 to 21/07 because at my * University there are exams in this period. I probably will only be able to * work half the hours. I'll be back up to full speed after that. */ 31/06 - 28/07: Parse the index from the disk to the current in-memory format. 29/07 - 13/08: Optimize the algorithm and profile the gains. -- Why git -- I'm using git since about 2-3 years and wanted to contribute to it earlier, but couldn't find the time to do it. I would also like to continue contributing once the Summer of Code is over.-- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html