Re: [GSoC] Designing a faster index format

Thomas Gummerer <italyhockeyfeed@xxxxxxxxx> · Mon, 2 Apr 2012 23:02:39 +0200

After taking into consideration Junios and Duys comments, and some more
discussion on IRC, I have written a new draft of the proposal. The proposed
solution is based on a tree-structure, as the append only structure penalizes
the read path, which after doing some measurements isn't the right tradeoff.

Included in this proposal are also the index formats that were taken into
consideration but scrapped for one or another reason.

First I'd like to thank all those who helped me by discussing the ideas here
on the mailing list or over at the #git-devel irc channel. 

-- Credits --
Big thanks for discussing the ideas with me on IRC goes to charon, jast,
mhagger, shruggar, GitZilla, andrew_sayers and barrbrain. Hope I didn't forget 
anyone. Thanks also to those who discussed my ideas with me on the mailing list
(Nguyen Thai Ngoc Duy, Thomas Rast, Junio C Hamano)

And here is the proposal:
Designing a faster index format

-- Problem --
The current git index is pretty slow when working with large git repositories,
because the whole index has to be rewritten for nearly every operation. For
example for adding a single file with git add, even though logically only one 
single blob sha-1 is changed in the index, git has to recompute the hash over 
the whole index and rewrite the index file. In addition to that the speed up 
for writing the index can not come to the expense of the read operation, since 
it is executed more often throughout a normal developers day compared to
writing the index. The current index also doesn't allow simple partial reading,
which could help speed up some commands, like git status and git diff, which 
often don't need the whole index.

-- Proposed solution --
The proposed solution is to redesign the index to a B-tree based format. This
allows changes to the index in O(log(n)) time, with n being the number of
entries in the index. 

The sha1 hash, to verify the index isn't corrupted, will have to be computed 
over log(s) data in the worst case, where s is the size of the index, which 
will be bigger then the number of entries because of the structure of the 
b-tree.

The new index format will also save on the size of each entry, by storing each
path only once, and using it for all files in the same path. This will reduce
the amount of data we have to calculate the hash over, when both reading and
writing the index, thus having faster operations especially on repositories
with deep directory structures.

The tree structure will also allow for fast partial loading of the index. Since
every bucket of the tree will have it's own hash, the hash will only need to be
calculated for the parts we load. This reduces the amount of data for which the
hash will need to be calculated. Commands like git status and git diff will
benefit from this. The benefit could be fully explored by a implementation with 
inotify, of which Thomas Rast created a POC. 
(https://github.com/trast/git/commit/6c9825fdca76d01fb5a83923558831743ae477bc)

In order to be able to only rewrite the parts of the index that really changed,
the current lock file structure will have to be changed. The new lock file
will keep a journal of the changes, to be able to recover in case of a crash.
Deleting the lockfile "commits" the changes.

The in-memory structure will be modified in two steps. In the first step it
will only be modified to keep track of the changes, in order to be able to only
write the changes to disk and eliminate the need to rewrite the whole tree. In 
the second step the in-memory structure will be changed to reflect the tree 
format, getting the full benefits from the new format.

To ensure backward compatibility, git shall keep the ability to read version 
2/3 of the index. The user shall also have the possibility to configure git to 
write the index in either the new or the old format. While this will produce 
some code overhead, it will make the life of git users which don't use core git 
exclusively easier in the transition phase. If the user sets the write format to
the new format and the repository is a already existing version 2/3 repository, 
the old index will be transformed to the new format. Transformations in the 
other direction will also be possible, since the in-memory format is the same.

Once the in-memory structure is changed, making git backward compatible will
still be possible, even though it will come at the expense of the 
reading/writing time of the old index, since the tree will have to be 
constructed in memory from the on disk format, and transformed back to the flat
format when writing it back to disk.

To make the project feasible for Google Summer of Code the in-meory structure
will only be modified to keep track of the changes, not making it 
tree-structured yet and the partial loading will not be implemented.

-- Solutions that were also considered --
These solutions will not be described in as much detail as the tree-structure,
but they were also considered in the process of chosing the best index format
and will be described below.

- Append-only data structure
An append-only data structure will allow for changes to the index in O(k) time, 
with k being the number of files that changed. To reach this goal, the first 
part of the index will be sorted and changes will always be written to the end, 
in the order in which the changes occured. This last part of the index will be 
merged with the rest using a heuristic, which will be the execution of git 
commit and git status.

To make sure the index isn't corrupted, without calculating the sha1 hash for
the whole index file every time something is changed, the hash is always
calculated for the whole index when merging, but when only a single entry is 
changed the sha-1 hash is only calculated for the last change. This will 
increase the cost for reading the index to log(n) + k * log(k) where n is the 
number of entries in the sorted part of the index and k is the number of entries
in the unsorted part of the index, which will have to be merged with the rest 
of the index.

The index format shall also save on file size, by storing the paths only once,
which currently are stored for every file. This can make a big difference
especially for repositories with deep directory structures. (e.g. in the webkit
repository this change can save about 10MB on the index). In addition to that it
can also give the index a structure, which in turn makes it easier to search
through. Except for the appended part, but that should never grow big enough to
make a linear search through it costly.

With this index format the lock file could be a simple empty file, since if a
operation fails, only the last entry would be corrupted making it easy to
recover.

The current in-memory structure of git would only have to be slightly changed,
adding a flag or keeping a list of the changed files. The rest of it could be
changed in an other step, to better represent the new structure of the index.

To ensure backward compatibility, git shall keep the ability able to read 
version 2/3 of the index. The user shall also have the possibility to configure
git to write the index in either the new or the old format. While this will
produce some code overhead, it will make the life of git users which don't use 
core git exclusively easier in the transition phase. If the user sets the write 
format to the new format and the repository is a already existing version 2/3 
repository, the old index will be transformed to the new format. 

This idea was dropped, because as mentioned in the problem description reads
are more common then writes and therefore trading write speed for read speed
is not a good tradeoff.

- Database format
A database format as index structure will allow for changes to the index in
O(log(n)) time for a single change, with n always being the number of entries
in the index, assuming a b-tree index on the path in the database.

The check for the index corruption could be left to the database, which would
have about the same cost as the check in the tree-structure, in the read and
write direction. Partial loading with the right indices would also have the 
same cost as the tree-structure, by executing a simple select query.

The drawback of this solution however would be the introduction of a database
library (one could also write one, but that certainly would be overkill).
Another drawback would be that it's harder to read for programs like libgit2
and jgit and the codebase certainly wouldn't get any cleaner.

- Padded structure
Another index format that was taken into consideration was to create a padded
structure. Basically that would be an append-only like structure, but split into
sections, where every section would leave some space at the end for appending
changes. This would leave us with a complexity of O(k) where k is the number of
sections. 

This structure will bring advantages for faster partial loading, when splitting
the sections in the right way. 

The structure was however only briefly taken into consideration, since it would
have close to no advantages (except for the partial loading) compared to the
append-only structure, would make the index file larger due to the spaces for
appending and have the same drawbacks as the append-only structure.

-- Timeline --
24/04 - 01/05: Document the new index format.
02/05 - 21/05: Map the current internal structure to the new index format.
22/05 - 07/06: Change the current in-memory structure to keep track of the
changed files.
08/06 - 16/06: Write the index to disk in both the old and the new format
depending on the choice of the user and make sure only the changed parts are 
really written to disk in the new format.
17/06 - 21/07: Parse the index from disk to the current in-memory format.
/* Development work will be a bit slower from 18/06 to 21/07 because at my
 * University there are exams in this period. I probably will only be able to
 * work half the hours. I'll be back up to full speed after that. */
22/07 - 10/08: Read the new structure and map it to the current in meory format.
Make sure the old format still gets read correctly.
11/08 - 13/08: Test the new index and profile the gains compared to the old
format.

-- Why git --
I'm using git since about 2-3 years and wanted to contribute to it earlier, but
couldn't find the time to do it. I would also like to continue contributing
once the Summer of Code is over.

-- About me --
I'm Thomas Gummerer (@tgummerer on Twitter, tgummerer on IRC), 21 years old
from Italy. I'm currently a 3rd year Bachelor student in Applied Computer 
Science at the Free University of Bolzano. I started programming in High School 
about 8 years ago with Pascal and then learned C and Java. For some of my
projects you can visit my homepage (http://tgummerer.com/projects), most of them
were for university and some personal projects I did in my free time. My blog is
also on the same homepage, but not really active. Unfortunately I couldn't yet 
participate in any bigger open source project, although I'm interested in
it basically since I started programming. 

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html