[git wiki PATCH 2/3] "Designing a faster index format" project

---
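Note for reviewers (not part of the patch): the log(n) goals in the
text below can be illustrated with a hash tree, where changing one
entry only rehashes the path from that leaf to the root.  What follows
is a minimal sketch under stated assumptions, not a proposed format:
`N_LEAVES`, `struct hash_tree`, `tree_build` and `tree_update` are
hypothetical names, the entry count is fixed, and 64-bit FNV-1a stands
in for SHA-1 only to keep the example self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N_LEAVES 8	/* hypothetical fixed entry count; must be a power of two */

/* Stand-in for SHA-1 (assumption): 64-bit FNV-1a, NOT cryptographic. */
uint64_t fnv1a(const void *data, size_t len)
{
	const unsigned char *p = data;
	uint64_t h = 14695981039346656037ULL;

	while (len--) {
		h ^= *p++;
		h *= 1099511628211ULL;
	}
	return h;
}

/* Hash of an inner node is the hash of its two children. */
uint64_t hash_pair(uint64_t left, uint64_t right)
{
	uint64_t pair[2] = { left, right };

	return fnv1a(pair, sizeof(pair));
}

/*
 * Implicit binary tree: node[1] is the root, node[N_LEAVES + i] holds
 * the hash of entry i, and node[0] is unused.
 */
struct hash_tree {
	uint64_t node[2 * N_LEAVES];
};

/* Full rebuild: hashes every inner node; returns the number of hashes. */
unsigned tree_build(struct hash_tree *t, const uint64_t *leaves)
{
	unsigned hashes = 0;
	size_t i;

	t->node[0] = 0;
	for (i = 0; i < N_LEAVES; i++)
		t->node[N_LEAVES + i] = leaves[i];
	for (i = N_LEAVES - 1; i >= 1; i--) {
		t->node[i] = hash_pair(t->node[2 * i], t->node[2 * i + 1]);
		hashes++;
	}
	return hashes;
}

/* Single-entry update: rehashes only the leaf-to-root path. */
unsigned tree_update(struct hash_tree *t, size_t leaf, uint64_t value)
{
	unsigned hashes = 0;
	size_t i;

	assert(leaf < N_LEAVES);
	i = N_LEAVES + leaf;
	t->node[i] = value;
	for (i /= 2; i >= 1; i /= 2) {
		t->node[i] = hash_pair(t->node[2 * i], t->node[2 * i + 1]);
		hashes++;
	}
	return hashes;
}
```

With 8 entries, `tree_build` computes 7 inner hashes while
`tree_update` computes 3, and both leave the same root for the same
leaf values.  A real format would also need to handle insertions and
deletions, which this sketch deliberately ignores.
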
 SoC-2012-Ideas.md |   41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/SoC-2012-Ideas.md b/SoC-2012-Ideas.md
index 145b379..59d1baf 100644
--- a/SoC-2012-Ideas.md
+++ b/SoC-2012-Ideas.md
@@ -99,3 +99,44 @@ computers is a plus.
 
 Proposed by: Thomas Rast
 Possible mentor(s): Thomas Rast
+
+Designing a faster index format
+-------------------------------
+
+Git is slow on repositories that have a very large number of files
+in any given tree, because it rewrites the index in full on almost
+every operation.  For example, even though _logically_
+`git add already_tracked_file` only changes a single blob SHA-1 in the
+index, Git will verify index correctness during loading and recompute
+the new hash during writing _over the whole index_.  It thus ends up
+spending a large amount of time simply on hashing the index.
+
+A carefully designed index format could help in several ways.  (For the
+complexity estimates below, let n be the number of index entries or
+the size of the index, which is roughly the same.)
+
+ * The work needed for something as simple as entering a new blob into
+   the index, which is possibly the most common operation in git
+   (think `git add -p` etc.), should be at most O(log(n)).
+
+ * The work needed for a more complex operation that changes the
+   number of index entries will have to be larger unless we get into
+   database land.  However, the amount of data that we SHA-1 over
+   should still be O(log(n)).
+
+ * It may be possible to store the cache-tree data directly as part of
+   the index, always keeping it valid, and using that to validate
+   index consistency throughout.  If so, this would be a big boost to
+   other git operations that currently suffer from frequent cache-tree
+   invalidation.
+
+Note that there are criteria other than speed: the format should also
+be as easy to parse as possible, so as to simplify work for the other
+.git-reading programs (such as jgit and libgit2).  For the same
+reason, you will also have to show a significant speed boost, as
+otherwise the break in compatibility is not worth the fallout.
+
+The programming work will be in C, as it replaces a core part of git.
+
+Proposed by: Thomas Rast
+Possible mentor(s): Thomas Rast
-- 
1.7.9.2.467.g7fee4
