Re: [GSoC] Designing a faster index format

Junio C Hamano <gitster@xxxxxxxxx> · Mon, 26 Mar 2012 14:14:23 -0700

Thomas Gummerer <t.gummerer@xxxxxxxxx> writes:

> The proposed solution is to rework the index, in order to make it possible to
> have a complexity of O(log(n)), where n is the number of index entries, for 
> simple operations like adding files. More complex operations shall also
> benefit from the new index format, although a complexity of O(log(n)) might not
> be reached, while keeping the index format easy enough to parse for .git reading
> programs. The amount of data for which that the hash will be computed however
> shall be log(n). N is the number of entries in the index.

Where does log(N) come from, and is that a hard requirement?

Rephrasing your problem description, you want to address the inefficiency
that we have to write the full 1m entries to a new file in order to update
only one file when the index tracks 1m paths.  

Wouldn't the goal be more in line with that problem statement if you
instead aim to make the cost proportional to the number of entries that
are updated, regardless of how many entries exist in the index?

> In phase one the pysical index format shall be replaced with the new index 
> format, which does neither require a full recalculation of the sha-1 hash nor a
> full rewrite of the index to the disk. The new index format will be mapped to 
> the current in-memory structure, which will only be changed in phase two. For 
> further optimization the cache-tree sha-1 hashes shall be mapped to the new 
> index format, which should avoid cache-tree invalidations.

It is unclear what you meant by the last sentence.  The cache-tree
invalidation is a critical part of the machinery that allows later
write-tree to reuse the already computed subtree object names, without
having to recompute subtree object names until they are really necessary
(i.e. when writing a tree out).  By "avoiding" it, are you going to make
write-tree always recompute all the subtree object names?  Or are you
planning to keep the cached tree information always up to date by
recomputing subtree object names and keeping them in the index even when
they are not yet needed?  If the latter, how would you handle when a part
of the index contains unmerged entries?

Right now, the code that updates the in-core index records "Is the in-core
index clean, or modified?" bit and nothing else.  Without updating the
in-core structure and the codepaths that access it, how is it possible for
your phase I to selectively write out only the modified (either added,
updated or removed) part of it?  In other words, I do not think it is
realistic to expect that the core parts to stay oblivious to the new index
semantics during the "phase one".

> -- Timeline --
> 24/04 - 12/05: Getting familiar with the old index

It might make more sense to write the "proposed solution" *after* this
step is done; otherwise you wouldn't even know the problems you are trying
to solve.  That may mean that this part of the schedule may need to be
shortened or started way before the beginning of GSoC.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html