Re: [GSoC] Designing a faster index format

Thomas Gummerer <t.gummerer@xxxxxxxxx> · Tue, 27 Mar 2012 13:08:30 +0200

On Mar 26, 2012, at 11:14 PM, Junio C Hamano wrote:

> Thomas Gummerer <t.gummerer@xxxxxxxxx> writes:
> 
>> The proposed solution is to rework the index, in order to make it possible to
>> have a complexity of O(log(n)), where n is the number of index entries, for 
>> simple operations like adding files. More complex operations shall also
>> benefit from the new index format, although a complexity of O(log(n)) might not
>> be reached, while keeping the index format easy enough to parse for .git reading
>> programs. The amount of data for which that the hash will be computed however
>> shall be log(n). N is the number of entries in the index.
> 
> Where does log(N) come from, and is that a hard requirement?

The log(n) comes from the plan to implement the index in a tree-like structure. It
isn't a hard requirement, but it certainly shouldn't be much slower than that, otherwise
the change wouldn't make sense.

> Rephrasing your problem description, you want to address the inefficiency
> that we have to write the full 1m entries to a new file in order to update
> only one file when the index tracks 1m paths.  
> 
> Wouldn't the goal be more in line with that problem statement if you
> instead aim to make the cost proportional to the number of entries that
> are updated, regardless of how many entries exist in the index?

Making the cost proportional to the number of entries that are updated would
be the superior solution, but probably not possible to achieve.

> 
>> In phase one the pysical index format shall be replaced with the new index 
>> format, which does neither require a full recalculation of the sha-1 hash nor a
>> full rewrite of the index to the disk. The new index format will be mapped to 
>> the current in-memory structure, which will only be changed in phase two. For 
>> further optimization the cache-tree sha-1 hashes shall be mapped to the new 
>> index format, which should avoid cache-tree invalidations.
> 
> It is unclear what you meant by the last sentence.  The cache-tree
> invalidation is a critical part of the machinery that allows later
> write-tree to reuse the already computed subtree object names, without
> having to recompute subtree object names until they are really necessary
> (i.e. when writing a tree out).  By "avoiding" it, are you going to make
> write-tree always recompute all the subtree object names?  Or are you
> planning to keep the cached tree information always up to date by
> recomputing subtree object names and keeping them in the index even when
> they are not yet needed?  If the latter, how would you handle when a part
> of the index contains unmerged entries?

The plan was to do the latter, if it is feasible to keep them always up to date,
without slowing other commands down. For the unmerged entries, I don't
have a plan yet, but I'll try to come up with one, when understanding the
source code better.

> Right now, the code that updates the in-core index records "Is the in-core
> index clean, or modified?" bit and nothing else.  Without updating the
> in-core structure and the codepaths that access it, how is it possible for
> your phase I to selectively write out only the modified (either added,
> updated or removed) part of it?  In other words, I do not think it is
> realistic to expect that the core parts to stay oblivious to the new index
> semantics during the "phase one".

When it's impossible to write only the modified part of the index without
changing the in-core parts, the in-core parts will only changed in the
minimal possible way, adding only a flag for the changed parts, so only
those can be effectively written to disk.

>> -- Timeline --
>> 24/04 - 12/05: Getting familiar with the old index
> 
> It might make more sense to write the "proposed solution" *after* this
> step is done; otherwise you wouldn't even know the problems you are trying
> to solve.  That may mean that this part of the schedule may need to be
> shortened or started way before the beginning of GSoC.

Yes, I'm currently on this, reading through documentation and source code.

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html