Re: [GSoC] Designing a faster index format

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Thomas

Thomas Gummerer <t.gummerer@xxxxxxxxx> writes:

> -- Proposed solution --
> The proposed solution is to rework the index, in order to make it possible to
> have a complexity of O(log(n)), where n is the number of index entries, for 
> simple operations like adding files. More complex operations shall also
> benefit from the new index format, although a complexity of O(log(n)) might not
> be reached, while keeping the index format easy enough to parse for .git reading
> programs. The amount of data for which that the hash will be computed however
> shall be log(n). N is the number of entries in the index.

Can you phrase this in terms of three variables: the number of entries
'n' in the index, the number of entries 'k' (or in the case of
extensions, the amount of data) that must be changed, and the size of
the index 's'?

Bear in mind that the current code uses a lock-write-rename scheme and
I'm not sure we can (or for that matter want to) change that.

That should clear up the picture somewhat and also answer Junio's concerns.

> In phase one the pysical index format shall be replaced with the new index 
> format, which does neither require a full recalculation of the sha-1 hash nor a
> full rewrite of the index to the disk. The new index format will be mapped to 
> the current in-memory structure, which will only be changed in phase two. For 
> further optimization the cache-tree sha-1 hashes shall be mapped to the new 
> index format, which should avoid cache-tree invalidations. In phase one the 
> savings will mainly be on disk I/O and commands that usually cause a lot of 
> cache-tree invalidations. To ensure backward compatibility, git shall keep the
> ability able to read version 2/3 of the index. The user shall also have the 
> possibility to configure git to write the index in either the new or the old 
> format. While this will produce some code overhead, it will make the life of git 
> users which don't use core git exclusively easier in the transition phase. 
> If the user sets the write format to the new format and the repository is a 
> already existing version 2/3 repository, the old index will be transformed to 
> the new format.

Can you split this somewhat?  My wall-of-text-detector is complaining.

- Backwards compatibility: good.  What happens after stage two, are we
  still going to support writing the old format?  If not, why?

- cache-tree: I'm not sure you can skip the invalidations, at least not
  without the tie-in.  We primarily want to be able to change the
  cache-tree data -- or for that matter any extension data -- cheaply
  without rewriting (or rehashing, see the s/n/k issue) the entire
  index.

- cache-tree tie-in: I realize I tossed this in the air without any
  hints on how you could achieve it, but since you also allocate quite
  some development time to it I think you should outline a plan on how
  to do this.

- I'm sure you can find another point :-)

> Another way to speed up the hashing would be to exchange the SHA-1 hash for
> something faster. That may however not be feasible while keeping the current
> in-memory structure.

Hum, if you look where and over what this hash is currently computed,
you'll see what the right answer to that last non-question is...

> 24/04 - 12/05: Getting familiar with the old index

What Junio said.  On top of that, I think this shouldn't take three
weeks even if you still spend some time within the project.

> 24/06 - 21/07: Write the index to disk in both the old and the new format
> depending on the choice of the user and make sure only the changed parts are 
> really written to disk in the new format.

What's interesting here is how you determine the changed part.  Are you
planning to change all users to properly mark changed entries, so that
everyone benefits (at a huge amount of code churn) or do you want to
take another approach?

> /* Development work will be a bit slower from 18/06 to 21/07 because at my
>  * University there are exams in this period. I probably will only be able to
>  * work half the hours. I'll be back up to full speed after that. */

Out of curiosity, what courses are you taking exams in?

> 22/06 - 04/08: Parse the index from the disk to the current in-memory format.
> 05/08 - 10/08: Make sure the converstion from the old format to the new format
> works without problems.
> 11/08 - 13/08: Test the new index and profile the gains compared to the old
> format.

I see this is partially overlaps the entries before it.  What's the plan here?

-- 
Thomas Rast
trast@{inf,student}.ethz.ch
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]