Thomas Gummerer <t.gummerer@xxxxxxxxx> writes: > The proposed solution is to rework the index, in order to make it possible to > have a complexity of O(log(n)), where n is the number of index entries, for > simple operations like adding files. More complex operations shall also > benefit from the new index format, although a complexity of O(log(n)) might not > be reached, while keeping the index format easy enough to parse for .git reading > programs. The amount of data for which that the hash will be computed however > shall be log(n). N is the number of entries in the index. Where does log(N) come from, and is that a hard requirement? Rephrasing your problem description, you want to address the inefficiency that we have to write the full 1m entries to a new file in order to update only one file when the index tracks 1m paths. Wouldn't the goal be more in line with that problem statement if you instead aim to make the cost proportional to the number of entries that are updated, regardless of how many entries exist in the index? > In phase one the pysical index format shall be replaced with the new index > format, which does neither require a full recalculation of the sha-1 hash nor a > full rewrite of the index to the disk. The new index format will be mapped to > the current in-memory structure, which will only be changed in phase two. For > further optimization the cache-tree sha-1 hashes shall be mapped to the new > index format, which should avoid cache-tree invalidations. It is unclear what you meant by the last sentence. The cache-tree invalidation is a critical part of the machinery that allows later write-tree to reuse the already computed subtree object names, without having to recompute subtree object names until they are really necessary (i.e. when writing a tree out). By "avoiding" it, are you going to make write-tree always recompute all the subtree object names? Or are you planning to keep the cached tree information always up to date by recomputing subtree object names and keeping them in the index even when they are not yet needed? If the latter, how would you handle when a part of the index contains unmerged entries? Right now, the code that updates the in-core index records "Is the in-core index clean, or modified?" bit and nothing else. Without updating the in-core structure and the codepaths that access it, how is it possible for your phase I to selectively write out only the modified (either added, updated or removed) part of it? In other words, I do not think it is realistic to expect that the core parts to stay oblivious to the new index semantics during the "phase one". > -- Timeline -- > 24/04 - 12/05: Getting familiar with the old index It might make more sense to write the "proposed solution" *after* this step is done; otherwise you wouldn't even know the problems you are trying to solve. That may mean that this part of the schedule may need to be shortened or started way before the beginning of GSoC. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html