Re: [PATCH/RFC v2 0/16] Introduce index file format version 5

Nguyen Thai Ngoc Duy <pclouds@xxxxxxxxx> · Tue, 7 Aug 2012 19:16:25 +0700

On Tue, Aug 7, 2012 at 12:46 AM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
> The index_ops abstraction is a right way to go, and I like it, but I
> think the split illustrated in this patch might turn out to be at
> wrong levels (and it is OK, as I understand this is a illustration
> of concept patch).
>
> For example, add_to_index() interface may be a good candidate to
> have in index_ops.  Because your in-core index may not be holding
> everything in a flat array, "find the location in the flat array the
> entry would sit, replace the existing one if there is any, otherwise
> insert" cannot be a generic way to add a new entry.  If you make the
> whole thing an abstract API entry point, a sparse implementation of
> the in-core index could still implement it without bringing the
> untouched and irrelevant parts of the index to core.
>
>         Side note: with a tree-like implementation of the in-core
>         index, "find the location the entry would sit", "get the
>         entry at the location", "insert the entry at the location",
>         could still be a set of good abstract API, though.  The
>         definition of _location_ may be quite different from "the
>         offset of the entry counting from the beginning of a flat
>         array", which is what index_name_pos() returns.
>
> The story is the same on the removal front.  The current
> remove_index_entry_at() interface is tied to the flat array
> implementation, so "remove the nth entry from the beginning" is an
> inappropriate interface for anything but such an implementation
> (unless we come up with an abstract notion of the "location" that is
> usable efficiently in a tree-like implementation, that is).

add_to_index and remove_index_entry_at seem good places for the cut.
But do we need to redefine the location? I think we need to sketch out
a long term plan first. In my mind it's like this:

 - for 3-5 years since v5 is released, we support v2 and v5 in
parallel. Other code can take advantage of v5, but it must neither
sacrifice v2 performance, compatibility nor maintainability
 - after that, we deprecate v2. v2 is automatically converted to v5 in
memory. v2 perf may suffer but at that point we don't care any more as
the majority of users should have been migrated to v5 (*)

If the long term plan is actually that, we will need to stick to flat
array implementation for forseeable future as moving from it most
likely impacts v2 performance. When v5 is used, it must maintain two
views, tree and list, at the same time. We can then postpone thinking
about the redefinition until v2 is deprecated and in-core moved to
tree view only.

This might not be the best way forward as v2 incompatible features
(like keeping empty directories in index, what else?) may never come
until v2 is deprecated.

(*) this is questionable though. Depending on the benchmarks, we may
want to support both v2 and v5 for indefinite time with v2 recommended
for small projects and v5 the rest. If it's so, yeah we need to think
of better API now.

> I wish that the development of this topic was done more in a
> top-down direction, instead of bottom-up, so that it identified the
> necessary access patterns to the in-core index early and come up
> with a good set of abstract API first, and then only after that is
> done, came up with in-core and on-disk format to support the
> necessary operations.

Yeah, which is why I asked to try out partial reading/writing early as
I'm a learn by example kind of guy. Speaking of which, now that we
have something substantial, what should be done before this may be
considered for 'next'?

I don't think we should wait until it reaches full potential (i.e.
significant perf gain from all major index-related commands). Apart
from patch preparation, more testing and benchmarking, should we wait
until we get new public API or just use current index API?

One API addition that I (if nobody else) will do soon is
read_index_partial(<pathspec>) and adapt as many read-only commands as
possible to it. (v2 just ignores the pathspec input and loads the
whole thing, so all commands must be aware the the loaded may be more
than what they asked). But this can wait until v5 gets in.
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html