Re: [PATCH/RFC v2 0/16] Introduce index file format version 5

Junio C Hamano <gitster@xxxxxxxxx> · Wed, 08 Aug 2012 09:31:37 -0700

Nguyen Thai Ngoc Duy <pclouds@xxxxxxxxx> writes:

> OK how about this. The general idea is preserve/extend current flat
> index API and add a new (tree-based) one. Index users can use either.
> They can even mix them up (which they do because we can't just flip
> the API in one day for about 200 source files).
>
> The day that unpack_trees() is converted to tree API, I will declare
> v5 victory ;)

s/API, /API and benchmark says tree-shaped index is an overall win, /;

> = Cleanup =
>
> struct cache_entry becomes partly opaque. ce_ctime..ce_gid are hidden
> in -v2.c and -v5.c. We only expose ce_size, ce_flags, ce_namelen, name
> and sha1 to index users. Extra v5 fields like ce_stat_crc, next and
> dir_next are also hidden. These fields can be put in a real struct in
> read-cache.h, which is supposedly included by -v2.c and -v5.c

I do not particularly see a reason to keep different in-core
cache_entry representations even in an early round of the API
updates.  If v2 needs ctime and gid and v5 needs crc, keep both
fields for simplicity.  When coming from the filesystem, ctime, gid
and friends are immediately available and crc needs to be computed
only immediately before it is written out or it is compared with an
existing entry.

I also do not see a reason to keep two representations of in-core
index_state representations for that matter.

The current code that access nth entry from the index->cache[nth]
would need to be updated to use an accessor function, whether the
"nth" comes from index_name_pos() or from the for-loop that iterates
over the entire index.  For the latter, you would need to give the
users a function that returns a cursor into the in-core index to
allow iterating over it.

When you use an in-core representation that is not a flat array, the
type of "nth", which is essentially a cursor, may have to change to
something that is richer than a simple integer, in order to give the
implementation of the in-core index a more efficient way to access
the entry than traversing the leaves of the tree depth first, and
you would need to update index_name_pos() to return such a "cursor".
That design and development cost is part of updating the in-core
data structure. In the end result, the runtime cost to manipulate an
index entry that the cursor refers to should be minimum, as that
would be the cost paid by all the users of the API anyway, even if
we _were_ starting from an ideal world where there weren't any flat
in-core index in the first place.

Because the v2 on-disk format forces us to scan the whole thing at
least once, with a properly designed in-core representation, the
overall system would not suffer performance penalty when reading
from v2, as both the current code and the updated code have to read
everything, and accesses based on the cursor given by either
index_name_pos() or the index iterator has to be fast anyway (if the
latter does not hold true, your updated in-core representation that
is not a flat array needs to be rethought).

On top of such a solid foundation, we can map the updated in-core
representation to an on-disk representation with confidence, as any
performance improvement or degradation from that point on must be
solely attributable to the on-disk format difference.

Without such a foundation, it is hard to justify a different on-disk
format without handwaving, no?
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html