Re: GSoC - Designing a faster index format

elton sky <eltonsky9404@xxxxxxxxx> · Fri, 23 Mar 2012 21:27:32 +1100

Hi Nguyen, Jakub

Thank you for your explanations.

Just clarify question about track updated files:

On Fri, Mar 23, 2012 at 12:30 PM, Nguyen Thai Ngoc Duy
<pclouds@xxxxxxxxx> wrote:
> On Fri, Mar 23, 2012 at 3:32 AM, elton sky <eltonsky9404@xxxxxxxxx> wrote:
>> Got a few questions:
>>
>> 1. index is used for building next commit, so it should only include
>> files created/modified/deleted. But I see it has all entries for
>> current working dir. why?
>
> Jakub has answered this question.
>
>> 2. From read_index_from() I see the whole index is read into mem, and
>> write one by one (entry/ext) back to disk. This makes sense. But why
>> we have to compute Sha1 for all entries, especially unchanged entries?
>
> To catch disk corruption. If a bit is flipped anywhere in the index
> and we do not detect it, we may end up creating broken commits.
>
>> 3. how does git track updated files? Does it compare the ts between
>> working dir and index ? Or they are recorded somewhere?
>
> Check out refresh_cache_ent. At the beginning of most commands, they
> call refresh_index() or refresh_cache(), which checks a file's mtime
> against one stored in index (different means updated). In the worst
> scenario, refresh_cache_ent may call ce_compare_data(), which computes
> SHA-1 of the specified file and compare it with one stored in index.
>

This means working dir will compare each entry in index on mtime
field, to find out  if it's updated. The complexity for this operation
is O(nlogn). I assume the way of this checking is: it loops through
entries in the index, for each entry, it searches in working dir and
compare the mtime.

Because current index is a single steam of file, when it writes back
it has to write back everything sequentially. So we have to do
checksum for every entry. And I suppose this process is more time
consuming than previous step.

If we use a tree format, still, when looking for updated files, time
complexity is O(nlogn), i.e. we traverse the index entries and for
each entry we refer back to working dir. However, when we write index
back, we only need to recompute and write updated file nodes, but not
all entries. Total processing time benefit from here.

Please correct me if I am wrong.

-Elton

>> 4. When does git insert to cache tree? and when it retrieve from it?
>
> cache-tree is built from scratch in some cases, when we know HEAD (or
> some tree) matches index exactly (e.g. reset --hard). Usually it's
> only built up at commit time (update_main_cache_tree in
> builtin/commit.c).
> --
> Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html