Re: GSoC - Designing a faster index format

On Fri, Mar 23, 2012 at 5:27 PM, elton sky <eltonsky9404@xxxxxxxxx> wrote:
> On Fri, Mar 23, 2012 at 12:30 PM, Nguyen Thai Ngoc Duy
> <pclouds@xxxxxxxxx> wrote:
>> On Fri, Mar 23, 2012 at 3:32 AM, elton sky <eltonsky9404@xxxxxxxxx> wrote:
>>> 3. How does git track updated files? Does it compare the timestamps
>>> between working dir and index? Or are they recorded somewhere?
>>
>> Check out refresh_cache_ent. At the beginning of most commands, they
>> call refresh_index() or refresh_cache(), which checks a file's mtime
>> against the one stored in the index (a difference means the file was
>> updated). In the worst case, refresh_cache_ent may call
>> ce_compare_data(), which computes the SHA-1 of the specified file and
>> compares it with the one stored in the index.
>>
>
> This means the mtime field of each entry in the index is compared
> against the working dir to find out if the file was updated. The
> complexity of this operation is O(nlogn). I assume the check works
> like this: it loops through the entries in the index and, for each
> entry, looks up the file in the working dir and compares the mtime.
>
> Because the current index is a single stream in one file, when it is
> written back everything has to be written sequentially. So we have to
> recompute the checksum for every entry. And I suppose this process is
> more time consuming than the previous step.

The previous step is pretty fast on Linux in the hot-cache case. I
don't think we need to care about that. Some commands only care about a
subdirectory (for example "git diff -- path/to/here") and only refresh
entries within that subdirectory, which further reduces refresh cost.
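To make the refresh step concrete, here is a minimal sketch of the
mtime comparison, with an illustrative struct (git's real cache entry
lives in cache.h and also compares ctime, size, inode and more):

```c
#include <assert.h>
#include <string.h>
#include <sys/stat.h>
#include <time.h>

/* Illustrative stand-in for git's cache entry; the real one stores
 * many more stat fields and compares them all. */
struct cache_entry {
	char path[256];
	time_t mtime;	/* mtime recorded when the entry was written */
};

/* Return 1 if the working-tree file looks modified, i.e. its current
 * mtime differs from the one recorded in the index entry. */
static int ce_looks_modified(const struct cache_entry *ce)
{
	struct stat st;
	if (lstat(ce->path, &st) < 0)
		return 1;			/* missing counts as changed */
	return st.st_mtime != ce->mtime;	/* differ => maybe changed */
}
```

When the mtimes differ, this is only a hint that content may have
changed; that is when the expensive ce_compare_data() fallback kicks in.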

Which reminds me: we cannot abandon the current index format. Users
should be allowed to choose which format to use. It may be hard to keep
the code supporting two formats while still taking advantage of the new
one. Maybe you could internally convert the old format to the new one
in memory, so that git code only has to deal with one format, but that
adds cost when the old format is used. I don't know..
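The convert-on-read idea could look roughly like this; everything here
(struct and function names, version numbers for the new format) is
made up for illustration, not git's actual code:

```c
#include <stdint.h>

/* Hypothetical unified in-memory index: both on-disk formats would be
 * parsed into this one structure, so the rest of git never sees which
 * format was on disk. */
struct index_state {
	int version;
	int nr_entries;
};

/* Stub readers; real ones would parse the respective on-disk layout. */
static int read_index_v2(struct index_state *is)   { is->nr_entries = 0; return 0; }
static int read_index_tree(struct index_state *is) { is->nr_entries = 0; return 0; }

/* Dispatch on the version found in the index header. */
static int read_index(struct index_state *is, uint32_t disk_version)
{
	is->version = (int)disk_version;
	switch (disk_version) {
	case 2:		/* current flat format */
		return read_index_v2(is);
	case 5:		/* hypothetical tree-based format */
		return read_index_tree(is);
	default:
		return -1;	/* unknown format */
	}
}
```

The cost Duy mentions shows up in read_index_v2(): users of the old
format pay for a full conversion into the new in-memory layout on every
load.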

> If we use a tree format, the time complexity of looking for updated
> files is still O(nlogn), i.e. we traverse the index entries and for
> each entry refer back to the working dir. However, when we write the
> index back, we only need to recompute and write the updated file
> nodes, not all entries. The total processing time benefits from that.

Yes. And if you compute a checksum per entry, rather than over the
whole file, then when you read an entry you only need to verify the
checksum of that entry (and of the index header, of course). That
reduces reading cost, but it increases space (20 bytes per entry if you
stick with SHA-1). Maybe we could do a checksum per group (or per tree)
instead, as a trade-off between space and time.
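The per-group trade-off can be sketched as follows. FNV-1a stands in
for SHA-1 here only to keep the sketch dependency-free; a real index
would use SHA-1 (the 20 bytes per entry discussed above) or similar:

```c
#include <stddef.h>
#include <stdint.h>

/* FNV-1a: a tiny non-cryptographic hash, standing in for SHA-1. */
static uint64_t fnv1a(const void *buf, size_t len)
{
	const unsigned char *p = buf;
	uint64_t h = 1469598103934665603ULL;
	while (len--) {
		h ^= *p++;
		h *= 1099511628211ULL;
	}
	return h;
}

/* Illustrative index entry. */
struct entry {
	char path[32];
	uint64_t mtime;
};

/* Per-group checksum: one hash covers n entries. Verifying a single
 * entry means hashing its whole group (more time), but storage drops
 * from one hash per entry to one hash per group (less space). */
static uint64_t group_checksum(const struct entry *e, size_t n)
{
	return fnv1a(e, n * sizeof(*e));
}
```

With group size 1 this degenerates to the per-entry scheme; with one
group for the whole index it degenerates to the current whole-file
trailer checksum, so the group size is exactly the space/time knob.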
-- 
Duy