On Fri, Mar 30, 2012 at 4:06 AM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
> Is it really the time consuming part in the overall picture? Do you have
> vital statistics for various pieces to justify that you are solving the
> right problem? E.g. (not exhaustive)
>
> - read_index(): overhead to
>   - validate the checksum of the whole index;
>   - extract entries into the_index.cache[] array;
>
> - write_index(): overhead to
>   - serialize entries into the on-disk format;
>   - compute the checksum over the entire index;
>   - write the whole index to disk;
>
> - frequency of the above two operations.

Also maybe the frequency of entry updates vs. additions/removals. I
suspect a refresh operation can in some cases update a lot of entries.
If that's the case (and it happens often), we may need special
treatment for it, because simply appending entries might be costly.

> Also, optimizing the write codepath by penalizing the read codepath is a
> wrong trade-off if the index is read far more often than it is written
> during a single day of a developer.

I suspect so too, but some measurements have to be done there. It'd be
good if you provided a patch to collect index operation statistics.
Some of us could try it out for a few weeks. That would give us a
better picture.
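Something along these lines could be a starting point (a rough sketch,
untested; GIT_INDEX_STATS, the log format and the helper names are all
made up for illustration, but the natural hook points would be
read_index_from() and write_index() in read-cache.c):

/*
 * Sketch only: per-operation index statistics, appended to the
 * file named by GIT_INDEX_STATS (a hypothetical variable).
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

static double stats_now(void)
{
	struct timeval tv;
	gettimeofday(&tv, NULL);
	return tv.tv_sec + tv.tv_usec / 1e6;
}

/* One line per index read/write; aggregate the logs offline. */
static void log_index_op(const char *op, unsigned int nr_entries,
			 double elapsed)
{
	const char *path = getenv("GIT_INDEX_STATS");
	FILE *fp;

	if (!path)
		return;
	fp = fopen(path, "a");
	if (!fp)
		return;
	fprintf(fp, "%s entries=%u elapsed=%.6f\n", op, nr_entries, elapsed);
	fclose(fp);
}

read_index_from() and write_index() would bracket their work with
stats_now() and call log_index_op(); a few weeks of such logs should
show the read/write ratio and where the time actually goes.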
On Thu, Mar 29, 2012 at 10:21 PM, Thomas Gummerer <t.gummerer@xxxxxxxxx> wrote:
> - Tree structure
> The advantage of the tree structure over the append-only data structure that was
> proposed would be that it would not require any merge or other maintenance work
> after a change of a file. The disadvantage however is that changing a file
> would always require log(n) changes to the index, where n is the number of
> entries in the index. Another problem might be the integrity check, which would
> need either to use a hash over the whole index file or a hash on the tree,
> which would however take more time for checking.

I'd say it takes less time for checksumming, because we only verify the
trees we actually read. And a tree-based structure allows us to read
just a subdirectory, an advantage for multi-project repositories where
people stay in a subdirectory most of the time.

Shawn suspected that a crc32 checksum over a large chunk of data may be
insufficient. By hashing tree by tree, the chunks become significantly
shorter, making crc32 a viable and cheap candidate compared to sha-1
(note also that crc32 is already used to verify compressed objects in
a pack).
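To illustrate (again a sketch; the "tree block" layout with a trailing
four-byte crc is made up, but crc32() is zlib's, the same routine we
already use for the pack index v2 object CRCs):

/*
 * Sketch: verifying the index tree by tree with zlib's crc32().
 * Each directory's entries form one block ending in a 4-byte crc
 * (big-endian); reading a subdirectory only checksums the blocks
 * on that path instead of hashing the whole file.
 */
#include <stddef.h>
#include <stdint.h>
#include <zlib.h>

static uint32_t tree_block_crc(const unsigned char *buf, size_t len)
{
	uLong crc = crc32(0L, Z_NULL, 0);
	return (uint32_t)crc32(crc, buf, (uInt)len);
}

/* Returns 0 if the block's stored crc matches its contents. */
static int tree_block_verify(const unsigned char *buf, size_t len)
{
	uint32_t stored;

	if (len < 4)
		return -1;
	stored = (uint32_t)buf[len - 4] << 24 |
		 (uint32_t)buf[len - 3] << 16 |
		 (uint32_t)buf[len - 2] << 8 |
		 (uint32_t)buf[len - 1];
	return tree_block_crc(buf, len - 4) == stored ? 0 : -1;
}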
--
Duy