Re: [GSoC] Designing a faster index format

Shawn Pearce <spearce@xxxxxxxxxxx> · Thu, 5 Apr 2012 20:22:26 -0700

On Thu, Apr 5, 2012 at 14:49, Thomas Rast <trast@xxxxxxxxxxx> wrote:
> Thomas Gummerer <italyhockeyfeed@xxxxxxxxx> writes:
>
>> -- Proposed solution --
>> The proposed solution is to redesign the index to a B-tree based format. This
>> allows changes to the index in O(log(n)) time, with n being the number of
>> entries in the index.
>
>> -- Solutions that were also considered --
>> - Append-only data structure
>> - Database format
>> - Padded structure
>
> This is quite complete already, which I think is great, but it's still
> missing one "obvious" approach: a directory-tree based layout that uses
> "flat" storage.  That is, the entries grouped by directory and thus
> arranged into the "natural" tree, so as to allow parsing only part of
> it.  But not pulling any tricks to make it easy to change; a nontrivial
> change would mean a rewrite.  How good do you think that could be?

I have been wondering this myself. Aren't most updates to the index
just updating the stat information of an existing entry?

If so we could structure the index as flat lists for each directory
similar to a canonical tree, but with a wider field to hold not just
the SHA-1 but also the stat information of each file. If the entry is
just the component name ("foo.c" and not "src/foo.c") and the SHA-1
and stat data, you can probably protect the entire entry with a CRC-32
for each entry. Updates can be performed in place by taking the write
lock with index.lock as an empty file, then overwriting the SHA-1 and
stat field followed, by updating the CRC-32. Readers that see the
wrong CRC-32 for an entry can sleep for a short period, retry the
read, and fail after some number of attempts if they cannot get a
valid read of that entry.

Adding a new file or deleting an existing file from the index would
require a full rewrite.

Within a single tree/directory entry, it probably doesn't have to be
binary searchable. Canonical trees aren't. Linear scans through a
directory is OK, so long as the scans are broken up by the directory
tree structure just like they are in canonical trees.

Dealing with the conflict stages during merges (1, 2, 3) could be
handled by appending the conflict data at the end of the index file,
when conflicts are resolved this tail region could be truncated back
to the real end of the file. A bit could be set on the normal entry in
the trees to denote there is a conflict, and additional stage data is
expected in the tail region of the file.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html