Re: GSoC - Designing a faster index format

Shawn Pearce <spearce@xxxxxxxxxxx> · Mon, 26 Mar 2012 09:08:12 -0700

On Mon, Mar 26, 2012 at 08:25, Nguyen Thai Ngoc Duy <pclouds@xxxxxxxxx> wrote:
> On Mon, Mar 26, 2012 at 9:28 PM, Thomas Rast <trast@xxxxxxxxxxxxxxx> wrote:
>> elton sky <eltonsky9404@xxxxxxxxx> writes:
>>
>>> On Mon, Mar 26, 2012 at 12:06 PM, Nguyen Thai Ngoc Duy
>>> <pclouds@xxxxxxxxx> wrote:
>>>> (I think this should be on git@vger as there are many experienced devs there)
>>>>
>>>> On Sun, Mar 25, 2012 at 11:13 AM, elton sky <eltonsky9404@xxxxxxxxx> wrote:
>>>>> About the new format:
>>>>>
>>>>> The index is a single file. Entries in the index still stored
>>>>> sequentially as old format. The difference is they are grouped into
>>>>> blocks. A block contains many entries and they are ordered by names.
>>>>> Blocks are also ordered by the name of the first entry. Each block
>>>>> contains a sha1 for entries in it.
>>>>
>>>> If I remove an entry in the first block, because blocks are of fixed
>>>> size, you would need to shift all entries up by one, thus update all
>>>> blocks?
>>>
>>> We need some GC here. I am not moving all blocks. Rather I would
>>> consider merge or recycle the block. In a simple case if a block
>>> becomes empty, I ll change the offset of new block in the header point
>>> to this block, and make this block points to the original offset of
>>> new block. In this way, I keep the list of empty blocks I can reuse.
>> [...]
>>
>> Doesn't that venture into database land?
>>
>> If we go that far, wouldn't it be better to use a proper database
>> library?  All other things being equal, writing such complex code from
>> scratch is probably not a good idea.
>
> If there's a library that fits our needs (including linking
> statically). I think we've come close to sqlite file format [1]. But
> sqlite comes with sql engine, transactional updates... that we don't
> need. Another obvious source for inspiration is file systems, but I
> dare not go that way.
>
> [1] http://www.sqlite.org/fileformat2.html

Or use LevelDb[2]. Its BSD license. Uses an immutable file format, but
writes updates to new smaller files and eventually collapses
everything back together into a bigger file. This can be a
dramatically simpler approach than dealing with your own free block
system inside of a single file. Its only real downside is needing to
periodically pay a penalty to rewrite the whole index. But this
rewrite is going to be faster than the time it takes to rewrite the
pack files for the same repository, which git gc or git repack
handles. So I don't think its actually a problem for the index.

You might even be able to take a two level approach to compacting the
LevelDb database (or something like it). In a minor compaction you
compact all of the files except the huge base file, leaving you with 2
files. A huge base file that contains the first tree the user checked
out, and a second smaller file containing any differences they have
since the initial checkout (this may just be updated stat data for a
handful of files that differed across two branches as they switched
back and forth). During a git gc or git repack, add a new stage to
collapse the base file and everything else into a single new base file
as a major compaction.

[2] http://code.google.com/p/leveldb/
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html