Re: GSoC - Designing a faster index format

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Apr 2, 2012 at 9:27 PM, Shawn Pearce <spearce@xxxxxxxxxxx> wrote:
> On Mon, Apr 2, 2012 at 08:31, Nguyen Thai Ngoc Duy <pclouds@xxxxxxxxx> wrote:
>> The index size will increase continuously. At some point, we will need
>> to repack it. Let assume a tree block is 64k on average and a path
>> generally consists of 3 path components.  That means an entry update
>> adds 192k and we can do about 80 updates before index reaches 16M (in
>> addition to initial index size).
>
> Only 3 path components? Java sources can easily have 8-10 with a long
> Maven and Java package implied prefix. This will increase the
> frequency of rewrites of the index file.

Yes, but in java case, users could adjust the rewrite limits to make
it less often. The index will be bigger, but because we mmap it and
only access parts of it, index size does not matter much.

>> At 16M or when trailer generation hits a limit (the limit can be
>> configurable), we rewrite the index to reduce its size. Some heavy
>> operations can also be used to rewrite index, such as checkout or
>> reset.
>>
>> The index integrity is verified by crc32. One crc32 covers header and
>> trailer. Each block has its own crc32. When the index is found
>> corrupt, we could try to roll back to latest good version by looking
>> for trailers from bottom up. Even when the index is not corrupt, users
>> can still look back this way for older index versions.
>
> How do you deal with a partially written append to the index file?
> E.g. if a prior update crashes or the filesystem doesn't write
> everything out before power failure, you need to find the last good
> trailer block in the file.

By looking for the trailer signature "ROOT" from bottom up, then
verify if it's still good (i.e. verifying all trees) from there.
Repeat until we find a good one.

>> = The git index file has the following format
>>
>>   - A 8-byte header consisting of
>>
>>     4-byte signature:
>>       The signature is { 'T', 'R', 'E', 'E' }
>>
>>     4-byte version number:
>>       The current supported versions are 1.
>
> Why not DIRC version 4?

I thought of that, but because I don't keep header format the same as
v3, I thought signature should change too. But this is really not
important at this stage.

>>   - A number of blocks of variable size
>>
>>      1-byte block type
>>
>>      3-byte content size in byte
>>
>>      block content
>
> So you are limiting the size of a canonical tree now? Currently there
> is no limit on the size a tree. But here the entire index structure
> plus set of names must be under 16 MiB. Granted no project probably
> hits that limit, but you are painting us into a corner with an upper
> limit here that doesn't look like it will be easy to increase.

We can introduce a new block type, not a nice approach though. Not
saving block size is probably ok too. We would need something to mark
end-of-block. I wanted to save block size to do crc32 quickly without
parsing the block, but instead, we could make block parsing faster and
not worry about it.

>> == Tree block
>>
>>  A tree block contains a (maybe invalid) tree object and extra
>>  information of its companion in working directory. Tree block has
>>  block type 'T'.
>>
>>  Tree block content is basically the list of non-recursive entries in
>>  specified path, with all attributes we store in the index now. There
>>  are a few changes though to intergrate cache-tree and allow
>>  bsearch() on mmap'd block.
>>
>>  A tree block content consists of
>>
>>  - 4-byte tree object size
>>
>>  - 20-byte SHA-1 of the cached tree object
>>
>>  - a list attributes corresponding to tree object's item, in the same
>>    order.  These attributes are the same as in DIRC entry format
>>    except that entry name is removed, and a tree block offset is
>>    added in case the item is a directory.
>>
>> ...
>>
>>    A 16-bit offset, relative to the beginning of this block, to the
>>      pathname of this entry. FIXME: make it 32-bit, relative to the
>>      beginning of the file, so that we can reuse pathnames from other
>>      (old) blocks?
>
> 16 bit offset doesn't work well in a block that can be as large as 2^24.

No it doesn't. That's the implication of using 16-bit offsets.

> If you reuse a path name list at the start of the file, how do you
> handle new names?

We don't drop this part even if we reuse a few path names elsewhere.
New path names can be put here. But I'm not really sure if reusing
names gains us anything because at least it breaks locality and
complicates handling code. We already save quite a bit by not
duplicating parent prefix.
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]