Re: GSoC - Designing a faster index format

elton sky <eltonsky9404@xxxxxxxxx> · Fri, 6 Apr 2012 13:15:44 +1000

NOTE:
Please ignore the

>- checkpoint : interleave with items in the block, 1 for every 100 items
>       4-byte offset to next checkpoint

That's irrelevant.

Cheers,
Elton

On Fri, Apr 6, 2012 at 1:13 PM, elton sky <eltonsky9404@xxxxxxxxx> wrote:
> Thank you everyone for ideas, clues, explanations and questions.
> Collectively I wrote my proposal. This one is mostly based on Duy's
> suggestion with minor changes.
>
>
> Problem:
> Current index store all files in a list. This implies that:
> 1. Each operation has to read the whole index and write the whole index back.
> 2. computing the checksum for the whole index.
> These become expensive when the repo is large.
>
> Requirement:
> Suppose n is the number objects in a repo
> -Keep the read/write time <= O(logn).
> -Keep the checksum computed against only necessary objects.
> -New format is easy to parse.
> -Backward compatible.
> -Potentially use faster hash method.
>
> Proposed solution:
>
> Store the repo structure in a canonical tree. Each directory is a tree
> block. A tree block contains blobs and offsets to sub directories. It
> has its own checksum. A read/write will be done on tree block base. A
> tree block also contains an offset points to its previous version (if
> there's one).
>
> The root of offset is stored in trailer, which stays at the end of
> file. Each update creates a new trailer which points to the new tree.
> (details below)
>
> To save the pain of modify a tree block and track the free spaces in
> the index, changed blocks are appended at the end. Based on the
> assumption that user won't change too many files each time, for an
> updated file, all its parent blocks to Root was copied and appended to
> index. In other words, all traversed blocks are copied. The offset of
> previous version of the updated block is stored in the new block. A
> new generation number is stored in copied and updated blocks. Other
> blocks are not copied. They are referenced by offsets. After update a
> new trailer is created at the end. In this way, there's no harm to
> read, and makes write fast.
>
> Trailer stores the offset of previous trailer. It makes tracking old
> versions easy.
>
> Each operation will load and rewrite the header and visited trailer .
>
> Checksum for non identifier purpose will use crc32. Otherwise it uses sha1.
>
> For compatibility, old format of index will be transformed to new
> format in the first operation.
>
> ==
> Index format:
>
> - A 8-byte header consisting of
>
>    4-byte signature:
>      The signature is { 'T', 'R', 'E', 'E' }
>
>    4-byte version number:
>      The current supported versions are 4.
>
>  - A number of blocks of variable size
>
>     1-byte block type
>
>     3-byte content size in byte
>
>     block content
>
>     4-byte crc32 of all above
>
>  - A 20-byte trailer consisting of
>
>     4-byte trailer signature:
>       The signature is { 'R', 'O', 'O', 'T' }
>
>     4-byte root block offset
>
>     4-byte extension table offset:
>       Zero means no extension
>
>     4-byte offset to previous trailer
>
>     4-byte checksum:
>       CRC32 of the header and the trailer (excluding this field)
>
> ==
> Tree block:
>
> Tree block content is basically the list of entries in a specified
> path, with all attributes we store in the index now. This entry list
> is sorted by pathname. For doing a bsearch in the list, pathnames are
> stored at the end of block, which makes the size of entry fixed. The
> pathname is pointed from each entry with 2 byte offset (relative to a
> block). This should not be problem as a block is never modified.
>
> It stores the generation number and the offset to old block. It also
> integrates the content of cache-tree.
>
> A tree block content consists of
>
>  - 4-byte tree object size
>
>  - 20-byte SHA-1 of the cached tree object
>
> - checkpoint : interleave with items in the block, 1 for every 100 items
>        4-byte offset to next checkpoint
>
>  - a list attributes corresponding to tree object's item, in the same
>   order. These attributes are the same as in DIRC entry format
>   except that entry name is removed, and a tree block offset is
>   added in case the item is a directory.
>
>   32-bit ctime seconds, the last time a file's metadata changed
>     this is stat(2) data
>
>   32-bit ctime nanosecond fractions
>     this is stat(2) data
>
>   32-bit mtime seconds, the last time a file's data changed
>     this is stat(2) data
>
>   32-bit mtime nanosecond fractions
>     this is stat(2) data
>
>   32-bit dev
>     this is stat(2) data
>
>   32-bit ino
>     this is stat(2) data
>
>   32-bit mode, split into (high to low bits)
>
>     4-bit object type
>       valid values in binary are 1000 (regular file), 1010 (symbolic link)
>       and 1110 (gitlink)
>
>     3-bit unused
>
>     9-bit unix permission. Only 0755 and 0644 are valid for regular files.
>     Symbolic links and gitlinks have value 0 in this field.
>
>   32-bit uid
>     this is stat(2) data
>
>   32-bit gid
>     this is stat(2) data
>
>   32-bit file size
>     This is the on-disk size from stat(2), truncated to 32-bit.
>
>   160-bit SHA-1 for the represented object if blobs or the offset
>     to another tree block if trees
>
>   A 32-bit 'flags' field split into (high to low bits)
>
>     1-bit assume-valid flag
>
>     1-bit extended flag (must be zero in version 2)
>
>     2-bit stage (during merge)
>
>     12-bit name length if the length is less than 0xFFF; otherwise 0xFFF
>     is stored in this field.
>
>     1-bit skip-worktree flag (used by sparse checkout)
>
>     1-bit intent-to-add flag (used by "git add -N")
>
>     14-bit unused, must be zero
>
>      A 16-bit offset, relative to the beginning of this block, to the
> pathname of this entry
>
> - 2-byte generation number, starts from 1
>
> - 4-byte previous version offset
>
>  - a list of NUL-terminated pathnames, pointed to from the 16-bit offset
>   above
>
> == Extension table block
>
> Extension table has block type 'X'. It consists of a series of 4-byte
> extension block offset.
>
> == Extension block
>
> Extension block has block type 'E'. Extension content is the same as
> in the old format.
>
>
>
> Time line:
>
> 24/04 ~ 21/05: get familiar with code base and revise proposal
> benchmark with linux kernel on major operations
> write prototype to prove feasibility of proposed solution
> consult mailing list & irc
>
> 22/05 ~ 25/06 write code, test and benchmark
>        modify tree lib
>        modify index format operations
>        modify git operations
>        transform from old to new
>
> 26/06 ~ 30/07 revise things according to benchmark
>
> 31/07 ~ 13/08 update documentation
>
>
> About me
>
> My name is Elton Tian, I am from China. I have been living in
> Australia for quite a few years. I am currently a Master student from
> Australia National University. After graduate I worked on linux based
> web development (using tcl) for 2.5 years. I have been programming
> with c, c#, java, tcl, php, javascript and shell script. But I prefer
> c, which gives me the feeling of full control over the program. I am
> interested in data intensive computing. I played with hadoop and
> mapreduce since 2010. I maintained my 3 node cluster behind my desk.
> As linus hates cvs with a passion, I still want to mention I was using
> cvs in work, don't like it though. And I guess this is the good chance
> to get over it.
>
> On Thu, Apr 5, 2012 at 2:22 AM, elton sky <eltonsky9404@xxxxxxxxx> wrote:
>> Hello,
>>
>> Some updates for Nguyen's index:
>>
>> On Wed, Apr 4, 2012 at 10:20 PM, Nguyen Thai Ngoc Duy <pclouds@xxxxxxxxx> wrote:
>>> On Wed, Apr 4, 2012 at 3:26 PM, elton sky <eltonsky9404@xxxxxxxxx> wrote:
>>>> I am not sure how the trailer works.
>>>> I assume there can be multiple trailers, each update will generate a
>>>> new one. Every trailer will point to the root tree (i.e. all trailers
>>>> point to the same block?). So if there are some changes to root, like
>>>> rename, trailers all point to the latest root block?
>>>
>>> Each trailer points to the whole new tree. Because trees are
>>> immutable, changing in a tree meangs creating a new one and will also
>>> make a new parent tree (to point to the updated tree because old
>>> parent will always point to old tree). This eventually leads to root
>>> tree change, recorded by the trailer.
>>>
>>>> Is the index looks like :
>>>> | HEADER | TREE BLOCKS | TRAILER |  TREE BLOCKS | TRAILER | TREE
>>>> BLOCKS | TRAILER | ...
>>>>
>>
>> Once an update happened to a block, all parent blocks to root will be
>> copied to the end of the index. And the updated block will be at leaf
>> of the new tree. Finding this path costs logn time anyway. This is no
>> harm for read for the whole tree. But in order to find all previous
>> changes to a block, we have to go through all trailers and trees.
>>
>> Otherwise, just modify the original tree. Let the parent points to the
>> updated block and let the updated block points to the old block:
>>
>> parent
>>    |
>>   V
>> updated            old                  old
>> block (v3)  -->   block(v2)  -->  block (v1)
>>    |
>>   V
>> child
>> blocks
>>
>> A version number in a tree block is used to track the changes.
>> In this way, there's still no harm to read, and it's more easy to
>> trace the change history of a block. Also, we don't need to create
>> interleaved blocks and trailers. There's only one trailer in the end
>> of file. We also need to add another offset points to previous
>> version.
>>
>> Trailer is kept at the end of index, as its size is variable. It
>> contains offset to root and list of free spaces.
>>
>> Changes to format:
>>
>>> = The git index file has the following format
>>
>> As is, except there's only one trailer now. And trailer contains list
>> of free spaces.
>>
>>>- A 18-byte trailer consisting of
>>>
>>>      4-byte trailer signature:
>>>        The signature is { 'R', 'O', 'O', 'T' }
>>>
>>>      2-byte generation:
>>>         The first trailer is 0, the second 1 and so on.
>>>
>>>      4-byte root block offset
>>>
>>>      4-byte extension table offset:
>>>        Zero means no extension
>>>
>>        list of free spaces
>>        - 4 byte offset
>>        - 2 byte length
>>
>>>      4-byte checksum:
>>>        CRC32 of the header and the trailer (excluding this field)
>>
>> Free space list is read/written in whole for each operation, together
>> with trailer.
>>
>>>
>>> == Tree block
>>> ...
>>
>> Above as is.
>>
>> - 1 byte version num
>>
>> - 4 byte offset to previous version block
>>
>>>
>>>    160-bit SHA-1 for the represented object if blobs or the offset
>>>      to another tree block if trees
>>>
>>>    A 32-bit 'flags' field split into (high to low bits)
>>>
>>>      1-bit assume-valid flag
>>>
>>>      1-bit extended flag (must be zero in version 2)
>>>
>>>      2-bit stage (during merge)
>>>
>>>      12-bit name length if the length is less than 0xFFF; otherwise 0xFFF
>>>      is stored in this field.
>>>
>>>      1-bit skip-worktree flag (used by sparse checkout)
>>>
>>>      1-bit intent-to-add flag (used by "git add -N")
>>>
>>>      14-bit unused, must be zero
>>>
>>>    A 16-bit offset, relative to the beginning of this block, to the
>>>      pathname of this entry. FIXME: make it 32-bit, relative to the
>>>      beginning of the file, so that we can reuse pathnames from other
>>>      (old) blocks?
>>>
>>
>> -Elton
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html