Re: GSoC - Designing a faster index format

elton sky <eltonsky9404@xxxxxxxxx> · Sat, 7 Apr 2012 18:29:49 +1000

.... just realize I have to register and submit my proposal on gsoc
website, rather than here.

I stupidly missed that ...

silly enough..

On Fri, Apr 6, 2012 at 1:15 PM, elton sky <eltonsky9404@xxxxxxxxx> wrote:
> NOTE:
> Please ignore the
>
>>- checkpoint : interleave with items in the block, 1 for every 100 items
>>       4-byte offset to next checkpoint
>
> That's irrelevant.
>
> Cheers,
> Elton
>
> On Fri, Apr 6, 2012 at 1:13 PM, elton sky <eltonsky9404@xxxxxxxxx> wrote:
>> Thank you everyone for ideas, clues, explanations and questions.
>> Collectively I wrote my proposal. This one is mostly based on Duy's
>> suggestion with minor changes.
>>
>>
>> Problem:
>> Current index store all files in a list. This implies that:
>> 1. Each operation has to read the whole index and write the whole index back.
>> 2. computing the checksum for the whole index.
>> These become expensive when the repo is large.
>>
>> Requirement:
>> Suppose n is the number objects in a repo
>> -Keep the read/write time <= O(logn).
>> -Keep the checksum computed against only necessary objects.
>> -New format is easy to parse.
>> -Backward compatible.
>> -Potentially use faster hash method.
>>
>> Proposed solution:
>>
>> Store the repo structure in a canonical tree. Each directory is a tree
>> block. A tree block contains blobs and offsets to sub directories. It
>> has its own checksum. A read/write will be done on tree block base. A
>> tree block also contains an offset points to its previous version (if
>> there's one).
>>
>> The root of offset is stored in trailer, which stays at the end of
>> file. Each update creates a new trailer which points to the new tree.
>> (details below)
>>
>> To save the pain of modify a tree block and track the free spaces in
>> the index, changed blocks are appended at the end. Based on the
>> assumption that user won't change too many files each time, for an
>> updated file, all its parent blocks to Root was copied and appended to
>> index. In other words, all traversed blocks are copied. The offset of
>> previous version of the updated block is stored in the new block. A
>> new generation number is stored in copied and updated blocks. Other
>> blocks are not copied. They are referenced by offsets. After update a
>> new trailer is created at the end. In this way, there's no harm to
>> read, and makes write fast.
>>
>> Trailer stores the offset of previous trailer. It makes tracking old
>> versions easy.
>>
>> Each operation will load and rewrite the header and visited trailer .
>>
>> Checksum for non identifier purpose will use crc32. Otherwise it uses sha1.
>>
>> For compatibility, old format of index will be transformed to new
>> format in the first operation.
>>
>> ==
>> Index format:
>>
>> - A 8-byte header consisting of
>>
>>    4-byte signature:
>>      The signature is { 'T', 'R', 'E', 'E' }
>>
>>    4-byte version number:
>>      The current supported versions are 4.
>>
>>  - A number of blocks of variable size
>>
>>     1-byte block type
>>
>>     3-byte content size in byte
>>
>>     block content
>>
>>     4-byte crc32 of all above
>>
>>  - A 20-byte trailer consisting of
>>
>>     4-byte trailer signature:
>>       The signature is { 'R', 'O', 'O', 'T' }
>>
>>     4-byte root block offset
>>
>>     4-byte extension table offset:
>>       Zero means no extension
>>
>>     4-byte offset to previous trailer
>>
>>     4-byte checksum:
>>       CRC32 of the header and the trailer (excluding this field)
>>
>> ==
>> Tree block:
>>
>> Tree block content is basically the list of entries in a specified
>> path, with all attributes we store in the index now. This entry list
>> is sorted by pathname. For doing a bsearch in the list, pathnames are
>> stored at the end of block, which makes the size of entry fixed. The
>> pathname is pointed from each entry with 2 byte offset (relative to a
>> block). This should not be problem as a block is never modified.
>>
>> It stores the generation number and the offset to old block. It also
>> integrates the content of cache-tree.
>>
>> A tree block content consists of
>>
>>  - 4-byte tree object size
>>
>>  - 20-byte SHA-1 of the cached tree object
>>
>> - checkpoint : interleave with items in the block, 1 for every 100 items
>>        4-byte offset to next checkpoint
>>
>>  - a list attributes corresponding to tree object's item, in the same
>>   order. These attributes are the same as in DIRC entry format
>>   except that entry name is removed, and a tree block offset is
>>   added in case the item is a directory.
>>
>>   32-bit ctime seconds, the last time a file's metadata changed
>>     this is stat(2) data
>>
>>   32-bit ctime nanosecond fractions
>>     this is stat(2) data
>>
>>   32-bit mtime seconds, the last time a file's data changed
>>     this is stat(2) data
>>
>>   32-bit mtime nanosecond fractions
>>     this is stat(2) data
>>
>>   32-bit dev
>>     this is stat(2) data
>>
>>   32-bit ino
>>     this is stat(2) data
>>
>>   32-bit mode, split into (high to low bits)
>>
>>     4-bit object type
>>       valid values in binary are 1000 (regular file), 1010 (symbolic link)
>>       and 1110 (gitlink)
>>
>>     3-bit unused
>>
>>     9-bit unix permission. Only 0755 and 0644 are valid for regular files.
>>     Symbolic links and gitlinks have value 0 in this field.
>>
>>   32-bit uid
>>     this is stat(2) data
>>
>>   32-bit gid
>>     this is stat(2) data
>>
>>   32-bit file size
>>     This is the on-disk size from stat(2), truncated to 32-bit.
>>
>>   160-bit SHA-1 for the represented object if blobs or the offset
>>     to another tree block if trees
>>
>>   A 32-bit 'flags' field split into (high to low bits)
>>
>>     1-bit assume-valid flag
>>
>>     1-bit extended flag (must be zero in version 2)
>>
>>     2-bit stage (during merge)
>>
>>     12-bit name length if the length is less than 0xFFF; otherwise 0xFFF
>>     is stored in this field.
>>
>>     1-bit skip-worktree flag (used by sparse checkout)
>>
>>     1-bit intent-to-add flag (used by "git add -N")
>>
>>     14-bit unused, must be zero
>>
>>      A 16-bit offset, relative to the beginning of this block, to the
>> pathname of this entry
>>
>> - 2-byte generation number, starts from 1
>>
>> - 4-byte previous version offset
>>
>>  - a list of NUL-terminated pathnames, pointed to from the 16-bit offset
>>   above
>>
>> == Extension table block
>>
>> Extension table has block type 'X'. It consists of a series of 4-byte
>> extension block offset.
>>
>> == Extension block
>>
>> Extension block has block type 'E'. Extension content is the same as
>> in the old format.
>>
>>
>>
>> Time line:
>>
>> 24/04 ~ 21/05: get familiar with code base and revise proposal
>> benchmark with linux kernel on major operations
>> write prototype to prove feasibility of proposed solution
>> consult mailing list & irc
>>
>> 22/05 ~ 25/06 write code, test and benchmark
>>        modify tree lib
>>        modify index format operations
>>        modify git operations
>>        transform from old to new
>>
>> 26/06 ~ 30/07 revise things according to benchmark
>>
>> 31/07 ~ 13/08 update documentation
>>
>>
>> About me
>>
>> My name is Elton Tian, I am from China. I have been living in
>> Australia for quite a few years. I am currently a Master student from
>> Australia National University. After graduate I worked on linux based
>> web development (using tcl) for 2.5 years. I have been programming
>> with c, c#, java, tcl, php, javascript and shell script. But I prefer
>> c, which gives me the feeling of full control over the program. I am
>> interested in data intensive computing. I played with hadoop and
>> mapreduce since 2010. I maintained my 3 node cluster behind my desk.
>> As linus hates cvs with a passion, I still want to mention I was using
>> cvs in work, don't like it though. And I guess this is the good chance
>> to get over it.
>>
>> On Thu, Apr 5, 2012 at 2:22 AM, elton sky <eltonsky9404@xxxxxxxxx> wrote:
>>> Hello,
>>>
>>> Some updates for Nguyen's index:
>>>
>>> On Wed, Apr 4, 2012 at 10:20 PM, Nguyen Thai Ngoc Duy <pclouds@xxxxxxxxx> wrote:
>>>> On Wed, Apr 4, 2012 at 3:26 PM, elton sky <eltonsky9404@xxxxxxxxx> wrote:
>>>>> I am not sure how the trailer works.
>>>>> I assume there can be multiple trailers, each update will generate a
>>>>> new one. Every trailer will point to the root tree (i.e. all trailers
>>>>> point to the same block?). So if there are some changes to root, like
>>>>> rename, trailers all point to the latest root block?
>>>>
>>>> Each trailer points to the whole new tree. Because trees are
>>>> immutable, changing in a tree meangs creating a new one and will also
>>>> make a new parent tree (to point to the updated tree because old
>>>> parent will always point to old tree). This eventually leads to root
>>>> tree change, recorded by the trailer.
>>>>
>>>>> Is the index looks like :
>>>>> | HEADER | TREE BLOCKS | TRAILER |  TREE BLOCKS | TRAILER | TREE
>>>>> BLOCKS | TRAILER | ...
>>>>>
>>>
>>> Once an update happened to a block, all parent blocks to root will be
>>> copied to the end of the index. And the updated block will be at leaf
>>> of the new tree. Finding this path costs logn time anyway. This is no
>>> harm for read for the whole tree. But in order to find all previous
>>> changes to a block, we have to go through all trailers and trees.
>>>
>>> Otherwise, just modify the original tree. Let the parent points to the
>>> updated block and let the updated block points to the old block:
>>>
>>> parent
>>>    |
>>>   V
>>> updated            old                  old
>>> block (v3)  -->   block(v2)  -->  block (v1)
>>>    |
>>>   V
>>> child
>>> blocks
>>>
>>> A version number in a tree block is used to track the changes.
>>> In this way, there's still no harm to read, and it's more easy to
>>> trace the change history of a block. Also, we don't need to create
>>> interleaved blocks and trailers. There's only one trailer in the end
>>> of file. We also need to add another offset points to previous
>>> version.
>>>
>>> Trailer is kept at the end of index, as its size is variable. It
>>> contains offset to root and list of free spaces.
>>>
>>> Changes to format:
>>>
>>>> = The git index file has the following format
>>>
>>> As is, except there's only one trailer now. And trailer contains list
>>> of free spaces.
>>>
>>>>- A 18-byte trailer consisting of
>>>>
>>>>      4-byte trailer signature:
>>>>        The signature is { 'R', 'O', 'O', 'T' }
>>>>
>>>>      2-byte generation:
>>>>         The first trailer is 0, the second 1 and so on.
>>>>
>>>>      4-byte root block offset
>>>>
>>>>      4-byte extension table offset:
>>>>        Zero means no extension
>>>>
>>>        list of free spaces
>>>        - 4 byte offset
>>>        - 2 byte length
>>>
>>>>      4-byte checksum:
>>>>        CRC32 of the header and the trailer (excluding this field)
>>>
>>> Free space list is read/written in whole for each operation, together
>>> with trailer.
>>>
>>>>
>>>> == Tree block
>>>> ...
>>>
>>> Above as is.
>>>
>>> - 1 byte version num
>>>
>>> - 4 byte offset to previous version block
>>>
>>>>
>>>>    160-bit SHA-1 for the represented object if blobs or the offset
>>>>      to another tree block if trees
>>>>
>>>>    A 32-bit 'flags' field split into (high to low bits)
>>>>
>>>>      1-bit assume-valid flag
>>>>
>>>>      1-bit extended flag (must be zero in version 2)
>>>>
>>>>      2-bit stage (during merge)
>>>>
>>>>      12-bit name length if the length is less than 0xFFF; otherwise 0xFFF
>>>>      is stored in this field.
>>>>
>>>>      1-bit skip-worktree flag (used by sparse checkout)
>>>>
>>>>      1-bit intent-to-add flag (used by "git add -N")
>>>>
>>>>      14-bit unused, must be zero
>>>>
>>>>    A 16-bit offset, relative to the beginning of this block, to the
>>>>      pathname of this entry. FIXME: make it 32-bit, relative to the
>>>>      beginning of the file, so that we can reuse pathnames from other
>>>>      (old) blocks?
>>>>
>>>
>>> -Elton
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html