Re: GSoC - Designing a faster index format

elton sky <eltonsky9404@xxxxxxxxx> · Fri, 6 Apr 2012 13:13:34 +1000

Thank you everyone for ideas, clues, explanations and questions.
Collectively I wrote my proposal. This one is mostly based on Duy's
suggestion with minor changes.

Problem:
Current index store all files in a list. This implies that:
1. Each operation has to read the whole index and write the whole index back.
2. computing the checksum for the whole index.
These become expensive when the repo is large.

Requirement:
Suppose n is the number objects in a repo
-Keep the read/write time <= O(logn).
-Keep the checksum computed against only necessary objects.
-New format is easy to parse.
-Backward compatible.
-Potentially use faster hash method.

Proposed solution:

Store the repo structure in a canonical tree. Each directory is a tree
block. A tree block contains blobs and offsets to sub directories. It
has its own checksum. A read/write will be done on tree block base. A
tree block also contains an offset points to its previous version (if
there's one).

The root of offset is stored in trailer, which stays at the end of
file. Each update creates a new trailer which points to the new tree.
(details below)

To save the pain of modify a tree block and track the free spaces in
the index, changed blocks are appended at the end. Based on the
assumption that user won't change too many files each time, for an
updated file, all its parent blocks to Root was copied and appended to
index. In other words, all traversed blocks are copied. The offset of
previous version of the updated block is stored in the new block. A
new generation number is stored in copied and updated blocks. Other
blocks are not copied. They are referenced by offsets. After update a
new trailer is created at the end. In this way, there's no harm to
read, and makes write fast.

Trailer stores the offset of previous trailer. It makes tracking old
versions easy.

Each operation will load and rewrite the header and visited trailer .

Checksum for non identifier purpose will use crc32. Otherwise it uses sha1.

For compatibility, old format of index will be transformed to new
format in the first operation.

==
Index format:

- A 8-byte header consisting of

    4-byte signature:
      The signature is { 'T', 'R', 'E', 'E' }

    4-byte version number:
      The current supported versions are 4.

  - A number of blocks of variable size

     1-byte block type

     3-byte content size in byte

     block content

     4-byte crc32 of all above

  - A 20-byte trailer consisting of

     4-byte trailer signature:
       The signature is { 'R', 'O', 'O', 'T' }

     4-byte root block offset

     4-byte extension table offset:
       Zero means no extension

     4-byte offset to previous trailer

     4-byte checksum:
       CRC32 of the header and the trailer (excluding this field)

==
Tree block:

Tree block content is basically the list of entries in a specified
path, with all attributes we store in the index now. This entry list
is sorted by pathname. For doing a bsearch in the list, pathnames are
stored at the end of block, which makes the size of entry fixed. The
pathname is pointed from each entry with 2 byte offset (relative to a
block). This should not be problem as a block is never modified.

It stores the generation number and the offset to old block. It also
integrates the content of cache-tree.

A tree block content consists of

 - 4-byte tree object size

 - 20-byte SHA-1 of the cached tree object

- checkpoint : interleave with items in the block, 1 for every 100 items
	4-byte offset to next checkpoint

 - a list attributes corresponding to tree object's item, in the same
   order. These attributes are the same as in DIRC entry format
   except that entry name is removed, and a tree block offset is
   added in case the item is a directory.

   32-bit ctime seconds, the last time a file's metadata changed
     this is stat(2) data

   32-bit ctime nanosecond fractions
     this is stat(2) data

   32-bit mtime seconds, the last time a file's data changed
     this is stat(2) data

   32-bit mtime nanosecond fractions
     this is stat(2) data

   32-bit dev
     this is stat(2) data

   32-bit ino
     this is stat(2) data

   32-bit mode, split into (high to low bits)

     4-bit object type
       valid values in binary are 1000 (regular file), 1010 (symbolic link)
       and 1110 (gitlink)

     3-bit unused

     9-bit unix permission. Only 0755 and 0644 are valid for regular files.
     Symbolic links and gitlinks have value 0 in this field.

   32-bit uid
     this is stat(2) data

   32-bit gid
     this is stat(2) data

   32-bit file size
     This is the on-disk size from stat(2), truncated to 32-bit.

   160-bit SHA-1 for the represented object if blobs or the offset
     to another tree block if trees

   A 32-bit 'flags' field split into (high to low bits)

     1-bit assume-valid flag

     1-bit extended flag (must be zero in version 2)

     2-bit stage (during merge)

     12-bit name length if the length is less than 0xFFF; otherwise 0xFFF
     is stored in this field.

     1-bit skip-worktree flag (used by sparse checkout)

     1-bit intent-to-add flag (used by "git add -N")

     14-bit unused, must be zero

      A 16-bit offset, relative to the beginning of this block, to the
pathname of this entry

- 2-byte generation number, starts from 1

- 4-byte previous version offset

 - a list of NUL-terminated pathnames, pointed to from the 16-bit offset
   above

== Extension table block

Extension table has block type 'X'. It consists of a series of 4-byte
extension block offset.

== Extension block

Extension block has block type 'E'. Extension content is the same as
in the old format.

Time line:

24/04 ~ 21/05: get familiar with code base and revise proposal
benchmark with linux kernel on major operations
write prototype to prove feasibility of proposed solution
consult mailing list & irc

22/05 ~ 25/06 write code, test and benchmark
	modify tree lib
	modify index format operations
	modify git operations
	transform from old to new

26/06 ~ 30/07 revise things according to benchmark

31/07 ~ 13/08 update documentation

About me

My name is Elton Tian, I am from China. I have been living in
Australia for quite a few years. I am currently a Master student from
Australia National University. After graduate I worked on linux based
web development (using tcl) for 2.5 years. I have been programming
with c, c#, java, tcl, php, javascript and shell script. But I prefer
c, which gives me the feeling of full control over the program. I am
interested in data intensive computing. I played with hadoop and
mapreduce since 2010. I maintained my 3 node cluster behind my desk.
As linus hates cvs with a passion, I still want to mention I was using
cvs in work, don't like it though. And I guess this is the good chance
to get over it.

On Thu, Apr 5, 2012 at 2:22 AM, elton sky <eltonsky9404@xxxxxxxxx> wrote:
> Hello,
>
> Some updates for Nguyen's index:
>
> On Wed, Apr 4, 2012 at 10:20 PM, Nguyen Thai Ngoc Duy <pclouds@xxxxxxxxx> wrote:
>> On Wed, Apr 4, 2012 at 3:26 PM, elton sky <eltonsky9404@xxxxxxxxx> wrote:
>>> I am not sure how the trailer works.
>>> I assume there can be multiple trailers, each update will generate a
>>> new one. Every trailer will point to the root tree (i.e. all trailers
>>> point to the same block?). So if there are some changes to root, like
>>> rename, trailers all point to the latest root block?
>>
>> Each trailer points to the whole new tree. Because trees are
>> immutable, changing in a tree meangs creating a new one and will also
>> make a new parent tree (to point to the updated tree because old
>> parent will always point to old tree). This eventually leads to root
>> tree change, recorded by the trailer.
>>
>>> Is the index looks like :
>>> | HEADER | TREE BLOCKS | TRAILER |  TREE BLOCKS | TRAILER | TREE
>>> BLOCKS | TRAILER | ...
>>>
>
> Once an update happened to a block, all parent blocks to root will be
> copied to the end of the index. And the updated block will be at leaf
> of the new tree. Finding this path costs logn time anyway. This is no
> harm for read for the whole tree. But in order to find all previous
> changes to a block, we have to go through all trailers and trees.
>
> Otherwise, just modify the original tree. Let the parent points to the
> updated block and let the updated block points to the old block:
>
> parent
>    |
>   V
> updated            old                  old
> block (v3)  -->   block(v2)  -->  block (v1)
>    |
>   V
> child
> blocks
>
> A version number in a tree block is used to track the changes.
> In this way, there's still no harm to read, and it's more easy to
> trace the change history of a block. Also, we don't need to create
> interleaved blocks and trailers. There's only one trailer in the end
> of file. We also need to add another offset points to previous
> version.
>
> Trailer is kept at the end of index, as its size is variable. It
> contains offset to root and list of free spaces.
>
> Changes to format:
>
>> = The git index file has the following format
>
> As is, except there's only one trailer now. And trailer contains list
> of free spaces.
>
>>- A 18-byte trailer consisting of
>>
>>      4-byte trailer signature:
>>        The signature is { 'R', 'O', 'O', 'T' }
>>
>>      2-byte generation:
>>         The first trailer is 0, the second 1 and so on.
>>
>>      4-byte root block offset
>>
>>      4-byte extension table offset:
>>        Zero means no extension
>>
>        list of free spaces
>        - 4 byte offset
>        - 2 byte length
>
>>      4-byte checksum:
>>        CRC32 of the header and the trailer (excluding this field)
>
> Free space list is read/written in whole for each operation, together
> with trailer.
>
>>
>> == Tree block
>> ...
>
> Above as is.
>
> - 1 byte version num
>
> - 4 byte offset to previous version block
>
>>
>>    160-bit SHA-1 for the represented object if blobs or the offset
>>      to another tree block if trees
>>
>>    A 32-bit 'flags' field split into (high to low bits)
>>
>>      1-bit assume-valid flag
>>
>>      1-bit extended flag (must be zero in version 2)
>>
>>      2-bit stage (during merge)
>>
>>      12-bit name length if the length is less than 0xFFF; otherwise 0xFFF
>>      is stored in this field.
>>
>>      1-bit skip-worktree flag (used by sparse checkout)
>>
>>      1-bit intent-to-add flag (used by "git add -N")
>>
>>      14-bit unused, must be zero
>>
>>    A 16-bit offset, relative to the beginning of this block, to the
>>      pathname of this entry. FIXME: make it 32-bit, relative to the
>>      beginning of the file, so that we can reuse pathnames from other
>>      (old) blocks?
>>
>
> -Elton
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html