Thank you everyone for ideas, clues, explanations and questions. Collectively I wrote my proposal. This one is mostly based on Duy's suggestion with minor changes. Problem: Current index store all files in a list. This implies that: 1. Each operation has to read the whole index and write the whole index back. 2. computing the checksum for the whole index. These become expensive when the repo is large. Requirement: Suppose n is the number objects in a repo -Keep the read/write time <= O(logn). -Keep the checksum computed against only necessary objects. -New format is easy to parse. -Backward compatible. -Potentially use faster hash method. Proposed solution: Store the repo structure in a canonical tree. Each directory is a tree block. A tree block contains blobs and offsets to sub directories. It has its own checksum. A read/write will be done on tree block base. A tree block also contains an offset points to its previous version (if there's one). The root of offset is stored in trailer, which stays at the end of file. Each update creates a new trailer which points to the new tree. (details below) To save the pain of modify a tree block and track the free spaces in the index, changed blocks are appended at the end. Based on the assumption that user won't change too many files each time, for an updated file, all its parent blocks to Root was copied and appended to index. In other words, all traversed blocks are copied. The offset of previous version of the updated block is stored in the new block. A new generation number is stored in copied and updated blocks. Other blocks are not copied. They are referenced by offsets. After update a new trailer is created at the end. In this way, there's no harm to read, and makes write fast. Trailer stores the offset of previous trailer. It makes tracking old versions easy. Each operation will load and rewrite the header and visited trailer . Checksum for non identifier purpose will use crc32. Otherwise it uses sha1. For compatibility, old format of index will be transformed to new format in the first operation. == Index format: - A 8-byte header consisting of 4-byte signature: The signature is { 'T', 'R', 'E', 'E' } 4-byte version number: The current supported versions are 4. - A number of blocks of variable size 1-byte block type 3-byte content size in byte block content 4-byte crc32 of all above - A 20-byte trailer consisting of 4-byte trailer signature: The signature is { 'R', 'O', 'O', 'T' } 4-byte root block offset 4-byte extension table offset: Zero means no extension 4-byte offset to previous trailer 4-byte checksum: CRC32 of the header and the trailer (excluding this field) == Tree block: Tree block content is basically the list of entries in a specified path, with all attributes we store in the index now. This entry list is sorted by pathname. For doing a bsearch in the list, pathnames are stored at the end of block, which makes the size of entry fixed. The pathname is pointed from each entry with 2 byte offset (relative to a block). This should not be problem as a block is never modified. It stores the generation number and the offset to old block. It also integrates the content of cache-tree. A tree block content consists of - 4-byte tree object size - 20-byte SHA-1 of the cached tree object - checkpoint : interleave with items in the block, 1 for every 100 items 4-byte offset to next checkpoint - a list attributes corresponding to tree object's item, in the same order. These attributes are the same as in DIRC entry format except that entry name is removed, and a tree block offset is added in case the item is a directory. 32-bit ctime seconds, the last time a file's metadata changed this is stat(2) data 32-bit ctime nanosecond fractions this is stat(2) data 32-bit mtime seconds, the last time a file's data changed this is stat(2) data 32-bit mtime nanosecond fractions this is stat(2) data 32-bit dev this is stat(2) data 32-bit ino this is stat(2) data 32-bit mode, split into (high to low bits) 4-bit object type valid values in binary are 1000 (regular file), 1010 (symbolic link) and 1110 (gitlink) 3-bit unused 9-bit unix permission. Only 0755 and 0644 are valid for regular files. Symbolic links and gitlinks have value 0 in this field. 32-bit uid this is stat(2) data 32-bit gid this is stat(2) data 32-bit file size This is the on-disk size from stat(2), truncated to 32-bit. 160-bit SHA-1 for the represented object if blobs or the offset to another tree block if trees A 32-bit 'flags' field split into (high to low bits) 1-bit assume-valid flag 1-bit extended flag (must be zero in version 2) 2-bit stage (during merge) 12-bit name length if the length is less than 0xFFF; otherwise 0xFFF is stored in this field. 1-bit skip-worktree flag (used by sparse checkout) 1-bit intent-to-add flag (used by "git add -N") 14-bit unused, must be zero A 16-bit offset, relative to the beginning of this block, to the pathname of this entry - 2-byte generation number, starts from 1 - 4-byte previous version offset - a list of NUL-terminated pathnames, pointed to from the 16-bit offset above == Extension table block Extension table has block type 'X'. It consists of a series of 4-byte extension block offset. == Extension block Extension block has block type 'E'. Extension content is the same as in the old format. Time line: 24/04 ~ 21/05: get familiar with code base and revise proposal benchmark with linux kernel on major operations write prototype to prove feasibility of proposed solution consult mailing list & irc 22/05 ~ 25/06 write code, test and benchmark modify tree lib modify index format operations modify git operations transform from old to new 26/06 ~ 30/07 revise things according to benchmark 31/07 ~ 13/08 update documentation About me My name is Elton Tian, I am from China. I have been living in Australia for quite a few years. I am currently a Master student from Australia National University. After graduate I worked on linux based web development (using tcl) for 2.5 years. I have been programming with c, c#, java, tcl, php, javascript and shell script. But I prefer c, which gives me the feeling of full control over the program. I am interested in data intensive computing. I played with hadoop and mapreduce since 2010. I maintained my 3 node cluster behind my desk. As linus hates cvs with a passion, I still want to mention I was using cvs in work, don't like it though. And I guess this is the good chance to get over it. On Thu, Apr 5, 2012 at 2:22 AM, elton sky <eltonsky9404@xxxxxxxxx> wrote: > Hello, > > Some updates for Nguyen's index: > > On Wed, Apr 4, 2012 at 10:20 PM, Nguyen Thai Ngoc Duy <pclouds@xxxxxxxxx> wrote: >> On Wed, Apr 4, 2012 at 3:26 PM, elton sky <eltonsky9404@xxxxxxxxx> wrote: >>> I am not sure how the trailer works. >>> I assume there can be multiple trailers, each update will generate a >>> new one. Every trailer will point to the root tree (i.e. all trailers >>> point to the same block?). So if there are some changes to root, like >>> rename, trailers all point to the latest root block? >> >> Each trailer points to the whole new tree. Because trees are >> immutable, changing in a tree meangs creating a new one and will also >> make a new parent tree (to point to the updated tree because old >> parent will always point to old tree). This eventually leads to root >> tree change, recorded by the trailer. >> >>> Is the index looks like : >>> | HEADER | TREE BLOCKS | TRAILER | TREE BLOCKS | TRAILER | TREE >>> BLOCKS | TRAILER | ... >>> > > Once an update happened to a block, all parent blocks to root will be > copied to the end of the index. And the updated block will be at leaf > of the new tree. Finding this path costs logn time anyway. This is no > harm for read for the whole tree. But in order to find all previous > changes to a block, we have to go through all trailers and trees. > > Otherwise, just modify the original tree. Let the parent points to the > updated block and let the updated block points to the old block: > > parent > | > V > updated old old > block (v3) --> block(v2) --> block (v1) > | > V > child > blocks > > A version number in a tree block is used to track the changes. > In this way, there's still no harm to read, and it's more easy to > trace the change history of a block. Also, we don't need to create > interleaved blocks and trailers. There's only one trailer in the end > of file. We also need to add another offset points to previous > version. > > Trailer is kept at the end of index, as its size is variable. It > contains offset to root and list of free spaces. > > Changes to format: > >> = The git index file has the following format > > As is, except there's only one trailer now. And trailer contains list > of free spaces. > >>- A 18-byte trailer consisting of >> >> 4-byte trailer signature: >> The signature is { 'R', 'O', 'O', 'T' } >> >> 2-byte generation: >> The first trailer is 0, the second 1 and so on. >> >> 4-byte root block offset >> >> 4-byte extension table offset: >> Zero means no extension >> > list of free spaces > - 4 byte offset > - 2 byte length > >> 4-byte checksum: >> CRC32 of the header and the trailer (excluding this field) > > Free space list is read/written in whole for each operation, together > with trailer. > >> >> == Tree block >> ... > > Above as is. > > - 1 byte version num > > - 4 byte offset to previous version block > >> >> 160-bit SHA-1 for the represented object if blobs or the offset >> to another tree block if trees >> >> A 32-bit 'flags' field split into (high to low bits) >> >> 1-bit assume-valid flag >> >> 1-bit extended flag (must be zero in version 2) >> >> 2-bit stage (during merge) >> >> 12-bit name length if the length is less than 0xFFF; otherwise 0xFFF >> is stored in this field. >> >> 1-bit skip-worktree flag (used by sparse checkout) >> >> 1-bit intent-to-add flag (used by "git add -N") >> >> 14-bit unused, must be zero >> >> A 16-bit offset, relative to the beginning of this block, to the >> pathname of this entry. FIXME: make it 32-bit, relative to the >> beginning of the file, so that we can reuse pathnames from other >> (old) blocks? >> > > -Elton -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html