Re: GSoC - Designing a faster index format

elton sky <eltonsky9404@xxxxxxxxx> · Mon, 2 Apr 2012 21:50:53 +1000

Hi Nguyen,

Still have some questions on your idea:

On Tuesday, March 27, 2012, Nguyen Thai Ngoc Duy wrote:
>
> On Tue, Mar 27, 2012 at 10:20 AM, elton sky <eltonsky9404@xxxxxxxxx> wrote:
> > On Tue, Mar 27, 2012 at 3:19 AM, Nguyen Thai Ngoc Duy <pclouds@xxxxxxxxx> wrote:
> >> The header consists of crc32 and three uint32_t, one points to the
> >> root tree, one the first extension block, the last one the free list
> >> at the end of the file. The rest of the file contains sizable blocks.
> >> There can be free space between them. Free spaces (offset and size)
> >> are recorded at the end of the file, pointed in header. The header's
> >> crc32 covers the header and free list.
> >
> > How do you record free spaces at the end of file? Are you gonna have a
> > fixed size for the index and reserve space for free spaces offsets.
>
> A list of (offset,size) with (0,0) to terminate. No index can shrink
> or expand at will. Free list is always at the end of the index.
>
> >> A block starts with a signature (a tree block, or an extension...). A
> >> tree block consists of:
> >>
> >>  - uint32_t tree object's size
> >>  - sha-1 of tree object
> >>  - crc32 of the rest of the block except tree object
> >>  - maybe reference counter of a block can be refered by many blocks??
> >>  - tree object (i.e. something that tree-walk.c can parse)
> >
> > Do you mean each block contains a tree and all its blobs? So the tree
> > object here, effectively a dir, also contains files in the dir ? In
> > this way, some blocks can be very big.
>
> No, the tree object contains pathname, mode and SHA-1 of its entries,
> one level only (try "git ls-tree HEAD"). If an entry is a directory
> and we have not built it yet, we won't have its sha-1, so it will be
> zero (similar to invalid cache-tree).
>

Correct me if I am wrong, I assume:
* Although you only listed attributes in tree block, in index, we have
both tree and blob block.
* Sha1 of a tree block is computed by hashing all the Sha1s of blobs
and sub trees in the directory.
* Like cache tree struct, each tree object contains list of offsets to
children blocks.

How do you store blob blocks ? Is each blob object a block? just like
a tree object is a block? If so, each blob contains a sha1, the
overhead is high?

If we compute hash for a tree, and this tree happen to have 500
children do I have to access all 500 blocks to get their Sha1s?

>
> >>  - other index attributes, stored separately in the same order as in
> >> tree object above, uint32_t block offset of subdirectories.
> >
> > There can be many sub dirs some times. But maybe not a prob.
> >
> > As tree object and  offset of subdirectories are variables, how do you
> > make a block resizable?
>
> If there are free space right after it, it can be expanded. Otherwise
> we need to move the block elsewhere and update its parent about its
> new offset, then mark where the block was as free space.
> --
> Duy

Also I ran some quick test on git-add over kernel 2.6. When I do "time
git add .":time git add .

cmd_add: validate_pathspec takes : 0 ms
read_index_from: xmmap&close takes : 0 ms
read_index_from: verify_hdr takes : 26 ms
read_index_from: create inmem struct takes : 4 ms
read_index: read_index_from takes : 31 ms
read_directory: qsort takes : 0 ms
fill_directory: read_directory takes : 97 ms
cmd_add: prune dir takes : 0 ms
cmd_add: add_files_to_cache takes : 37 ms
cmd_add: add_files takes : 0 ms

real 0m0.172s
user 0m0.120s
sys 0m0.050s

And when I ran "time git add arch/ia64" :

cmd_add: validate_pathspec takes : 0 ms
read_index_from: xmmap&close takes : 0 ms
read_index_from: verify_hdr takes : 20 ms
read_index_from: create inmem struct takes : 4 ms
read_index: read_index_from takes : 25 ms
read_directory: read_directory_recursive takes : 10 ms
read_directory: qsort takes : 0 ms
fill_directory: read_directory takes : 10 ms
cmd_add: fill_directory takes : 10 ms
cmd_add: prune dir takes : 0 ms
cmd_add: add_files_to_cache takes : 1 ms

real 0m0.043s
user 0m0.040s
sys 0m0.000s

In both cases, the time for sha1 is quite stable (~20ms).
fill_directory drops as I specify a sub directory, which make sense.
As Junio suggested, the sha1 time (verify_hdr) is a mix a read and
sha1. And this part is our focus to optimize, isn't it? But, as growth
of the whole repo, the processing time is getting dominated by
fill_directory (if we use '.', it takes 97ms) rather than verify_hdr.
In current system, The time complexity for fill_directory is nlogn (n
is number of objects). git recursively go thru sub directories and
files in it and check against current index. When searching index, it
uses binary search which makes it lg(n). If this is the case, will use
a producer/consumer model help?

Cheers,
Elton
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html