Re: GSoC - Designing a faster index format

Nguyen Thai Ngoc Duy <pclouds@xxxxxxxxx> · Mon, 2 Apr 2012 19:31:46 +0700

On Mon, Apr 02, 2012 at 09:50:53PM +1000, elton sky wrote:
> Hi Nguyen,
> 
> Still have some questions on your idea:
> 
> On Tuesday, March 27, 2012, Nguyen Thai Ngoc Duy wrote:
> > >> A block starts with a signature (a tree block, or an extension...). A
> > >> tree block consists of:
> > >>
> > >>  - uint32_t tree object's size
> > >>  - sha-1 of tree object
> > >>  - crc32 of the rest of the block except tree object
> > >>  - maybe reference counter of a block can be refered by many blocks??
> > >>  - tree object (i.e. something that tree-walk.c can parse)
> > >
> > > Do you mean each block contains a tree and all its blobs? So the tree
> > > object here, effectively a dir, also contains files in the dir ? In
> > > this way, some blocks can be very big.
> >
> > No, the tree object contains pathname, mode and SHA-1 of its entries,
> > one level only (try "git ls-tree HEAD"). If an entry is a directory
> > and we have not built it yet, we won't have its sha-1, so it will be
> > zero (similar to invalid cache-tree).
> >
> 
> Correct me if I am wrong, I assume:
> * Although you only listed attributes in tree block, in index, we have
> both tree and blob block.

In current index, tree is implied in path names. We only store blob sha-1.

> * Sha1 of a tree block is computed by hashing all the Sha1s of blobs
> and sub trees in the directory.
> * Like cache tree struct, each tree object contains list of offsets to
> children blocks.
> 
> How do you store blob blocks ? Is each blob object a block? just like
> a tree object is a block? If so, each blob contains a sha1, the
> overhead is high?
> 
> If we compute hash for a tree, and this tree happen to have 500
> children do I have to access all 500 blocks to get their Sha1s?

We don't store blobs in index. We only need their sha-1, which is
computed and content stored in object database at "git add".

By the way, I revised my new index format a little bit, see the end of
this email. It may work, or may not. Food for thoughts.

> Also I ran some quick test on git-add over kernel 2.6. When I do "time
> git add .":time git add .
> 
> cmd_add: validate_pathspec takes : 0 ms
> read_index_from: xmmap&close takes : 0 ms
> read_index_from: verify_hdr takes : 26 ms
> read_index_from: create inmem struct takes : 4 ms
> read_index: read_index_from takes : 31 ms
> read_directory: qsort takes : 0 ms
> fill_directory: read_directory takes : 97 ms
> cmd_add: prune dir takes : 0 ms
> cmd_add: add_files_to_cache takes : 37 ms
> cmd_add: add_files takes : 0 ms
> 
> real 0m0.172s
> user 0m0.120s
> sys 0m0.050s
> 
> And when I ran "time git add arch/ia64" :
> 
> cmd_add: validate_pathspec takes : 0 ms
> read_index_from: xmmap&close takes : 0 ms
> read_index_from: verify_hdr takes : 20 ms
> read_index_from: create inmem struct takes : 4 ms
> read_index: read_index_from takes : 25 ms
> read_directory: read_directory_recursive takes : 10 ms
> read_directory: qsort takes : 0 ms
> fill_directory: read_directory takes : 10 ms
> cmd_add: fill_directory takes : 10 ms
> cmd_add: prune dir takes : 0 ms
> cmd_add: add_files_to_cache takes : 1 ms
> 
> real 0m0.043s
> user 0m0.040s
> sys 0m0.000s
> 
> In both cases, the time for sha1 is quite stable (~20ms).
> fill_directory drops as I specify a sub directory, which make sense.
> As Junio suggested, the sha1 time (verify_hdr) is a mix a read and
> sha1. And this part is our focus to optimize, isn't it?

I think so. But until we can read just parts of index, we still have
to verify integrity for the whole index, which takes more or less the
same amount of time you see and should be proportional to index size
(or the number of entries in index). Either we shrink the index, or go
with cheaper checksum, or both.

> But, as growth
> of the whole repo, the processing time is getting dominated by
> fill_directory (if we use '.', it takes 97ms) rather than verify_hdr.

{read,fill}_directory is not always used (for example, "git diff" does
not need it). Meanwhile, as working directory grows, index size grows,
verify_hdr() will take longer.

> In current system, The time complexity for fill_directory is nlogn (n
> is number of objects). git recursively go thru sub directories and
> files in it and check against current index. When searching index, it
> uses binary search which makes it lg(n). If this is the case, will use
> a producer/consumer model help?

I think fill_directory is dominated by kernel time (read_dir,
stat...), there's little thing we can do there. Anyway I'm pretty sure
fill_directory is out of scope. It's just one of the code that uses
index.

-- 8< --
GIT index format
================

This format replaces the old "DIRC" format. Compared to the old
format, which is essentially a sorted list of pathnames, this one:

 - is tree-based
 - use crc32 as checksum
 - only verify integrity on parts that git accesses, instead of whole
   file
 - append changes to the end
 - allow index versioning

Updates can be made directly to the index by appending to the end. The
index traversed by locating the root tree block from the trailer. When
a path is updated, all related tree blocks are updated and appended to
the end, then a new trailer (with generation increased by one) is
written to conclude the index.

The index size will increase continuously. At some point, we will need
to repack it. Let assume a tree block is 64k on average and a path
generally consists of 3 path components.  That means an entry update
adds 192k and we can do about 80 updates before index reaches 16M (in
addition to initial index size).

At 16M or when trailer generation hits a limit (the limit can be
configurable), we rewrite the index to reduce its size. Some heavy
operations can also be used to rewrite index, such as checkout or
reset.

The index integrity is verified by crc32. One crc32 covers header and
trailer. Each block has its own crc32. When the index is found
corrupt, we could try to roll back to latest good version by looking
for trailers from bottom up. Even when the index is not corrupt, users
can still look back this way for older index versions.

= The git index file has the following format

   - A 8-byte header consisting of

     4-byte signature:
       The signature is { 'T', 'R', 'E', 'E' }

     4-byte version number:
       The current supported versions are 1.

   - A number of blocks of variable size

      1-byte block type

      3-byte content size in byte

      block content

      4-byte crc32 of all above

   - A 18-byte trailer consisting of

      4-byte trailer signature:
        The signature is { 'R', 'O', 'O', 'T' }

      2-byte generation:
         The first trailer is 0, the second 1 and so on.

      4-byte root block offset

      4-byte extension table offset:
        Zero means no extension

      4-byte checksum:
        CRC32 of the header and the trailer (excluding this field)

== Tree block

  A tree block contains a (maybe invalid) tree object and extra
  information of its companion in working directory. Tree block has
  block type 'T'.

  Tree block content is basically the list of non-recursive entries in
  specified path, with all attributes we store in the index now. There
  are a few changes though to intergrate cache-tree and allow
  bsearch() on mmap'd block.

  A tree block content consists of

  - 4-byte tree object size

  - 20-byte SHA-1 of the cached tree object

  - a list attributes corresponding to tree object's item, in the same
    order.  These attributes are the same as in DIRC entry format
    except that entry name is removed, and a tree block offset is
    added in case the item is a directory.

    32-bit ctime seconds, the last time a file's metadata changed
      this is stat(2) data

    32-bit ctime nanosecond fractions
      this is stat(2) data

    32-bit mtime seconds, the last time a file's data changed
      this is stat(2) data

    32-bit mtime nanosecond fractions
      this is stat(2) data

    32-bit dev
      this is stat(2) data

    32-bit ino
      this is stat(2) data

    32-bit mode, split into (high to low bits)

      4-bit object type
        valid values in binary are 1000 (regular file), 1010 (symbolic link)
        and 1110 (gitlink)

      3-bit unused

      9-bit unix permission. Only 0755 and 0644 are valid for regular files.
      Symbolic links and gitlinks have value 0 in this field.

    32-bit uid
      this is stat(2) data

    32-bit gid
      this is stat(2) data

    32-bit file size
      This is the on-disk size from stat(2), truncated to 32-bit.

    160-bit SHA-1 for the represented object if blobs or the offset
      to another tree block if trees

    A 32-bit 'flags' field split into (high to low bits)

      1-bit assume-valid flag

      1-bit extended flag (must be zero in version 2)

      2-bit stage (during merge)

      12-bit name length if the length is less than 0xFFF; otherwise 0xFFF
      is stored in this field.

      1-bit skip-worktree flag (used by sparse checkout)

      1-bit intent-to-add flag (used by "git add -N")

      14-bit unused, must be zero

    A 16-bit offset, relative to the beginning of this block, to the
      pathname of this entry. FIXME: make it 32-bit, relative to the
      beginning of the file, so that we can reuse pathnames from other
      (old) blocks?

  - a list of NUL-terminated pathnames, pointed to from the 16-bit offset
    above. This list does not have to be of the same order as the attribute
    list. The reason this is separated from the attribute list is to make
    attribute list fixed size, searchable using bsearch().

== Extension table block

 Extension table has block type 'X'. It consists of a series of 4-byte
 extension block offset.

== Extension block

 Extension block has block type 'E'. Extension content is the same as
 in the old format.
-- 8< --
--
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html