Re: Index format v5

Thomas Gummerer <t.gummerer@xxxxxxxxx> · Fri, 4 May 2012 17:44:24 +0200

On 05/04, Nguyen Thai Ngoc Duy wrote:
> On Fri, May 4, 2012 at 12:25 AM, Thomas Gummerer <t.gummerer@xxxxxxxxx> wrote:
> > GIT index format
> > ================
> >
> > = The git index file has the following format
> >
> >  All binary numbers are in network byte order. Version 5 is described
> >  here.
> >   ...
> >   - A number of directory offsets (see below). [1]
> >
> >   - A number of sorted directories (see below). [2]
> >
> >   - 32-bit crc32 checksum for the header, extension offsets and directories.
> 
> So we use one checksum for all dirs? I thought we could do checksum
> per dir, so if I'm interested in path/to/here only, I only need to
> verify data of three directories.

Good point. Not sure how they could exactly be implemented, but probably
one checksum for offset + directory data. I'll definitely think about this.

> > == Directory entry offsets
> >
> >  32-bit offset to the directory.
> >
> >  This part is needed for making the directory entries bisectable and
> >    thus allowing a binary search.
> 
> How is this (I assume) array ordered? The same top-down depth-first
> with "Directory entry" section below? I can see ordering as
> top-down/breadth-first help bsearch though.

True, the breadth-first approach might be better, since we are using
prefix compression for the pathname. It will need some more offsets
(or calculation, but should still be faster)

> > == Directory entry
> >
> >  Directory entries are sorted in lexicographic order by the name
> >  of their path starting with the root.
> >
> >  Path names (variable length) relative to top level directory (without the
> >    leading slash). '/' is used as path separator. '.' indicates the root
> >    directory. The special patch components ".." and ".git" (without quotes)
> >    are disallowed. Trailing slash is also disallowed.
> >
> >  1 nul byte to terminate the path.
> 
> I don't see it mention prefix compression here, nor in "file entry"
> section. Does it use it here? If so I don't think prefix compression
> plays well with bsearch (on path name). In the worst case you may have
> to process up to the first entry in order to get a path name (e.g. a
> directory with entries "a", "aa", "aaa", "aaaa"...)

I planned to use prefix compression here, which would benefit especially
the reader (we're reading more often then writing). By designing the
offsets carefully we should still be able to get log(n) (n = number of
directories in the index) search time for a directory.

> >  The entries are written out in the top-down, depth-first order. The
> >    first entry represents the root level of the repository, followed by
> >    the first subtree - let's call it A - of the root level, followed by
> >    the first subtree of A, ...
> 
> So depth-first traversal becomes natural even without the help of
> directory offset table above. Nice.
> 
> > == File entry
> >
> >  File entries are sorted in ascending order on the name field, after the
> >  respective offset given by the directory entries.
> 
> I wonder if we need to keep file entry table separate from directory
> entry. It feels more natural to put the sequence of file entries of a
> directory right after the directory entry, might help read-ahead too
> during traversal. You save 4 bytes (for file entry offset) in each
> directory entry. You still have file offset table for random access.

The reason for this design choice is the fast searching of a directory, 
(for partial reading or changing a single file in the index). Keeping
them separate also simplifies the reading of the cache-tree, which will
be included in the directory section. Instead of offsets to the first file
we'd need offsets to the next directory to enable fast reading of the
cache-tree.

> >  File name (variable length). Nul bytes are not allowed in file names and
> >    they have no leading slash. They are 7-bit ASCII encoded.
> 
> Why can't it be 8-bit? I suppose file name is also prefix compressed?

I changed that, the file name can have UTF8 or ASCII encoding, as it was
allowed in the old index.

--
Thomas
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html