Re: Index format v5

Nguyen Thai Ngoc Duy <pclouds@xxxxxxxxx> · Fri, 4 May 2012 20:20:38 +0700

On Fri, May 4, 2012 at 12:25 AM, Thomas Gummerer <t.gummerer@xxxxxxxxx> wrote:
> GIT index format
> ================
>
> = The git index file has the following format
>
>  All binary numbers are in network byte order. Version 5 is described
>  here.
>   ...
>   - A number of directory offsets (see below). [1]
>
>   - A number of sorted directories (see below). [2]
>
>   - 32-bit crc32 checksum for the header, extension offsets and directories.

So we use one checksum for all directories? I thought we could have a
checksum per directory, so that if I'm only interested in path/to/here,
I only need to verify the data of three directories.
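
To make the per-directory idea concrete, here is a rough sketch. It is
not part of the proposal: struct dir_record and verify_dirs() are
made-up names, and I assume each directory record would carry its own
crc32 (the usual reflected polynomial, as in zlib's crc32()).

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32, reflected polynomial 0xEDB88320 (same result as zlib). */
static uint32_t crc32_buf(const void *data, size_t len)
{
	const unsigned char *p = data;
	uint32_t crc = 0xFFFFFFFFu;
	size_t i;
	int bit;

	for (i = 0; i < len; i++) {
		crc ^= p[i];
		for (bit = 0; bit < 8; bit++)
			crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
	}
	return crc ^ 0xFFFFFFFFu;
}

/* Hypothetical view of one directory record with its own checksum. */
struct dir_record {
	const unsigned char *data; /* raw bytes of the directory entry */
	size_t len;                /* length of those bytes */
	uint32_t stored_crc;       /* checksum stored next to the entry */
};

/*
 * Verify only the records passed in, e.g. the directories on the way
 * to path/to/here. Returns 0 if all match, -1 on the first mismatch.
 */
static int verify_dirs(const struct dir_record *dirs, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (crc32_buf(dirs[i].data, dirs[i].len) != dirs[i].stored_crc)
			return -1;
	return 0;
}
```

The cost would be 4 extra bytes per directory, in exchange for not
having to checksum every directory when only a few are read.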

> == Directory entry offsets
>
>  32-bit offset to the directory.
>
>  This part is needed for making the directory entries bisectable and
>    thus allowing a binary search.

How is this array (I assume it is one) ordered? In the same top-down,
depth-first order as the "Directory entry" section below? I can see a
top-down, breadth-first ordering helping bsearch, though.
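
For reference, this is the kind of lookup I imagine the offset table
is for. It is only a sketch under my own assumptions: the table is an
array of 32-bit big-endian offsets, each offset points at a
NUL-terminated path, and the slots are sorted in strcmp() order of
those paths. find_dir() and get_be32() are names I made up here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a 32-bit value stored in network byte order. */
static uint32_t get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Binary search for "path" among nr directories.
 * base:         start of the area the offsets are relative to
 * offset_table: nr 32-bit big-endian offsets, sorted by path name
 * Returns the slot index, or -1 if the path is not present.
 */
static long find_dir(const unsigned char *base,
		     const unsigned char *offset_table,
		     size_t nr, const char *path)
{
	size_t lo = 0, hi = nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		const char *name =
			(const char *)base + get_be32(offset_table + 4 * mid);
		int cmp = strcmp(path, name);

		if (!cmp)
			return (long)mid;
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return -1;
}
```

Note that plain strcmp() order on full paths is not the same thing as
depth-first order: "a-b" sorts before "a/b" because '-' is 0x2D and
'/' is 0x2F, so the ordering of the table needs to be spelled out.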

> == Directory entry
>
>  Directory entries are sorted in lexicographic order by the name
>  of their path starting with the root.
>
>  Path names (variable length) relative to top level directory (without the
>    leading slash). '/' is used as path separator. '.' indicates the root
>    directory. The special path components ".." and ".git" (without quotes)
>    are disallowed. Trailing slash is also disallowed.
>
>  1 nul byte to terminate the path.

I don't see prefix compression mentioned here, nor in the "File entry"
section. Is it used here? If so, I don't think prefix compression
plays well with bsearch (on path name). In the worst case you may have
to process everything back to the first entry to get one path name
(e.g. a directory with entries "a", "aa", "aaa", "aaaa"...).
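
A small sketch of why a single bsearch probe can get expensive. The
encoding here is my own assumption for illustration, not something
from the proposal: each entry stores how many bytes to keep from the
previous name, plus a suffix. struct pc_entry and pc_name() are
made-up names.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical prefix-compressed entry. */
struct pc_entry {
	size_t keep;        /* bytes reused from the previous name */
	const char *suffix; /* NUL-terminated remainder of the name */
};

/*
 * Rebuild the full name of entry idx into out.
 * Returns the number of entries that had to be processed,
 * or 0 if out is too small.
 */
static size_t pc_name(const struct pc_entry *e, size_t idx,
		      char *out, size_t outsz)
{
	size_t start = idx, i;

	/* Walk back to the nearest entry that stores its name in full. */
	while (start > 0 && e[start].keep > 0)
		start--;

	/* Replay every entry from there up to idx. */
	for (i = start; i <= idx; i++) {
		size_t slen = strlen(e[i].suffix);

		if (e[i].keep + slen + 1 > outsz)
			return 0;
		memcpy(out + e[i].keep, e[i].suffix, slen + 1);
	}
	return idx - start + 1;
}
```

With entries "a", "aa", "aaa", "aaaa" every entry reuses the whole
previous name, so getting the last name means replaying all four
entries, and each bsearch probe degrades the same way.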

>  The entries are written out in the top-down, depth-first order. The
>    first entry represents the root level of the repository, followed by
>    the first subtree - let's call it A - of the root level, followed by
>    the first subtree of A, ...

So depth-first traversal becomes natural even without the help of the
directory offset table above. Nice.

> == File entry
>
>  File entries are sorted in ascending order on the name field, after the
>  respective offset given by the directory entries.

I wonder if we need to keep the file entry table separate from the
directory entries. It feels more natural to put the sequence of file
entries of a directory right after its directory entry, which might
also help read-ahead during traversal. You save 4 bytes (the file entry
offset) in each directory entry, and you still have the file offset
table for random access.

>  File name (variable length). Nul bytes are not allowed in file names and
>    they have no leading slash. They are 7-bit ASCII encoded.

Why can't it be 8-bit? I suppose the file name is also prefix compressed?
-- 
Duy