Re: [PATCH] Document pack v4 format

Junio C Hamano <gitster@xxxxxxxxx> · Tue, 27 Aug 2013 11:25:02 -0700

Nguyễn Thái Ngọc Duy  <pclouds@xxxxxxxxx> writes:

> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@xxxxxxxxx>
> ---
>  For my education but may help people who are interested in the
>  format. Most is gathered from commit messages, except the delta tree
>  entries.

Thanks.

> diff --git a/Documentation/technical/pack-format-v4.txt b/Documentation/technical/pack-format-v4.txt

In the final version it may be a good idea to either have this
together with the documentation for the existing pack-formats, or
add a reference from the documentation for the existing formats to
point at this new file saying "for v4 see ...".

> new file mode 100644
> index 0000000..9123a53
> --- /dev/null
> +++ b/Documentation/technical/pack-format-v4.txt
> @@ -0,0 +1,110 @@
> +Git pack v4 format
> +==================
> +
> +== pack-*.pack files have the following format:
> +
> +   - A header appears at the beginning and consists of the following:
> +
> +     4-byte signature:
> +	  The signature is: {'P', 'A', 'C', 'K'}
> +
> +     4-byte version number (network byte order): must be version
> +     number 4
> +
> +     4-byte number of objects contained in the pack (network byte
> +     order)
> +
> +   - (20 * nr_objects)-byte SHA-1 table: sorted in memcmp() order.
> +
> +   - Commit name dictionary: the uncompressed length in variable
> +     encoding, followed by zlib-compressed dictionary. Each entry
> +     consists of two prefix bytes storing timezone followed by a
> +     NUL-terminated string.

The log and code use different names to call this thing.  "commit
name" is misleading (e.g. it is not "commit object name", but "names
recorded in commit objects"; it is not only for "committer" names,
but also applies to authors; it is not just names but also emails
and TZ used).  Perhaps a better name would be "ident" table, as we
use the word "ident" only to refer to data to refer to people who
are recorded on either author/committer/tagger lines of the objects?

> +     (undeltified representation)
> +     n-byte type and length (4-bit type, (n-1)*7+4-bit length)
> +     [uncompressed data]
> +     [compressed data]

These two lines are not useful; it is better spelled as [data
specific to object type] as you have to enumerate what are stored
and how for each type separately anyway.

> +=== Tree representation
> +
> +  - n-byte type and length (4-bit type, (n-1)*7+4-bit length)
> +
> +  - Number of trees in variable length encoding
> +
> +  - A number of trees, each consists of

The above "number of trees" sounds both wrong; aren't they the
number of "tree entries" (that can be blobs or subtrees) this tree
object records?

> +    Path component reference: an index, in variable length encoding,
> +    into tree path dictionary, which also covers entry mode.
> +
> +    SHA-1 in SHA-1 reference encoding.
> +
> +Path component reference zero is an indicator of deltified portion and
> +has the following format:
> +
> +  - path component reference: zero
> +
> +  - index of the entry to copy from, in variable length encoding
> +
> +  - number of entries in variable length encoding
> +
> +  - base tree in SHA-1 reference encoding
> +
> +=== SHA-1 reference encoding
> +
> +This encoding is used to encode SHA-1 efficiently if it's already in
> +the SHA-1 table. It starts with an index number in variable length
> +encoding. If it's not zero, its value minus one is the index in the
> +SHA-1 table. If it's zero, 20 bytes of SHA-1 is followed.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html