Re: [PATCH/RFC] Document format of basic Git objects

Junio C Hamano <gitster@xxxxxxxxx> · Wed, 15 Feb 2012 11:48:09 -0800

Nguyễn Thái Ngọc Duy  <pclouds@xxxxxxxxx> writes:

> This is just a draft text with a bunch of fixmes. But I'd like to hear
> from the community if this is a worthy effort. If so, then whether
> git-cat-file is a proper place for it. Or maybe we put relevant text
> in commit-tree, write-tree and mktag, then refer to them in cat-file
> because cat-file can show raw objects.
>
> So comments?

This _only_ describes the payload (i.e. without the 'blob <size>\0' header
used in loose objects; in other words, what read_object() may return).
There should be a sentence to stress this.  As many Git intros (including
my book) begin with the "a short header 'blob <size>\0' concatenated with
the contents is hashed to compute the object name" picture, it would be
confusing unless you explicitly say that you are only describing the
"contents" part.
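
To make the "header plus contents" picture concrete, here is a minimal
Python sketch of how an object name is computed. It is an illustration,
not Git's code: the helper name object_name() is made up, and it assumes
the loose-object header has the form "<type> <size>" terminated by a NUL
byte, with SHA-1 as the hash.

```python
import hashlib


def object_name(obj_type: str, payload: bytes) -> str:
    """Return the hex object name for a payload of the given type.

    Assumption: the header is "<type> <size>" followed by one NUL byte,
    and the name is the SHA-1 of header + payload.
    """
    header = f"{obj_type} {len(payload)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


if __name__ == "__main__":
    # The payload is only the "contents" part; the header is added here.
    print(object_name("blob", b"hello\n"))
```

The payload passed in is exactly the part that the patch documents; the
header never appears in it.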

It makes sense to mention somewhere in the documentation that the cat-file
subcommand is used to obtain this raw data, but I would say the content
of this patch belongs somewhere in Documentation/technical/.

>  PS. This also makes me wonder if tag object supports "encoding".

I do not think so.

> +OBJECT FORMAT
> +-------------
> +
> +Tree object consists of a series of tree entries sorted in memcmp()
> +order by entry name. Each entry consists of:
> +
> +- POSIX file mode encoded in octal ascii

Add ", no 0 padding to the left" at the end, as I heard that every
imitation of Git gets this wrong in its first version.

> +- One space character
> +- Entry name terminated by one character NUL
> +- 20 byte SHA-1 of the entry
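
The entry layout quoted above, including the unpadded octal mode, could be
sketched in Python roughly as follows. The function names format_entry()
and parse_tree() are hypothetical, and the sketch deliberately does not
attempt the sort order; it only shows how one entry is laid out and read
back.

```python
def format_entry(mode: int, name: bytes, sha1_hex: str) -> bytes:
    """Serialize one tree entry: octal mode, SP, name, NUL, 20 raw bytes."""
    # format(..., "o") emits no leading zero, e.g. 0o40000 -> "40000".
    return (format(mode, "o").encode("ascii") + b" " + name + b"\0"
            + bytes.fromhex(sha1_hex))


def parse_tree(payload: bytes):
    """Split a tree payload into (mode, name, sha1_hex) tuples."""
    entries = []
    pos = 0
    while pos < len(payload):
        sp = payload.index(b" ", pos)        # end of the mode
        nul = payload.index(b"\0", sp + 1)   # end of the entry name
        sha1 = payload[nul + 1:nul + 21]     # 20 raw bytes, not hex
        if len(sha1) != 20:
            raise ValueError("truncated tree entry")
        entries.append((payload[pos:sp].decode("ascii"),
                        payload[sp + 1:nul],
                        sha1.hex()))
        pos = nul + 21
    return entries
```

Note that the name is found by scanning for the NUL, while the SHA-1 is
taken by length, because the raw 20 bytes may themselves contain NUL or SP.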

> +Tag object is ascii plain text in a format similar to email format
> +(RFC 822). ...

Do not mention "email format (RFC 822)" at all.  The differences are
significant enough that it only confuses the readers.

We do not have a colon at the end of the field name, we do not promise to
parse field names case insensitively, and the way continuation lines are
parsed is totally different.  A "similar" construct in RFC 2822 is "folded
header lines", but it is signalled by "folding white space", and it
discards the end-of-line from the previous line to make the result a
single logical line.  Our continuation lines are introduced by a single SP,
and the result of concatenation keeps the end-of-line from the previous
lines, making the result multiple lines.
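
The difference between the two schemes can be shown with a small Python
sketch. Both functions are illustrative helpers with made-up names; each
takes header lines with the trailing end-of-line already stripped, and the
"xyzzy" line is an invented example.

```python
def unfold_rfc2822(lines):
    """RFC 2822 style: drop the line break in front of folding white space,
    keep the white space itself, and end up with one logical line."""
    out = []
    for line in lines:
        if out and line[:1] in (" ", "\t"):
            out[-1] += line
        else:
            out.append(line)
    return out


def join_git_continuation(lines):
    """Git style: strip the single leading SP of a continuation line and
    keep the LF of the previous line, so the value stays multi-line."""
    out = []
    for line in lines:
        if out and line.startswith(" "):
            out[-1] += "\n" + line[1:]
        else:
            out.append(line)
    return out


if __name__ == "__main__":
    sample = ["xyzzy first", " second", " third"]
    print(unfold_rfc2822(sample))         # one entry on a single line
    print(join_git_continuation(sample))  # one entry spanning three lines
```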

Also we do not promise that the lines in the header part are always
<field,value> pairs.  So rephrase this while carefully distinguishing
between "a line in header" and "field".

    A commit or a tag object begins with the "header" that consists of one
    or more lines delimited by LF. The end of the header is signalled by
    an empty line.

    A "continuation line" in the header begins with a SP.  The remainder
    of the line, after removing that SP, is concatenated to the previous
    line, while retaining the LF at the end of the previous line.

    When a line in the header begins with a character other than SP, and has
    at least one SP in it, it is called a "field".  A field consists of
    the "field name", which is the string before the first SP on the line,
    and its "value", which is everything after that SP.  When the value
    consists of multiple lines, continuation lines are used.

    More than one field with the same name can appear in the header of an
    object, and the order in which they appear is significant.

    In a commit object, the header begins with the following fields that
    have such and such meaning.

    In a tag object, the header begins with the following fields...

    After these defined fields, newer versions of git may add more lines
    in the header. Some of them may be fields, others might not be.
    Implementations that parse commit and tag objects must ignore lines in
    the header that they do not understand without triggering an error.
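
As a rough illustration of the wording proposed above (and not of what the
current C code does), a tolerant header parser could look like this in
Python. The name parse_header() and the "xyzzy" and "lonelyline" lines in
the usage below are invented for the example.

```python
def parse_header(payload: bytes):
    """Split a commit or tag payload into header entries and body.

    Returns (entries, body). Each entry is (name, value) for a field, or
    (None, line) for a header line that is not a field. Order is kept and
    repeated field names are allowed.
    """
    # The first empty line ends the header.
    header, _, body = payload.partition(b"\n\n")
    lines = header.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # header without a body, ending in a bare LF

    entries = []
    for line in lines:
        if line.startswith(b" ") and entries:
            # Continuation: drop the SP, keep the LF of the previous line.
            entries[-1][1] += b"\n" + line[1:]
            continue
        name, sp, value = line.partition(b" ")
        if sp and name:
            entries.append([name, value])  # a field
        else:
            entries.append([None, line])   # some other header line
    return [tuple(e) for e in entries], body


if __name__ == "__main__":
    raw = (b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
           b"parent " + b"1" * 40 + b"\n"
           b"parent " + b"2" * 40 + b"\n"
           b"xyzzy line one\n line two\n"
           b"lonelyline\n"
           b"\n"
           b"message\n")
    entries, body = parse_header(raw)
    # A caller picks the fields it knows and silently skips the rest.
    parents = [value for name, value in entries if name == b"parent"]
    print(parents, body)
```

A caller that only looks up the names it understands never has to raise an
error for the unknown "xyzzy" field or the non-field line.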

>  ... It consists of a header and a body, separated by a blank
> +line. The header includes exactly four fields in the following order:
> +

If you hand-craft a tag-like object that has an unknown field after these
four, how badly do the current implementations behave?

> +1. "object" field, followed by SHA-1 in ascii of the tagged object
> +2. "type" field, followed by the type in ascii of the tagged object
> +   (either "commit", "tag", "blob" or "tree" without quotes,
> +   case-sensitive)
> +3. "tag" field, followed by the tag name
> +4. "tagger" field, followed by the <XXX, to be named>

> +The tag body contains the tag's message and possibly GPG signature.
> +
> +Commit object is in similar format to tag object. The commit body is

It is strange that you introduce tag and then commit.  I would think that
readers expect to see them presented in the usual blob/tree/commit/tag
order.

> +in plain text of the chosen encoding (by default UTF-8). The commit
> +header has the following fields in listed order
> +
> +1. One "tree" field, followed by the commit's tree's SHA-1 in ascii
> +2. Zero, one or more "parent" field
> +3. One "author" field, in <XXX to be named> format
> +3. One "committer" field, in <XXX to be named> format
> +4. Optionally one "encoding" field, followed by the encoding used for
> +   commit body
> +5. GPG signature (fixme)
> +
> +More headers after these fields are allowed. Unrecognized header
> +fields must be kept untouched if the commit is rewritten.

Replace the first sentence with "New kinds of fields may be added in later
versions of git." and drop the second one entirely.  Depending on the
reason and nature of the "rewrite", we may or may not want to keep these
unknown header lines, so it is best to leave the behaviour unspecified.
For example, it makes sense to retain "mergetag" because it is about the
parent, not the resulting commit.  It does not make sense to keep "gpgsig"
because it is about the commit you are rewriting, and the rewrite
invalidates that old signature.