Re: [PATCH/RFC] Document format of basic Git objects

Jonathan Nieder <jrnieder@xxxxxxxxx> · Wed, 15 Feb 2012 11:31:27 -0600

Hi,

Nguyễn Thái Ngọc Duy wrote:

> Basic objects' format is pretty simple and (I think) well-known.
> However it's good that we document them. At least we can keep track of
> the evolution of an object format. The commit object, for example,
> over the years has learned "encoding" and recently GPG signing.

Yes, I agree.

> This is just a draft text with a bunch of fixmes. But I'd like to hear
> from the community if this is a worthy effort. If so, then whether
> git-cat-file is a proper place for it. Or maybe we put relevant text
> in commit-tree, write-tree and mktag, then refer to them in cat-file
> because cat-file can show raw objects.

About where to place this text, I am of two minds.

1. On one hand, from the user's perspective it would be most intuitive
to place it in a separate git-object(5) manual page.  That way,
gitrepository-layout(5), git-fsck(1), git-hash-object(1), the user
manual, and so on would all have one document to link to.

2. On the other hand, from a development perspective I suspect it
would be valuable to put it in the git-fsck(1) page, since that would
have two consequences:

 - when changing the documentation, this would provide a reminder to
   update fsck.c at the same time

 - when changing fsck.c, this would provide a reminder to update the
   documentation at the same time

Ok, (2) was tongue in cheek. :)  I believe this information belongs in
a dedicated page with a name like gitobject(5), and that you are right
to put it in user-visible documentation instead of hiding it in
Documentation/technical, since it is information needed if one is to
use "git hash-object -w" correctly.

Ok, on to the text itself.

[...]
> --- a/Documentation/git-cat-file.txt
> +++ b/Documentation/git-cat-file.txt
> @@ -100,6 +100,46 @@ for each object specified on stdin that does not exist in the repository:
>  <object> SP missing LF
>  ------------
>  
> +OBJECT FORMAT
> +-------------
> +
> +Tree object consists of a series of tree entries sorted in memcmp()
> +order by entry name.

Missing article ("A tree object", "The tree object", or "Each tree
object").  More importantly, the curious reader might want to know
whether a tree object is supposed to contain entries pointing to other
tree objects for subdirectories or whether the subdirectory's
information is included inline like in the index.

I guess I would expect something like (stealing from the user manual):

	TREE OBJECTS
	------------
	A tree object contains a list of entries, each with a mode,
	object type, object name, and filename, sorted by filename.  It
	represents the contents of a single directory tree.

	The object type may be a blob, representing the contents of a
	file, another tree, representing the contents of a subdirectory,
	or a commit (representing a subproject).  Since trees and blobs,
	like all other objects, are named by a hash of their contents,
	two trees have the same object name if and only if their
	contents (including, recursively, the contents of all
	subdirectories) are identical.  This allows git to quickly
	determine the differences between two related tree objects,
	since it can ignore any entries with identical object names.

	Note that the files all have mode 644 or 755: git actually only
	pays attention to the executable bit.

	Encoding
	~~~~~~~~
	Entries are of variable length and self-delimiting.  Each entry
	consists of

	- a POSIX file mode in octal representation
	- exactly one space (ASCII SP)
	- filename for the entry, as a NUL-terminated string
	- 20-byte binary object name

	The mode should be 100755 (executable file), 100644 (regular
	file), 120000 (symlink), 40000 (subdirectory), or 160000
	(subproject), with no leading zeroes.  Modes with one leading
	zero and the synonym 100664 for 100644 are also accepted for
	historical reasons.

	The filename may be an arbitrary nonempty string of bytes, as
	long as it contains no '/' or NUL character.

	The associated object must be a valid blob if the mode indicates
	a file or symlink, tree if it indicates a subdirectory, or
	commit if it indicates a subproject.  The blob associated to a
	symlink entry indicates the link target and its content not
	have any embedded NULs.

By the way, git fsck seems to tolerate the old "flat tree" format
(i.e., that condition is FSCK_WARN and not FSCK_ERROR), but I don't
see any code supporting it elsewhere in git.  Bug?

	Sorting
	~~~~~~~
	... no duplicates, sort order, etc ...

[...]
> +Tag object is ascii plain text in a format similar to email format
> +(RFC 822). It consists of a header and a body, separated by a blank
> +line.

The above description makes me worry that the reader might try some
things that are allowed by RFC 822: rearranging header fields,
continuation lines, and so on.

>        The header includes exactly four fields in the following order:
> +
> +1. "object" field, followed by SHA-1 in ascii of the tagged object
> +2. "type" field, followed by the type in ascii of the tagged object
> +   (either "commit", "tag", "blob" or "tree" without quotes,
> +   case-sensitive)
> +3. "tag" field, followed by the tag name
> +4. "tagger" field, followed by the <XXX, to be named>
> +
> +The tag body contains the tag's message and possibly GPG signature.

This part looks good.  Stealing from the user manual again, maybe:

	TAG OBJECTS
	-----------
	A tag object contains an object, object type, tag name, the name
	of the person ("tagger") who created the tag, and a message,
	which may contain a signature.

	------------------------------------------------
	$ git cat-file tag v1.5.0
	object 437b1b20df4b356c9342dac8d38849f24ef44f27
	type commit
	tag v1.5.0
	tagger Junio C Hamano <junkio@xxxxxxx> 1171411200 +0000

	GIT 1.5.0
	-----BEGIN PGP SIGNATURE-----
	Version: GnuPG v1.4.6 (GNU/Linux)

	iD8DBQBF0lGqwMbZpPMRm5oRAuRiAJ9ohBLd7s2kqjkKlq1qqC57SbnmzQCdG4ui
	nLE/L9aUXdWeTFPron96DLA=
	=2E+0
	-----END PGP SIGNATURE-----
	------------------------------------------------

	More precisely, a tag contains at least five lines:

	1. "object", followed by a space, followed by the 40-character
	   textual object name of the tagged object
	2. "type" + SP + the type of the tagged object ("commit", "tag",
	   "blob", or "tree")
	3. "tag" + SP + the name of the tag
	4. "tagger" + SP + an ident string
	5. a blank line

	Any remaining text after these lines forms the tag message.

	The object field must point to a valid object of type indicated
	by the type field.  The tag name can be an arbitrary string
	without NUL bytes or embedded newlines; in practice it usually
	follows the restrictions described in git-check-ref-format(1).

[...]
> +
> +Commit object is in similar format to tag object. The commit body is
> +in plain text of the chosen encoding (by default UTF-8). The commit
> +header has the following fields in listed order

Same considerations apply here --- I'd suggest stealing text from the
commit-object section of the user manual and from commit logs.

Hope that helps,
Jonathan

> +
> +1. One "tree" field, followed by the commit's tree's SHA-1 in ascii
> +2. Zero, one or more "parent" field
> +3. One "author" field, in <XXX to be named> format
> +3. One "committer" field, in <XXX to be named> format
> +4. Optionally one "encoding" field, followed by the encoding used for
> +   commit body
> +5. GPG signature (fixme)
> +
> +More headers after these fields are allowed. Unrecognized header
> +fields must be kept untouched if the commit is rewritten. However, a
> +compliant Git implementation produces the above header fields only.
> +
>  GIT
>  ---
>  Part of the linkgit:git[1] suite
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html