Re: [PATCH/RFC v2] Document format of basic Git objects

Junio C Hamano <gitster@xxxxxxxxx> · Sun, 19 Feb 2012 00:39:41 -0800

Nguyễn Thái Ngọc Duy <pclouds@xxxxxxxxx> writes:

> Still draft for discussion. Of three people who participated on this
> thread, two favor a man page (me and Jonathan), one technical/
> (Junio), so let's put it as a man page for now.

Personally I do not have strong preference either way.

The original motivation of technical/ was that we wanted to have a place
to keep documentation that would help ourselves, the people who write the
internals of git, even though we did not yet know and did not want to have
to decide if it is a good idea to expose the end users, who may not care
about the gory details of the internal, with reams of such documents.

>  - Not sure if we fix the order of gpgsig and mergetag, or they can be
>    mixed together. Also not sure if we can have multiple gpgsig.

You can merge a signed tag and then sign the resulting commit yourself,
and the order of the mixing would not matter. Technically a gpgsig is a
signature over the commit object payload without gpgsig lines, so you
could have two or more gpgsigs on the same commit object but from a larger
workflow point of view it would not be so useful, as it would involve
steps like this:

 * You prepare a commit object, you may perhaps sign it yourself;

 * You expose this commit object to chosen others from whom you want their
   signature on it;

 * They sign it with "commit -S --amend", but when they do so they make
   sure the resulting commit has the same committer/author header as the
   original.  Note that the resulting commits will all have different
   object name, as the object name is over all payload including their
   gpgsigs.

 * You grab the gpgsig lines from these commits, paste them into the
   header part of the original, and then re-hash the result with
   "hash-object -w -t commit".  The result will have all valid gpgsigs
   over the payload in the commit without its gpgsig lines, because the
   gpgsig lines from all the signers were generated that way.

 * Then you give the general public the resulting commit.

>  - I skipped the experimental loose object format (it's what it's
>    called in sha1_file.c). I think we can call it deprecated and move
>    on.

Good.

>  - Do we assume tag/commit header in utf-8 or ascii?

Author-ident is typically utf-8 already, so you cannot assume "ASCII".

> +Object SHA-1
> +~~~~~~~~~~~~
> +An object SHA-1 is calculated on its header and payload. The content
> +to be consumed by SHA-1 calculation is:
> +
> +- Object type in ascii, either "commit", "tree", "tag" or "blob"
> +  (without quotes)
> +- One space (ASCII SP)
> +- The payload length in ascii canonical decimal format

"canonical" may make it sound as if the document is more formal, but then
you would have to define what is canonical and what is not somewhere else,
so I would suggest dropping it.

	The length of the payload in bytes, represented as a decimal integer.

Also if you spell ASCII, consistently spell it in all-caps.        

> +- ASCII NUL
> +- Object payload

----------------------------------------------------------------

> +BLOB OBJECTS
> +------------
> +Blob object payload is file data.

What's the significance of saying "file data" here?  In a document that
describes the structure, saying "is uninterpreted sequence of bytes" is
more accurate (the important point is that git does not care what it is)
and covers cases where blob was recorded with "hash-object -w --stdin"
where no such "file data" has ever existed in a 'file". Also a blob may
record contents of a symbolic link ;-).

> +TREE OBJECTS
> +------------
> +Tree object payload contains a list of entries, each with a mode,
> +object type, object name, and filename, sorted by filename. It
> +represents the contents of a single directory tree.

Drop "object type," from this list.  It is inferred from the mode.  I
personally would prefer to say "path" or "pathname" when the entity
referred to may not be a regular file.  I am not sure the last sentence is
necessary, but if you must say something, say "It represents a
directory". It is by definition redundant to say that a tree represents a
"tree".  Replace the above with something line this:

	... entries, each with a mode, object name and path. The type of
	the object is encoded in the "mode":

        - 100644 or 100755: the object is a "blob" that records the
          contents of a regular non-executable or executable file,
          respectively, that exists at the path.

	- 120000: the object is a "blob" that records the contents of a
          symbolic link that exists at the path.

	- 40000: the object is a "tree" that represents a subdirectory
          that exists at the path.

	- 160000: the object is a "commit" that records the state of a
          submodule that exists at the path.

> +The object type may be a blob, representing the contents of a file,
> +another tree, representing the contents of a subdirectory, or a commit
> +(representing a subproject).

and drop the above line.

> +Since trees and blobs, like all other
> +objects, are named by a hash of their contents, two trees have the
> +same object name if and only if their contents (including,
> +recursively, the contents of all subdirectories) are identical. This
> +allows git to quickly determine the differences between two related
> +tree objects, since it can ignore any entries with identical object
> +names.

It does not make sense to say 'trees and blobs' when you explain that a
single top-level tree object defines the entire tree's state.  Just say
'trees'.  I know you would say "I wanted to say if tree A and tree B are
the same except for the content of a single blob recorded at path P, the
result of hash for A and B would be different", but the same can be said
for a submodule, so singling out 'blob' is incomplete.  Also these trees
may record the same set of blobs but tree B may record what tree A had at
path P at path Q, so it is not like the only thing that matter in the tree
is the object names.

I personally do not think it is necessary to have the above paragraph at
all in this object.

> +Note that the files all have mode 644 or 755: git actually only pays
> +attention to the executable bit.

Saying 644 or 755 here is misleading as it does not match any reality
(except for very early incarnation of git).  By rewriting the first
paragraph, these two lines can be safely eliminated.

> +Encoding
> +~~~~~~~~

"Encoding" is such a loaded word and does not help clarify what this
section is really about, which is "format of a tree entry", or simply
"Entries".

> +Entries are of variable length and self-delimiting. Each entry
> +consists of
> +
> +- a POSIX file mode in octal ascii representation, no 0 padding to the
> +  left

This is not "a POSIX file mode" at all.  The mode in a tree entry was
modelled after that, but there is no need to mention it, especially
because POSIX does not define the exact bit assignment for types (the
permission are defined from S_IXOTH to S_IRWXU and S_ISUID/S_ISGID with
exact bit locations) and because of S_IFGITLINK which is clearly not
POSIX.  As we have enumerated them in the first paragraph,

	The "mode" (see above).

is sufficient here.

> +- exactly one space (ASCII SP)
> +- filename for the entry, as a NUL-terminated string

Again, "pathname" or just "path" for this entire document.

> +- 20-byte binary object name
> +
> +The mode should be 100755 (executable file), 100644 (regular file),
> +120000 (symlink), 40000 (subdirectory), or 160000 (subproject), with
> +no leading zeroes. Modes with one leading zero and the synonym 100664
> +for 100644 are also accepted for historical reasons. Other modes are
> +not accepted.

This is made redundant by the first paragraph above.

> +The filename may be an arbitrary nonempty string of bytes, as long as
> +it contains no '/' or NUL character.

s/, as long as it contains no/; it cannot contain any/

> +The associated object must be a valid blob if the mode indicates a
> +file or symlink, tree if it indicates a subdirectory, or commit if it
> +indicates a subproject. The blob associated to a symlink entry
> +indicates the link target and its content not have any embedded NULs.

I doubt that we should even mention "and its content not have ...".  It is
for readlink(2) and symlink(2) to decide.

> +Sorting
> +~~~~~~~
> +Entries are sorted by memcmp(3) on file name. No duplicate file names
> +allowed.

A sentence without a verb seen at the end of this paragraph.

> +COMMIT OBJECT
> +-------------
> +The commit object links a physical state of a tree with a description
> +of how we got there and why.

What is the intended audience and the purpose of this document?  If this
were to strictly define and describe the "structure", then "and why" is
inappropriate.  It is merely the best-current-practice at the human level
to describe the "why" in their commit log messages---it does not break the
structure if nobody explains "why".

On the other hand, "how we got there" is a good phrase to explain that by
refering to its immediate parents, all the previous histories are also
described.

> +... Commit object payload contains the
> +associated tree SHA-1, parent commits's SHA-1, author and comitter
> +information.

s/.$/, among other things./; as the log message is also part of the
payload.

Start by labeling what the large block of example you are going to throw
at the reader here.

> +------------------------------------------------
> +$ git cat-file commit 81d48f0aee54
> +tree 093f37084c133795e4ce71befa57185328737171
> +parent f5e4e20faa1eee3feaa0394897bbd1aca544e809
> +parent 661db794eb8179c7bea02f159bb691a2fff4a8e0
> +parent 14c173eb63432ba5d0783b6c4b23a8fe0c76fb0f
> +author Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> 1326576355 -0800
> +committer Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> 1326576355 -0800
> +mergetag object 661db794eb8179c7bea02f159bb691a2fff4a8e0
> + type commit
> + tag devicetree-for-linus
> + tagger Grant Likely <grant.likely@xxxxxxxxxxxx> 1326520038 -0700
> + 
> + 2nd set of device tree changes for v3.3
> + -----BEGIN PGP SIGNATURE-----
> + Version: GnuPG v1.4.11 (GNU/Linux)
> + 
> + iQIcBAABAgAGBQJPERbzAAoJEEFnBt12D9kBmDIP/R9Vspc6yhjSAEvdp/VET2gi
> + TgAQfdp4VuYjjIt4cUPO5UQU9kw478GjTuP2blZEC9DlG1jSf/L8U+A7FHJIVVzU

Elide the above like so:

        -----BEGIN PGP SIGNATURE-----
        Version: GnuPG v1.4.11 (GNU/Linux)

        iQIcBAABAgAGBQJPERbzAAoJEEFnBt12D9kBmDIP/R9Vspc6yhjSAEvdp/VET2gi
        TgAQfdp4VuYjjIt4cUPO5UQU9kw478GjTuP2blZEC9DlG1jSf/L8U+A7FHJIVVzU
	...
        =mup8
        -----END PGP SIGNATURE-----

> +Merge tags 'devicetree-for-linus' and 'spi-for-linus' of git://git.secretlab.ca/git/linux-2.6
> +
> +2nd set of device tree changes and SPI bug fixes for v3.3
> +
> +* tag 'devicetree-for-linus' of git://git.secretlab.ca/git/linux-2.6:
> +  of/irq: Add interrupts-names property to name an irq resource
> +  of/address: Add reg-names property to name an iomem resource
> +
> +* tag 'spi-for-linus' of git://git.secretlab.ca/git/linux-2.6:
> +  spi/tegra: depend instead of select TEGRA_SYSTEM_DMA
> +------------------------------------------------
> +
> +More precisely, a commit object begins with of one or more lines
> +delimited by ASCII LF. The end of the header is signalled by an empty
> +line. Any remaining text after the empty line forms the commit

Drop "More precisely, ".  Also notice that you abruptly said "end of the
header" without mentioning anything about "header" in the previous
sentence.

	A commit object begins with the "header" part, that consists of
	one or more lines delimited by LF, and the "body" part, that
	records the commit log message.  The first empty line delimits the
	header and the body.

> +The header must not contain NUL.

I vaguely recall that you made sure neither the header nor the body
contains NUL.

> +A "continuation line" in the header begins with an SP. The remainder
> +of the line, after removing that SP, is concatenated to the previous
> +line, while retaining the LF at the end of the previous line.
> +
> +When a line in the header begins with a letter other than SP, and has
> +at least one SP in it, it is called a "field". A field consists of the
> +"field name", which is the string before the first SP on the line, and
> +its "value", which is everything after that SP. When the value
> +consists of multiple lines, continuation lines are used.
> +
> +More than one field with the same name can appear in the header of an
> +object, and the order in which they appear is significant. A commit
> +object can contain these fields in the listed order:

s/can contain/contains/; as you are marking optional ones with "zero or".

> +1. one "tree" field with the 40-character textual object name of the
> +   associated tree object
> +2. zero or more "parent" fields, each with 40-character textual object
> +   name of the parent commit object
> +3. one "author" field with an ident string
> +4. one "committer" field with an ident string
> +5. zero or one "encoding" field with an ascii string

s/zero or one/optionally, one/ (not a strong preference--I just felt that
would be easier to read).

	After the above fields, other fields may follow, and new types of
	fields may be added in later versions of git.  Example of these
	optional fields are:

	- "mergetag" that copies the contents of a signed tag on one of
          the parent commit;
        - "gpgsig" that records a GPG signature for this commit object.

> +6. zero or more "mergetag" fields with associated tag object content
> +7. zero or one "gpgsig" field with gpg signature content

and exclude these two from the numbering above to make it clear they are
optional.

> +Ident strings
> +~~~~~~~~~~~~~
> +Ident strings record who's responsible of doing something at what
> +time. For a commit, the ident string in "author" line records who is
> +the author of the associated changes and when the changes are

s/are/were/, perhaps?  Again, what the purpose of this document?  If this
were more than to strictly describe the "structure", it is OK and even
preferable to leave the meaning the "author" as vague, but if this were
also to suggest the best current practice interpretation, it may be worth
to add something like

	There may be a case where it is difficult to attribute a commit to
        a single author; think of it as recording the primary contact, the
        person to ask any questions about the commit if needed later.

> +made. The ident string in "committer" line records who commits the

s/commits/committed/, perhaps?

> +changes to the repository and at what time.
> +
> +An ident string consists of an email address and a timestamp. More
> +precisely:

s/of an email/of a name, an email/;
s/. More precisely:/:/;

> +1. Optionally, a name
> +2. An email address wrapped around by `<` and `>`, followed by one
> +   space (ASCII SP)

The above makes it sound as if "A U Thor<author@xxxxxxxxxx>" is usual and
valid.  How about 

	1. A name, followed by one ASCII SP

and after this enumeration, say something like:

	Name may be missing in commit objects produced by repository
	conversion from other SCMs that do not have it.  Name and email
	are typically encoded in UTF-8.

even though I am not sure the last sentence should be in this document.

> +3. The number of seconds since Epoch (00:00:00 UTC, January 1, 1970)
> +   followed by a space (ASCII SP)
> +4. Timezone: either plus or minus sign, followed by 4 decimal digits
> +
> +Name and email are encoded in UTF-8 and must must not contain ASCII
> +NUL characters.

Drop " and must must ...characters"; you already said that the header does
not have any NUL.  As I already said, I am not sure if you should mention
"UTF-8" at all in this document.

> +Commit encoding
> +~~~~~~~~~~~~~~~
> +Encoding field describes that encoding that the commit message is
> +encoded in.

s/that encoding that/the character encoding in which/;
s/encoded in/recorded/;

> +... Encoding names must be recognized by iconv(3). By default,
> +commit message is in UTF-8. It's discouraged to use encodings that can
> +generate ASCII NUL characters.

Here we would probably want to have a paragraph each for "mergetag" and
"gpgsig".

> +TAG OBJECTS
> +-----------
> +Tag object payload contains an object, object type, tag name, the name
> +of the person ("tagger") who created the tag, and a message, which may
> +contain a signature.

s/a signature/a signature at the end/;

> +------------------------------------------------
> +$ git cat-file tag v1.5.0
> +object 437b1b20df4b356c9342dac8d38849f24ef44f27
> +type commit
> +tag v1.5.0
> +tagger Junio C Hamano <junkio@xxxxxxx> 1171411200 +0000
> +
> +GIT 1.5.0
> +-----BEGIN PGP SIGNATURE-----
> +Version: GnuPG v1.4.6 (GNU/Linux)
> +
> +iD8DBQBF0lGqwMbZpPMRm5oRAuRiAJ9ohBLd7s2kqjkKlq1qqC57SbnmzQCdG4ui
> +nLE/L9aUXdWeTFPron96DLA=
> +=2E+0
> +-----END PGP SIGNATURE-----
> +------------------------------------------------
> +
> +Tag object format resembles commit format. A tag commit may have the
> +following fields in listed order:
> +
> +1. one "object" field with 40-character textual object name of the
> +   tagged object
> +2. one "type" field with type of the tagged object ("commit", "tag",
> +   "blob", or "tree")
> +3. one "tag" field with the name of the tag
> +4. one "tagger" with an ident string
> +
> +New kinds of fields may be added in later versions of git.
> +
> +Any remaining text after the header forms the tag message. Tag message
> +has no specified encoding. Anything that does not contain ASCII NUL
> +characters are accepted.
> +
> +The object field must point to a valid object of type indicated by the
> +type field. The tag name can be an arbitrary string without NUL bytes
> +or embedded newlines; in practice it usually follows the restrictions
> +described in linkgit:git-check-ref-format[1].

A description of how the signature part is formed needs to come here.

> +GIT
> +---
> +Part of the linkgit:git[1] suite
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html