Nguyễn Thái Ngọc Duy <pclouds@xxxxxxxxx> writes:

> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@xxxxxxxxx>
> ---
>  For my education but may help people who are interested in the
>  format. Most is gathered from commit messages, except the delta tree
>  entries. Thanks.
>
> diff --git a/Documentation/technical/pack-format-v4.txt b/Documentation/technical/pack-format-v4.txt

In the final version it may be a good idea to either have this
together with the documentation for the existing pack formats, or
add a reference from the documentation for the existing formats that
points at this new file, saying "for v4 see ...".

> new file mode 100644
> index 0000000..9123a53
> --- /dev/null
> +++ b/Documentation/technical/pack-format-v4.txt
> @@ -0,0 +1,110 @@
> +Git pack v4 format
> +==================
> +
> +== pack-*.pack files have the following format:
> +
> +   - A header appears at the beginning and consists of the following:
> +
> +     4-byte signature:
> +         The signature is: {'P', 'A', 'C', 'K'}
> +
> +     4-byte version number (network byte order): must be version
> +     number 4
> +
> +     4-byte number of objects contained in the pack (network byte
> +     order)
> +
> +   - (20 * nr_objects)-byte SHA-1 table: sorted in memcmp() order.
> +
> +   - Commit name dictionary: the uncompressed length in variable
> +     encoding, followed by zlib-compressed dictionary. Each entry
> +     consists of two prefix bytes storing timezone followed by a
> +     NUL-terminated string.

The log and the code use different names for this thing. "Commit
name" is misleading: it is not the "commit object name" but the
"names recorded in commit objects"; it is not only for "committer"
names but also applies to authors; and it is not just names but also
the emails and the TZ used. Perhaps a better name would be the
"ident" table, as we use the word "ident" only to refer to the data
about people recorded on the author/committer/tagger lines of
objects?
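For readers following along, the fixed-size header and the SHA-1 table
quoted above can be sketched in Python roughly as follows. This is only
an illustration of the layout as the draft describes it, not code from
the series; the function name parse_pack_v4_header and its return shape
are made up for this example.

```python
import struct


def parse_pack_v4_header(data):
    """Read the 12-byte header and the SHA-1 table that follows it.

    Hypothetical helper: returns (nr_objects, sha1_table, offset), where
    offset is the position of the first byte after the SHA-1 table.
    """
    if len(data) < 12:
        raise ValueError("truncated header")
    # 4-byte signature, then two 4-byte integers in network byte order.
    signature, version, nr_objects = struct.unpack_from(">4sII", data, 0)
    if signature != b"PACK":
        raise ValueError("bad signature")
    if version != 4:
        raise ValueError("not a v4 pack")

    table_end = 12 + 20 * nr_objects
    if len(data) < table_end:
        raise ValueError("truncated SHA-1 table")
    sha1_table = [bytes(data[i:i + 20]) for i in range(12, table_end, 20)]

    # Python compares bytes objects bytewise as unsigned values, which
    # gives the same ordering as memcmp() for equal-length strings.
    if any(a > b for a, b in zip(sha1_table, sha1_table[1:])):
        raise ValueError("SHA-1 table is not sorted")
    return nr_objects, sha1_table, table_end
```

The returned offset is where the dictionary described next would begin;
decoding that part is left out here because the quoted text does not
spell out the variable length encoding.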
> +     (undeltified representation)
> +     n-byte type and length (4-bit type, (n-1)*7+4-bit length)
> +     [uncompressed data]
> +     [compressed data]

These two lines are not useful; it is better spelled as

	[data specific to object type]

as you have to enumerate what is stored, and how, for each type
separately anyway.

> +=== Tree representation
> +
> +  - n-byte type and length (4-bit type, (n-1)*7+4-bit length)
> +
> +  - Number of trees in variable length encoding
> +
> +  - A number of trees, each consists of

Both uses of "number of trees" above sound wrong; aren't they the
number of "tree entries" (which can be blobs or subtrees) that this
tree object records?

> +    Path component reference: an index, in variable length encoding,
> +    into tree path dictionary, which also covers entry mode.
> +
> +    SHA-1 in SHA-1 reference encoding.
> +
> +Path component reference zero is an indicator of deltified portion and
> +has the following format:
> +
> +  - path component reference: zero
> +
> +  - index of the entry to copy from, in variable length encoding
> +
> +  - number of entries in variable length encoding
> +
> +  - base tree in SHA-1 reference encoding
> +
> +=== SHA-1 reference encoding
> +
> +This encoding is used to encode SHA-1 efficiently if it's already in
> +the SHA-1 table. It starts with an index number in variable length
> +encoding. If it's not zero, its value minus one is the index in the
> +SHA-1 table. If it's zero, 20 bytes of SHA-1 is followed.
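The SHA-1 reference encoding described in the last quoted section could
be decoded along these lines. This is a rough Python sketch under
stated assumptions: the quoted text does not define the variable length
encoding, so read_varint below is a stand-in (7 bits per byte, least
significant group first, high bit meaning "more bytes follow") that may
well differ from what the v4 code does, and both function names are
made up for illustration.

```python
def read_varint(buf, pos):
    """Stand-in varint reader; the real v4 encoding may differ (assumption)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, pos


def decode_sha1_ref(buf, pos, sha1_table, read_index=read_varint):
    """Decode one SHA-1 reference starting at buf[pos].

    Returns (sha1, new_pos). read_index is pluggable so that the stand-in
    varint reader can be swapped for the actual encoding.
    """
    index, pos = read_index(buf, pos)
    if index != 0:
        # Non-zero: the value minus one is the index into the SHA-1 table.
        return sha1_table[index - 1], pos
    # Zero: the 20 raw bytes of the SHA-1 follow inline.
    sha1 = bytes(buf[pos:pos + 20])
    if len(sha1) != 20:
        raise ValueError("truncated inline SHA-1")
    return sha1, pos + 20
```

With a two-entry table, the single byte 0x02 would resolve to the
second table entry, while a zero byte followed by 20 bytes yields those
20 bytes as the SHA-1 and consumes 21 bytes in total.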