On Thu, May 10, 2018 at 8:09 AM, Nguyễn Thái Ngọc Duy <pclouds@xxxxxxxxx> wrote:
> The current document mentions OBJ_* constants without their actual
> values. A git developer would know these are from cache.h but that's
> not very friendly to a person who wants to read this file to implement
> a pack file parser.
>
> Similarly, the deltified representation is not documented at all (the
> "document" is basically patch-delta.c). Translate that C code to
> English with a bit more about what ofs-delta and ref-delta mean.
>
> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@xxxxxxxxx>
> ---
> This is a much better description than v1. I hope.
>
> Documentation/technical/pack-format.txt | 78 +++++++++++++++++++++++++
> cache.h | 5 ++
> 2 files changed, 83 insertions(+)
>
> diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
> index 8e5bf60be3..d20bf592aa 100644
> --- a/Documentation/technical/pack-format.txt
> +++ b/Documentation/technical/pack-format.txt
> @@ -36,6 +36,84 @@ Git pack format
>
> - The trailer records 20-byte SHA-1 checksum of all of the above.
>
> +=== Object types
> +
> +Valid object types are:
> +
> +- OBJ_COMMIT (1)
> +- OBJ_TREE (2)
> +- OBJ_BLOB (3)
> +- OBJ_TAG (4)
> +- OBJ_OFS_DELTA (6)
> +- OBJ_REF_DELTA (7)
> +
> +Type 5 is reserved for future expansion. Type 0 is invalid.
> +
> +=== Deltified representation
> +
> +Conceptually there are only four object types: commit, tree, tag and
> +blob. However to save space, an object could be stored as a "delta" of
> +another "base" object. These representations are assigned new types
> +ofs-delta and ref-delta, which is only valid in a pack file.

...only valid... as opposed to loose objects or as opposed to
referencing cross-packs? I would think the former, not the latter.

> +Both ofs-delta and ref-delta store the "delta" against another
> +object. The difference between them is, ref-delta directly encodes
> +20-byte base object name. If the base object is in the same pack,
> +ofs-delta encodes the offset of the base object in the pack instead.

Reading this paragraph clears up the question from before.
The ref delta is a delta to another "reference by hash id (sha1)".
What abbreviation is OFS? OFfSet ?

> +The delta data is a sequence of instructions to reconstruct an object
> +from the base object.

As said before the base object is of type 1..4, we do not
"delta-on-delta" yet, but to construct the object we have to create
the base object first, which itself can be represented as a deltified
object leading to a delta chain.

> Each instruction appends more and more data to
> +the target object until it's complete. There are two supported
> +instructions so far: one for copy a byte range from the source object
> +and one for inserting new data embedded in the instruction itself.

ok. So there are 2 types of instructions, "copy from (offset, size)"
and "new data follows".

The next paragraphs seem to describe the copy instruction, maybe
add a sub-headline here?

> +Each instruction has variable length. Instruction type is determined
> +by the seventh bit of the first octet. The following diagrams follow
> +the convention in RFC 1951 (Deflate compressed data format).
> +
> + +----------+---------+---------+---------+---------+-------+-------+-------+
> + | 1xxxxxxx | offset1 | offset2 | offset3 | offset4 | size1 | size2 | size3 |
> + +----------+---------+---------+---------+---------+-------+-------+-------+
> +
> +This is the instruction format to copy a byte range from the source
> +object. It encodes the offset to copy from any the number of bytes to
> +copy. Offset and size are in little-endian order.
> +
> +All offset and size bytes are optional. This is to reduce the
> +instruction size when encoding small offsets or sizes. The first seven
> +bits in the first octet determines which of the next seven octets is
> +present. If bit zero is set, offset1 is present. If bit one is set
> +offset2 is present and so on.
> +
> +Note that a more compact instruction does not change offset and size
> +encoding. For example, if only offset2 is omitted like below, offset3
> +still contains bits 16-23. It does not become offset2 and contains
> +bits 8-15 even if it's right next to offset1.
> +
> + +----------+---------+---------+
> + | 10000101 | offset1 | offset3 |
> + +----------+---------+---------+

It reads very fluently to here.

> +In its most compact form, this instruction only takes up one byte
> +(0x80) with both offset and size omitted, which will have default
> +values zero. There is another exception: size zero is automatically
> +converted to 0x10000.

This "another exception" sounds a bit tacked on, but is still
understandable. I would imagine that the size of 0 is used frequently
to copy large blocks and coincidentally it is represented using the
lowest number of bytes for size. Cute!

Before the next diagram we could have a sub-headline, indicating
that the other instruction "new data follows" will now be described.

> + +----------+============+
> + | 0xxxxxxx |    data    |
> + +----------+============+
> +
> +This is the instruction to construct target object without the base
> +object. The following data is appended to the target object. The first
> +seven bits of the first octet determines the size of data in
> +bytes. The size must be non-zero.

This command sounds very easy. However we can have at most 127 bytes
of new data, so if someone adds a larger part of new code, we'd have
many "insert new data" instructions, all at the max of 127, such that
the overhead for instruction bytes is 1/127 = 0.7 %. Sounds efficient.

> + +----------+============
> + | 00000000 |
> + +----------+============
> +
> +This is the instruction reserved for future expansion.

Thanks for pointing this out.

>
> +/*
> + * Values in this enum (except those outside the 3 bit range) are part
> + * of pack file format. See Documentation/technical/pack-format.txt
> + * for more information.
> + */

Makes sense.

I really like this patch very much. Thanks for writing it.
My annotations are just to add the cherry onto the cake,
the current form is
Reviewed-by: Stefan Beller <sbeller@xxxxxxxxxx>

Thanks!
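
For readers who want to check their understanding of the two
instructions quoted above, here is a minimal sketch of a decoder in
Python. It is an illustration written from the patch text, not a
translation of patch-delta.c. The function name
apply_delta_instructions is hypothetical, and the sketch assumes it is
handed only the instruction stream (anything that precedes the
instructions in real delta data, such as the source and target sizes,
is assumed to have been consumed already).

```python
def apply_delta_instructions(base: bytes, instructions: bytes) -> bytes:
    """Rebuild a target object from `base` and a delta instruction stream.

    Hypothetical helper for illustration; assumes `instructions` holds
    only the copy/insert instructions described in the patch.
    """
    out = bytearray()
    pos = 0
    while pos < len(instructions):
        cmd = instructions[pos]
        pos += 1
        if cmd & 0x80:
            # Copy instruction: bits 0-3 select offset1..offset4,
            # bits 4-6 select size1..size3. Each byte keeps its fixed
            # little-endian position even when a neighbour is omitted.
            offset = 0
            size = 0
            for i in range(4):
                if cmd & (1 << i):
                    offset |= instructions[pos] << (8 * i)
                    pos += 1
            for i in range(3):
                if cmd & (0x10 << i):
                    size |= instructions[pos] << (8 * i)
                    pos += 1
            if size == 0:
                # Size zero is converted to 0x10000.
                size = 0x10000
            if offset + size > len(base):
                raise ValueError("copy range exceeds base object")
            out += base[offset:offset + size]
        elif cmd:
            # Insert instruction: the low seven bits give the number of
            # literal bytes that follow (1..127).
            if pos + cmd > len(instructions):
                raise ValueError("truncated insert data")
            out += instructions[pos:pos + cmd]
            pos += cmd
        else:
            # 0x00 is reserved for future expansion.
            raise ValueError("instruction 0x00 is reserved")
    return bytes(out)
```

As a usage example, with base b"hello world" the stream
90 05, 02 2c 20, 91 06 05 (hex) copies "hello", inserts ", " and then
copies "world". The 10000101 case from the diagram reads offset1 into
bits 0-7 and offset3 into bits 16-23, leaving bits 8-15 zero.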