Good documentation, but a few nitpicks need addressing before it hits
the Documentation/technical/ part of the source tree.

> <TREE_ENTRIES>
> # Tree entries are sorted by the byte sequence that comprises
> # the entry name.
> : ( <TREE_ENTRY> )*
> ;

Not quite.  An entry for a subtree is sorted as if a '/' is
suffixed to its name.

    $ git ls-tree $T
    100644 blob 2398e9f8892812607f5eee6ed0d5712c2e3de197    a-
    100644 blob 7f07527a80bd8c2b1c5087d7ccfe61073b068374    a-b
    040000 tree 23fddf6a57ff3ba98aa93fb71431276c3f1a3c40    a
    100644 blob 2afe6dcc5466068b8dcc7263cece05d2adf044fe    a=
    100644 blob efc73add7dd868242a66faf2a59b145f2a60b834    a=b

This is, by the way, consistent with the order of cache entries
in the index file.

    $ git ls-files -s
    100644 2398e9f8892812607f5eee6ed0d5712c2e3de197 0       a-
    100644 7f07527a80bd8c2b1c5087d7ccfe61073b068374 0       a-b
    100644 0ee729686ab2a0074639c5f64930648571e7c4b2 0       a/b
    100644 2afe6dcc5466068b8dcc7263cece05d2adf044fe 0       a=
    100644 efc73add7dd868242a66faf2a59b145f2a60b834 0       a=b

> <TREE_ENTRY>
> # The type of the object referenced MUST be appropriate for
> # the mode. Regular files and symbolic links reference a BLOB
> # and directories reference a TREE.
> : <OCTAL_MODE> <SP> <NAME> <NUL> <BINARY_OBJ_ID>
> ;

As you correctly explain later, OCTAL_MODE must be minimal;
"git ls-tree" output says 040000 in the above example, but the
actual object records it as 40000.

> <TAG_CONTENTS>
> : "object" <SP> <HEX_OBJ_ID> <LF>
>   "type" <SP> <NONTAG_OBJ_TYPE> <LF>
>   "tag" <SP> <TAG_NAME> <LF>
>   <LF>
>   <DATA>
> ;

A tag can tag another tag (think of a chain of trust), so what
follows "type" does not have to be NONTAG_OBJ_TYPE.

> <OCTAL_MODE>
> # Octal encoding, without prefix, of the file system object
> # type and permission bits. The bit layout is according to the
> # POSIX standard, with only regular files, directories, and
> # symbolic links permitted. The actual permission bits are
> # all zero except for regular files.
> # The only permission bit
> # of any consequence to Git is the owner executable bit. By
> # default, the permission bits for files will be either 0644
> # or 0755, depending on the owner executable bit.
> ;

It's not really "by default" -- more like "by definition", since
there is no way for the program to use something different.  We
used to record non-canonical modes in ancient versions of git,
but I think fsck-objects would warn on objects created that way.

> <NONTAG_OBJ_TYPE>
> : "BLOB"
> | "TREE"
> | "COMMIT"
> ;

Drop this definition, and make the literals part of <OBJ_TYPE>,
after lowercasing them ;-).

> <OBJ_TYPE>
> : <NONTAG_OBJ_TYPE>
> | "TAG"
> ;

> PACK FILE
> ---------
> # The name of a pack file is "pack-${PACK_ID}.pack", where ${PACK_ID}
> # is the hexidecimal encoding (lower case) of the SHA-1 digest of the
> # sorted list of binary object IDs in the pack file without a separator
> # between the object IDs. Initially, the ${PACK_ID} for a pack was not
> # defined, making the value effectively random.

Although the core level does not really care, a PACK_ID is
required to be a unique (within an object store and its
alternates) 40-byte hexadecimal string for the http commit
walker to work properly.  BTW, I still have a patch to tighten
the check to enforce this as part of the consistency check.

> <PACKED_OBJECT_DATA>
> : _deflate_( <DATA> )
> | <BINARY_OBJ_ID> _deflate_( <DELTA_DATA> )
> ;

It might be cleaner to separate this definition into two.  That
is, one packed object is either a non-delta-type base128
type-length followed by deflated data, or a delta-type base128
type-length followed by the base object id followed by the
deflated delta.

> PACK INDEX
> ----------
> # The name of a pack file index is "pack-${PACK_ID}.idx", where
> # ${PACK_ID} is the hexidecimal encoding (lower case) of the SHA-1
> # digest of the sorted list of binary object IDs in the pack file
> # without a separator between the object IDs.
> # Initially, the ${PACK_ID}
> # for a pack was not defined, making the value effectively random.

I would not repeat the ", where ${PACK_ID} is..." part, which
was already given in the description of the pack file.  Rather,
", where ${PACK_ID} is the same as that of the .pack file the
.idx file corresponds to" would be more appropriate.

> <INDEX_PARTIAL_COUNT>
> # 32 bit, network byte order, binary integer of the count of
> # objects in the pack file with the first byte of the object
> # ID less than or equal to the index of the count, starting
> # from zero.
> ;

Linus and I call this part "fan-out".

> <ENTRY_NAME>
> # File system entity name. Path is normalized and relative to
> # the working directory.
> ;

Did you mention that the index entries are sorted by name?

> <INDEX_EXTENSION_NAME>
> # 4 byte sequence identifying how the <INDEX_EXTENSION_DATA>
> # should be interpreted. The first byte having a value greater
> # than or equal to the ASCII character 'A' (0x41) and less than
> # or equal to the ASCII character 'Z' (0x5a).
> ;

This is not true, but the code needs better comments.

The intention is that an extended section whose name starts with
a capital letter (such as the "cache-tree extension", whose name
is "TREE") is purely optional; if a different version of the
software does not understand it, that version can still safely
keep using the rest of the index.

If somebody introduces a new extended section that _must_ be
interpreted in order to fully understand what the index file
records, such an extended section can signal that by having a
name that does not start with a capital letter.  A version of
the software that does understand such extended sections would
have a case arm that covers such a name in the switch statement
you took this 'A' .. 'Z' from.
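
To make the two tree-entry points from earlier in this message
concrete (a subtree sorts as if '/' were suffixed to its name, and
the mode is recorded in minimal octal), here is a rough Python
sketch.  It is only an illustration and not git's own code; the
helper names tree_sort_key and encode_entry are made up for this
example, and the entry layout follows the <TREE_ENTRY> rule quoted
above.

```python
import stat

def tree_sort_key(name: bytes, mode: int) -> bytes:
    """Sort key for a tree entry (hypothetical helper).

    A subtree is compared as if its name ended with '/'; any other
    entry is compared by the raw bytes of its name.
    """
    return name + b"/" if stat.S_ISDIR(mode) else name

def encode_entry(mode: int, name: bytes, hex_id: str) -> bytes:
    """Serialize one entry as <OCTAL_MODE> SP <NAME> NUL <BINARY_OBJ_ID>.

    format(mode, "o") emits no leading zero, so a subtree is written
    as "40000" even though "git ls-tree" displays it as 040000.
    """
    return format(mode, "o").encode("ascii") + b" " + name + b"\0" + bytes.fromhex(hex_id)

# The entries from the "git ls-tree" listing, deliberately shuffled.
entries = [
    (0o100644, b"a=b", "efc73add7dd868242a66faf2a59b145f2a60b834"),
    (0o040000, b"a",   "23fddf6a57ff3ba98aa93fb71431276c3f1a3c40"),
    (0o100644, b"a-",  "2398e9f8892812607f5eee6ed0d5712c2e3de197"),
    (0o100644, b"a=",  "2afe6dcc5466068b8dcc7263cece05d2adf044fe"),
    (0o100644, b"a-b", "7f07527a80bd8c2b1c5087d7ccfe61073b068374"),
]

ordered = sorted(entries, key=lambda e: tree_sort_key(e[1], e[0]))
ordered_names = [name for _, name, _ in ordered]

# What a plain byte-wise sort of the names would give instead.
plain_names = sorted(name for _, name, _ in entries)

subtree_record = encode_entry(0o040000, b"a", "23fddf6a57ff3ba98aa93fb71431276c3f1a3c40")
```

With these inputs the subtree "a" lands between "a-b" and "a=",
because '/' (0x2f) sorts after '-' (0x2d) and before '=' (0x3d),
whereas a plain sort on the names would put "a" first.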