A multi-pack-index (MIDX) file indexes the objects in multiple packfiles in a single pack directory. After a simple fixed-size header, the version 1 file format uses chunks to specify different regions of the data that correspond to different types of data, including: - List of OIDs in lex-order - A fanout table into the OID list - List of packfile names (relative to pack directory) - List of object metadata - Large offsets (if needed) By adding extra optional chunks, we can easily extend this format without invalidating written v1 files. One value in the header corresponds to a number of "base MIDX files" and will always be zero until the value is used in a later patch. We considered using a hashtable format instead of an ordered list of objects to reduce the O(log N) lookups to O(1) lookups, but two main issues arose that lead us to abandon the idea: - Extra space required to ensure collision counts were low. - We need to identify the two lexicographically closest OIDs for fast abbreviations. Binary search allows this. The current solution presents multiple packfiles as if they were packed into a single packfile with one pack-index. Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx> --- Documentation/technical/pack-format.txt | 85 +++++++++++++++++++++++++++++++++ 1 file changed, 85 insertions(+) diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt index 8e5bf60be3..ab459ef142 100644 --- a/Documentation/technical/pack-format.txt +++ b/Documentation/technical/pack-format.txt @@ -160,3 +160,88 @@ Pack file entry: <+ corresponding packfile. 20-byte SHA-1-checksum of all of the above. + +== midx-*.midx files have the following format: + +The multi-pack-index (MIDX) files refer to multiple pack-files. + +In order to allow extensions that add extra data to the MIDX format, we +organize the body into "chunks" and provide a lookup table at the beginning +of the body. The header includes certain length values, such as the number +of packs, the number of base MIDX files, hash lengths and types. + +All 4-byte numbers are in network order. + +HEADER: + + 4-byte signature: + The signature is: {'M', 'I', 'D', 'X'} + + 4-byte version number: + Git currently only supports version 1. + + 1-byte Object Id Version (1 = SHA-1) + + 1-byte Object Id Length (H) + + 1-byte number (I) of base multi-pack-index files: + This value is currently always zero. + + 1-byte number (C) of "chunks" + + 4-byte number (P) of pack files + +CHUNK LOOKUP: + + (C + 1) * 12 bytes providing the chunk offsets: + First 4 bytes describe chunk id. Value 0 is a terminating label. + Other 8 bytes provide offset in current file for chunk to start. + (Chunks are provided in file-order, so you can infer the length + using the next chunk position if necessary.) + + The remaining data in the body is described one chunk at a time, and + these chunks may be given in any order. Chunks are required unless + otherwise specified. + +CHUNK DATA: + + OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes) + The ith entry, F[i], stores the number of OIDs with first + byte at most i. Thus F[255] stores the total + number of objects (N). The number of objects with first byte + value i is (F[i] - F[i-1]) for i > 0. + + OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes) + The OIDs for all objects in the MIDX are stored in lexicographic + order in this chunk. + + Object Offsets (ID: {'O', 'O', 'F', 'F'}) (N * 8 bytes) + Stores two 4-byte values for every object. + 1: The pack-int-id for the pack storing this object. + 2: The offset within the pack. + If all offsets are less than 2^31, then the large offset chunk + will not exist and offsets are stored as in IDX v1. + If there is at least one offset value larger than 2^32-1, then + the large offset chunk must exist. If the large offset chunk + exists and the 31st bit is on, then removing that bit reveals + the row in the large offsets containing the 8-byte offset of + this object. + + [Optional] Object Large Offsets (ID: {'L', 'O', 'F', 'F'}) + 8-byte offsets into large packfiles. + + Packfile Name Lookup (ID: {'P', 'L', 'O', 'O'}) (P * 4 bytes) + P * 4 bytes storing the offset in the packfile name chunk for + the null-terminated string containing the filename for the + ith packfile. The filename is relative to the MIDX file's parent + directory. + + Packfile Names (ID: {'P', 'N', 'A', 'M'}) + Stores the packfile names as concatenated, null-terminated strings. + Packfiles must be listed in lexicographic order for fast lookups by + name. This is the only chunk not guaranteed to be a multiple of four + bytes in length, so it should be the last chunk for alignment reasons. + +TRAILER: + + H-byte HASH-checksum of all of the above. -- 2.15.0