Commentary: This file format uses the large offsets from the pack-index version 2 format, but drops the CRC32 hashes from that format. Also: I included the HASH footer at the end only because it is already in the pack and pack-index formats, but not because it is particularly useful here. If possible, I'd like to remove it and speed up MIDX writes. -- >8 -- Add a technical documentation file describing the design for the multi-pack index (MIDX). Includes current limitations and future work. Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx> --- Documentation/technical/multi-pack-index.txt | 149 +++++++++++++++++++++++++++ 1 file changed, 149 insertions(+) create mode 100644 Documentation/technical/multi-pack-index.txt diff --git a/Documentation/technical/multi-pack-index.txt b/Documentation/technical/multi-pack-index.txt new file mode 100644 index 0000000000..d31b03dec5 --- /dev/null +++ b/Documentation/technical/multi-pack-index.txt @@ -0,0 +1,149 @@ +Multi-Pack-Index (MIDX) Design Notes +==================================== + +The Git object directory contains a 'pack' directory containing +packfiles (with suffix ".pack") and pack-indexes (with suffix +".idx"). The pack-indexes provide a way to lookup objects and +navigate to their offset within the pack, but these must come +in pairs with the packfiles. This pairing depends on the file +names, as the pack-index differs only in suffix with its pack- +file. While the pack-indexes provide fast lookup per packfile, +this performance degrades as the number of packfiles increases, +because abbreviations need to inspect every packfile and we are +more likely to have a miss on our most-recently-used packfile. +For some large repositories, repacking into a single packfile +is not feasible due to storage space or excessive repack times. + +The multi-pack-index (MIDX for short, with suffix ".midx") +stores a list of objects and their offsets into multiple pack- +files. It contains: + +- A list of packfile names. +- A sorted list of object IDs. +- A list of metadata for the ith object ID including: + - A value j referring to the jth packfile. + - An offset within the jth packfile for the object. +- If large offsets are required, we use another list of large + offsets similar to version 2 pack-indexes. + +Thus, we can provide O(log N) lookup time for any number +of packfiles. + +A new config setting 'core.midx' must be enabled before writing +or reading MIDX files. + +The MIDX files are updated by the 'midx' builtin with the +following common parameter combinations: + +- 'git midx' gives the hash of the current MIDX head. +- 'git midx --write --update-head --delete-expired' writes a new + MIDX file, points the MIDX head to that file, and deletes the + existing MIDX file if out-of-date. +- 'git midx --read' lists some basic information about the current + MIDX head. Used for basic tests. +- 'git midx --clear' deletes the current MIDX head. + +Design Details +-------------- + +- The MIDX file refers only to packfiles in the same directory + as the MIDX file. + +- A special file, 'midx-head', stores the hash of the latest + MIDX file so we can load the file without performing a dirstat. + This file is especially important with incremental MIDX files, + pointing to the newest file. + +- If a packfile exists in the pack directory but is not referenced + by the MIDX file, then the packfile is loaded into the packed_git + list and Git can access the objects as usual. This behavior is + necessary since other tools could add packfiles to the pack + directory without notifying Git. + +- The MIDX file should be only a supplemental structure. If a + user downgrades or disables the `core.midx` config setting, + then the existing .idx and .pack files should be sufficient + to operate correctly. + +- The file format includes parameters for the object id length + and hash algorithm, so a future change of hash algorithm does + not require a change in format. + +- If an object appears in multiple packfiles, then only one copy + is stored in the MIDX. This has a possible performance issue: + If an object appears as the delta-base of multiple objects from + multiple packs, then cross-pack delta calculations may slow down. + This is currently only theoretical and has not been demonstrated + to be a measurable issue. + +Current Limitations +------------------- + +- MIDX files are managed only by the midx builtin and is not + automatically updated on clone or fetch. + +- There is no '--verify' option for the midx builtin to verify + the contents of the MIDX file against the pack contents. + +- Constructing a MIDX file currently requires the single-pack + index for every pack being added to the MIDX. + +- The fsck builtin does not check MIDX files, but should. + +- The repack builtin is not aware of the MIDX files, and may + invalidate the MIDX files by deleting existing packfiles. The + MIDX may also be extended in the future to store metadata about + a packfile that can be used for faster repack commands. + +- The naive Git HTTP server advertises lists of packfiles using + the file system directly. + +Future Work +----------- + +- The current file-format requires between 28 and 36 bytes per + object. As the repository grows, the MIDX file can become + very large and become a bottleneck when updating the file. To + fix this "big write" problem, we can make the MIDX file + incremental. Instead of just one MIDX file, we will have a + sequence of MIDX files that can be unioned together. Then + on write we take the new objects to add and consider how many + existing files should be merged into a new file containing + the latest objects. + + This list of "base indexes" will be presented as an optional + chunk in the MIDX format and contains the OIDs for the base + files. Thus, the `midx_head` file only stores the OID for the + "tip" MIDX file and then the rest are loaded based on those + pointers, such as the following figure: + + [ BIG ] <- [ MEDIUM ] <- [tiny] <- midx_head + ^___________________________| + + The plan being that every write replaces the "tiny" index, + and when that index becomes large enough it merges with the + "medium" index and a new tiny index is created in the next + write. Very rarely, the "big" index would be updated, causing + a slow write. + +- After the MIDX feature is sufficiently hardened and widely used, + consider making Git more fully depend on the MIDX file. If MIDX + is the default, then we can delete the single-pack-indexes from + the pack directory. We could also allow thin packs in the pack + directory. + +- The MIDX could be extended to store a "stable object order" such + that adding objects to the order does not change the existing + objects. This would enable re-using the reachability bitmaps after + repacking and updating the MIDX file. + +Related Links +------------- + +[0] https://bugs.chromium.org/p/git/issues/detail?id=6 + Chromium work item for: Multi-Pack Index (MIDX) + +[1] https://public-inbox.org/git/CB5074CF.3AD7A%25joshua.redstone@xxxxxx/T/#u + Subject: Git performance results on a large repository + Date: 3 Feb 2012 + -- 2.15.0