The multi-pack-index provides a fast way to find an object among a large
list of pack-files. It stores a single pack-reference for each object id,
so duplicate objects are ignored. Among a list of pack-files storing the
same object, the most-recently modified one is used.

Create new subcommands for the multi-pack-index builtin.

* 'git multi-pack-index expire': If we have a pack-file indexed by the
  multi-pack-index, but all objects in that pack are duplicated in
  more-recently modified packs, then delete that pack (and any others
  like it). Delete the reference to that pack in the multi-pack-index.

* 'git multi-pack-index repack --batch-size=<size>': Starting from the
  oldest pack-files covered by the multi-pack-index, pick the packs
  whose "expected size" is below the batch size until the selected
  packs' expected sizes add up to at least the batch size. We compute
  the expected size by multiplying the number of referenced objects by
  the pack-size and dividing by the total number of objects in the pack
  (see the worked example below). If the batch-size is zero, then
  select all packs. Create a new pack containing all objects that the
  multi-pack-index references in those packs.

This allows us to create a new pattern for repacking objects: run
'repack'. After enough time has passed that all Git commands that
started before the last 'repack' are finished, run 'expire'. This
approach has some advantages over the existing "repack everything"
model:

1. Incremental. We can repack a small batch of objects at a time,
   instead of repacking all reachable objects. We can also limit
   ourselves to the objects that do not appear in newer pack-files.

2. Highly Available. By adding a new pack-file (and not deleting the
   old pack-files) we do not interrupt concurrent Git commands, and do
   not suffer performance degradation. By expiring only pack-files that
   have no referenced objects, we know that Git commands doing normal
   object lookups* will not be interrupted.

* Note: if someone concurrently runs a Git command that uses
  get_all_packs(), then that command could try to read the pack-files
  and pack-indexes that we are deleting during an expire command. Such
  commands are usually related to object maintenance (e.g. fsck, gc,
  pack-objects) or to less-often-used features (e.g. fast-import,
  http-backend, server-info).

We are using this approach in VFS for Git to do background maintenance
of the "shared object cache", which is a Git alternate directory filled
with packfiles containing commits and trees. We currently download
pack-files on an hourly basis to keep up-to-date with the central
server. The cache servers supply packs on an hourly and daily basis, so
most of the hourly packs become useless after a new daily pack is
downloaded. The 'expire' command would clear out most of those packs,
but many packs would still remain, each with fewer than 100 referenced
objects. The 'repack' command (with a batch size of 1-3 GB, probably)
can condense the remaining packs in commands that run for 1-3 minutes
at a time. Since the daily packs range from 100-250 MB, we will also
combine and condense those packs.
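As a worked example of the "expected size" rule (the numbers here are
made up for illustration, not measured from any real repository): a
2 GB pack holding 1,000,000 objects, of which only 50,000 are still
referenced by the multi-pack-index, has an expected size of

  2 GB * 50,000 / 1,000,000 = about 100 MB

so a batch size of 1 GB selects that pack even though its file size is
twice the batch size. This is exactly the situation above where most of
an old pack's objects have already been superseded by newer packs.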
Updates in V5:

* Fixed the error in PATCH 7 due to a missing line that existed in
  PATCH 8. Thanks, Josh Steadmon!

* The 'repack' subcommand now computes the "expected size" of a pack
  instead of relying on the total size of the pack. This is actually
  really important to the way VFS for Git uses prefetch packs: some
  packs were not being repacked because the pack size is larger than
  the batch size, even though only a few objects in those packs are
  still referenced.

* The 'repack' subcommand now allows a batch size of zero to mean
  "create one pack containing all objects in the multi-pack-index". A
  new commit adds a test that hits the boundary cases here, but it
  follows the 'expire' subcommand so we can show the repack-then-expire
  cycle safely replacing the packs.

Junio: It appears that there are some conflicts with the trace2 changes
in master. These conflicts are not new to the updates in this version.
I saw how you resolved them, and replaying that resolution should work
for you.
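For anyone who wants to try the new pattern locally, the cycle
described above amounts to roughly the following sketch (the 2g batch
size is only an example and assumes the usual k/m/g magnitude suffixes;
a plain byte count works as well):

  git multi-pack-index write
  git multi-pack-index repack --batch-size=2g
  # ...wait until Git processes started before the repack have exited...
  git multi-pack-index expire

  # A batch size of zero rolls every object referenced by the
  # multi-pack-index into a single new pack:
  git multi-pack-index repack --batch-size=0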
Thanks,
-Stolee

Derrick Stolee (11):
  repack: refactor pack deletion for future use
  Docs: rearrange subcommands for multi-pack-index
  multi-pack-index: prepare for 'expire' subcommand
  midx: simplify computation of pack name lengths
  midx: refactor permutation logic and pack sorting
  multi-pack-index: implement 'expire' subcommand
  multi-pack-index: prepare 'repack' subcommand
  midx: implement midx_repack()
  multi-pack-index: test expire while adding packs
  midx: add test that 'expire' respects .keep files
  t5319-multi-pack-index.sh: test batch size zero

 Documentation/git-multi-pack-index.txt |  32 +-
 builtin/multi-pack-index.c             |  14 +-
 builtin/repack.c                       |  14 +-
 midx.c                                 | 440 +++++++++++++++++++------
 midx.h                                 |   2 +
 packfile.c                             |  28 ++
 packfile.h                             |   7 +
 t/t5319-multi-pack-index.sh            | 184 +++++++++++
 8 files changed, 602 insertions(+), 119 deletions(-)

base-commit: 26aa9fc81d4c7f6c3b456a29da0b7ec72e5c6595
-- 
2.21.0.1096.g1c91fdc207

Range-diff against v4:
 1:  62b393b816 =  1:  62b393b816 repack: refactor pack deletion for future use
 2:  7886785904 =  2:  7886785904 Docs: rearrange subcommands for multi-pack-index
 3:  628ca46036 =  3:  628ca46036 multi-pack-index: prepare for 'expire' subcommand
 4:  d55c1d7ee7 =  4:  d55c1d7ee7 midx: simplify computation of pack name lengths
 5:  3950743b96 =  5:  3950743b96 midx: refactor permutation logic and pack sorting
 6:  6691d97902 =  6:  6691d97902 multi-pack-index: implement 'expire' subcommand
 7:  f5a8ff21dd !  7:  e66e383231 multi-pack-index: prepare 'repack' subcommand
    @@ -42,14 +42,20 @@
     +repack::
     +	Create a new pack-file containing objects in small pack-files
    -+	referenced by the multi-pack-index. Select the pack-files by
    -+	examining packs from oldest-to-newest, adding a pack if its
    -+	size is below the batch size. Stop adding packs when the sum
    -+	of sizes of the added packs is above the batch size. If the
    -+	total size does not reach the batch size, then do nothing.
    -+	Rewrite the multi-pack-index to reference the new pack-file.
    -+	A later run of 'git multi-pack-index expire' will delete the
    -+	pack-files that were part of this batch.
    ++	referenced by the multi-pack-index. If the size given by the
    ++	`--batch-size=<size>` argument is zero, then create a pack
    ++	containing all objects referenced by the multi-pack-index. For
    ++	a non-zero batch size, Select the pack-files by examining packs
    ++	from oldest-to-newest, computing the "expected size" by counting
    ++	the number of objects in the pack referenced by the
    ++	multi-pack-index, then divide by the total number of objects in
    ++	the pack and multiply by the pack size. We select packs with
    ++	expected size below the batch size until the set of packs have
    ++	total expected size at least the batch size. If the total size
    ++	does not reach the batch size, then do nothing. If a new pack-
    ++	file is created, rewrite the multi-pack-index to reference the
    ++	new pack-file. A later run of 'git multi-pack-index expire' will
    ++	delete the pack-files that were part of this batch.
     +
      EXAMPLES
    @@ -130,6 +136,16 @@
     diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
     --- a/t/t5319-multi-pack-index.sh
     +++ b/t/t5319-multi-pack-index.sh
    +@@
    + 	git pack-objects --revs .git/objects/pack/pack-E <<-EOF &&
    + 	refs/heads/E
    + 	EOF
    +-	git multi-pack-index write
    ++	git multi-pack-index write &&
    ++	cp -r .git/objects/pack .git/objects/pack-backup
    + 	)
    + '
    +
     @@
      	)
      '
 8:  ba1a1c7bbb !  8:  8e2a1140a5 midx: implement midx_repack()
    @@ -2,20 +2,35 @@ midx: implement midx_repack()
     
    -    To repack using a multi-pack-index, first sort all pack-files by
    +    To repack with a non-zero batch-size, first sort all pack-files by
         their modified time. Second, walk those pack-files from oldest
    -    to newest, adding the packs to a list if they are smaller than the
    -    given pack-size. Finally, collect the objects from the multi-pack-
    -    index that are in those packs and send them to 'git pack-objects'.
    +    to newest, compute their expected size, and add the packs to a list
    +    if they are smaller than the given batch-size. Stop when the total
    +    expected size is at least the batch size.
    +
    +    If the batch size is zero, select all packs in the multi-pack-index.
    +
    +    Finally, collect the objects from the multi-pack-index that are in
    +    the selected packs and send them to 'git pack-objects'. Write a new
    +    multi-pack-index that includes the new pack.
    +
    +    Using a batch size of zero is very similar to a standard 'git repack'
    +    command, except that we do not delete the old packs and instead rely
    +    on the new multi-pack-index to prevent new processes from reading the
    +    old packs. This does not disrupt other Git processes that are currently
    +    reading the old packs based on the old multi-pack-index.
     
         While first designing a 'git multi-pack-index repack' operation, I
    -    started by collecting the batches based on the size of the objects
    -    instead of the size of the pack-files. This allows repacking a
    -    large pack-file that has very few referencd objects. However, this
    +    started by collecting the batches based on the actual size of the
    +    objects instead of the size of the pack-files. This allows repacking
    +    a large pack-file that has very few referencd objects. However, this
         came at a significant cost of parsing pack-files instead of simply
         reading the multi-pack-index and getting the file information for
    -    the pack-files. This object-size idea could be a direction for
    -    future expansion in this area.
    +    the pack-files. The "expected size" version provides similar
    +    behavior, but could skip a pack-file if the average object size is
    +    much larger than the actual size of the referenced objects, or
    +    can create a large pack if the actual size of the referenced objects
    +    is larger than the expected size.
     
         Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx>
    
    @@ -35,69 +50,111 @@
      }
     
     -int midx_repack(const char *object_dir, size_t batch_size)
    -+struct time_and_id {
    ++struct repack_info {
     +	timestamp_t mtime;
    ++	uint32_t referenced_objects;
     +	uint32_t pack_int_id;
     +};
     +
     +static int compare_by_mtime(const void *a_, const void *b_)
      {
    -+	const struct time_and_id *a, *b;
    ++	const struct repack_info *a, *b;
     +
    -+	a = (const struct time_and_id *)a_;
    -+	b = (const struct time_and_id *)b_;
    ++	a = (const struct repack_info *)a_;
    ++	b = (const struct repack_info *)b_;
     +
     +	if (a->mtime < b->mtime)
     +		return -1;
     +	if (a->mtime > b->mtime)
     +		return 1;
    - 	return 0;
    - }
    ++	return 0;
    ++}
     +
    -+int midx_repack(const char *object_dir, size_t batch_size)
    ++static int fill_included_packs_all(struct multi_pack_index *m,
    ++				   unsigned char *include_pack)
     +{
    -+	int result = 0;
    -+	uint32_t i, packs_to_repack;
    -+	size_t total_size;
    -+	struct time_and_id *pack_ti;
    -+	unsigned char *include_pack;
    -+	struct child_process cmd = CHILD_PROCESS_INIT;
    -+	struct strbuf base_name = STRBUF_INIT;
    -+	struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
    ++	uint32_t i;
     +
    -+	if (!m)
    -+		return 0;
    ++	for (i = 0; i < m->num_packs; i++)
    ++		include_pack[i] = 1;
     +
    -+	include_pack = xcalloc(m->num_packs, sizeof(unsigned char));
    -+	pack_ti = xcalloc(m->num_packs, sizeof(struct time_and_id));
    ++	return m->num_packs < 2;
    ++}
    ++
    ++static int fill_included_packs_batch(struct multi_pack_index *m,
    ++				     unsigned char *include_pack,
    ++				     size_t batch_size)
    ++{
    ++	uint32_t i, packs_to_repack;
    ++	size_t total_size;
    ++	struct repack_info *pack_info = xcalloc(m->num_packs, sizeof(struct repack_info));
     +
     +	for (i = 0; i < m->num_packs; i++) {
    -+		pack_ti[i].pack_int_id = i;
    ++		pack_info[i].pack_int_id = i;
     +
     +		if (prepare_midx_pack(m, i))
     +			continue;
     +
    -+		pack_ti[i].mtime = m->packs[i]->mtime;
    ++		pack_info[i].mtime = m->packs[i]->mtime;
    ++	}
    ++
    ++	for (i = 0; batch_size && i < m->num_objects; i++) {
    ++		uint32_t pack_int_id = nth_midxed_pack_int_id(m, i);
    ++		pack_info[pack_int_id].referenced_objects++;
     +	}
    -+	QSORT(pack_ti, m->num_packs, compare_by_mtime);
    ++
    ++	QSORT(pack_info, m->num_packs, compare_by_mtime);
     +
     +	total_size = 0;
     +	packs_to_repack = 0;
     +	for (i = 0; total_size < batch_size && i < m->num_packs; i++) {
    -+		int pack_int_id = pack_ti[i].pack_int_id;
    ++		int pack_int_id = pack_info[i].pack_int_id;
     +		struct packed_git *p = m->packs[pack_int_id];
    ++		size_t expected_size;
     +
     +		if (!p)
     +			continue;
    -+		if (p->pack_size >= batch_size)
    ++		if (open_pack_index(p) || !p->num_objects)
    ++			continue;
    ++
    ++		expected_size = (size_t)(p->pack_size
    ++					 * pack_info[i].referenced_objects);
    ++		expected_size /= p->num_objects;
    ++
    ++		if (expected_size >= batch_size)
     +			continue;
     +
     +		packs_to_repack++;
    -+		total_size += p->pack_size;
    ++		total_size += expected_size;
     +		include_pack[pack_int_id] = 1;
     +	}
     +
    ++	free(pack_info);
    ++
     +	if (total_size < batch_size || packs_to_repack < 2)
    ++		return 1;
    ++
    + 	return 0;
    ++}
    ++
    ++int midx_repack(const char *object_dir, size_t batch_size)
    ++{
    ++	int result = 0;
    ++	uint32_t i;
    ++	unsigned char *include_pack;
    ++	struct child_process cmd = CHILD_PROCESS_INIT;
    ++	struct strbuf base_name = STRBUF_INIT;
    ++	struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
    ++
    ++	if (!m)
    ++		return 0;
    ++
    ++	include_pack = xcalloc(m->num_packs, sizeof(unsigned char));
    ++
    ++	if (batch_size) {
    ++		if (fill_included_packs_batch(m, include_pack, batch_size))
    ++			goto cleanup;
    ++	} else if (fill_included_packs_all(m, include_pack))
     +		goto cleanup;
     +
     +	argv_array_push(&cmd.args, "pack-objects");
    @@ -142,23 +199,12 @@
     +	if (m)
     +		close_midx(m);
     +	free(include_pack);
    -+	free(pack_ti);
     +	return result;
    -+}
    + }
     diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
     --- a/t/t5319-multi-pack-index.sh
     +++ b/t/t5319-multi-pack-index.sh
    -@@
    - 	git pack-objects --revs .git/objects/pack/pack-E <<-EOF &&
    - 	refs/heads/E
    - 	EOF
    --	git multi-pack-index write
    -+	git multi-pack-index write &&
    -+	cp -r .git/objects/pack .git/objects/pack-backup
    - 	)
    - '
    -
     @@
      	)
      '
 9:  b1c6892417 =  9:  f1e4669bf6 multi-pack-index: test expire while adding packs
10:  481b08890f = 10:  b1e9df9980 midx: add test that 'expire' respects .keep files
 -:  ---------- > 11:  f6217e9fae t5319-multi-pack-index.sh: test batch size zero