The multi-pack-index provides a fast way to find an object among a large list
of pack-files. It stores a single pack-reference for each object id, so
duplicate objects are ignored. Among a list of pack-files storing the same
object, the most-recently modified one is used.

Create new subcommands for the multi-pack-index builtin.

 * 'git multi-pack-index expire': If we have a pack-file indexed by the
   multi-pack-index, but all objects in that pack are duplicated in
   more-recently modified packs, then delete that pack (and any others like
   it). Delete the reference to that pack in the multi-pack-index.

 * 'git multi-pack-index repack --batch-size=<size>': Starting from the
   oldest pack-files covered by the multi-pack-index, find those whose
   on-disk size is below the batch size until we have a collection of packs
   whose sizes add up to the batch size. Create a new pack containing all
   objects that the multi-pack-index references in those packs.

This allows us to create a new pattern for repacking objects: run 'repack'.
After enough time has passed that all Git commands that started before the
last 'repack' are finished, run 'expire'. This approach has some advantages
over the existing "repack everything" model:

 1. Incremental. We can repack a small batch of objects at a time, instead of
    repacking all reachable objects. We can also limit ourselves to the
    objects that do not appear in newer pack-files.

 2. Highly Available. By adding a new pack-file (and not deleting the old
    pack-files) we do not interrupt concurrent Git commands, and do not
    suffer performance degradation. By expiring only pack-files that have no
    referenced objects, we know that Git commands that are doing normal
    object lookups* will not be interrupted.

* Note: if someone concurrently runs a Git command that uses
  get_all_packs(), then that command could try to read the pack-files and
  pack-indexes that we are deleting during an expire command. Such commands
  are usually related to object maintenance (e.g. fsck, gc, pack-objects) or
  to less-often-used features (e.g. fast-import, http-backend, server-info).

We plan to use this approach in VFS for Git to do background maintenance of
the "shared object cache", which is a Git alternate directory filled with
packfiles containing commits and trees. We currently download pack-files on
an hourly basis to keep up-to-date with the central server. The cache servers
supply packs on an hourly and daily basis, so most of the hourly packs become
useless after a new daily pack is downloaded. The 'expire' command would
clear out most of those packs, but many will remain, each with fewer than 100
objects still referenced. The 'repack' command (with a batch size of 1-3 GB,
probably) can condense the remaining packs in commands that run for 1-3 min
at a time. Since the daily packs range from 100-250 MB, we will also combine
and condense those packs.

Updates in V2:

 * Added a method, unlink_pack_path(), to remove packfiles, but with the
   additional check for a .keep file. This borrows logic from
   builtin/repack.c.

 * Modified documentation and commit messages to replace 'verb' with
   'subcommand'. Simplified the documentation. (I left 'verbs' in the title
   of the cover letter for consistency.)

Updates in V3:

 * There was a bug in the expire logic when simultaneously removing packs and
   adding uncovered packs, specifically around the pack permutation. This was
   hard to see during review because I was using the 'pack_perm' array for
   multiple purposes. First, I was reducing its length, and then I was adding
   to it and re-sorting. In V3, I significantly overhauled the logic here,
   which required some extra commits before implementing 'expire'. The final
   commit includes a test that would cover this case.

Updates in V4:

 * More 'verb' and 'command' instances replaced with 'subcommand'. I grepped
   the patch to check that these are fixed everywhere.

 * Update the tests to check .keep files (in last patch).

 * Modify the tests to show the terminating condition of --batch-size when
   there are three packs that fit under the size, but the first two are large
   enough to stop adding packs. This required rearranging the packs slightly
   to get different sizes than we had before. Also, I added 'touch -t' to set
   the modified times so we can fix the order in which the packs are
   selected.

 * Added a comment about the purpose of pack_perm.

Thanks,
-Stolee

Derrick Stolee (10):
  repack: refactor pack deletion for future use
  Docs: rearrange subcommands for multi-pack-index
  multi-pack-index: prepare for 'expire' subcommand
  midx: simplify computation of pack name lengths
  midx: refactor permutation logic and pack sorting
  multi-pack-index: implement 'expire' subcommand
  multi-pack-index: prepare 'repack' subcommand
  midx: implement midx_repack()
  multi-pack-index: test expire while adding packs
  midx: add test that 'expire' respects .keep files

 Documentation/git-multi-pack-index.txt |  26 +-
 builtin/multi-pack-index.c             |  14 +-
 builtin/repack.c                       |  14 +-
 midx.c                                 | 399 ++++++++++++++++++-------
 midx.h                                 |   2 +
 packfile.c                             |  28 ++
 packfile.h                             |   7 +
 t/t5319-multi-pack-index.sh            | 165 ++++++++++
 8 files changed, 536 insertions(+), 119 deletions(-)

base-commit: 26aa9fc81d4c7f6c3b456a29da0b7ec72e5c6595
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-92%2Fderrickstolee%2Fmidx-expire%2Fupstream-v4
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-92/derrickstolee/midx-expire/upstream-v4
Pull-Request: https://github.com/gitgitgadget/git/pull/92

Range-diff vs v3:

  1:  62b393b816 =  1:  62b393b816 repack: refactor pack deletion for future use
  2:  7886785904 =  2:  7886785904 Docs: rearrange subcommands for multi-pack-index
  3:  f06382b4ae !  3:  628ca46036 multi-pack-index: prepare for 'expire' subcommand
    @@ -16,7 +16,9 @@
         Add a test that verifies the 'expire' subcommand is correctly wired,
         but will still be valid when the verb is implemented. Specifically,
         create a set of packs that should all have referenced objects and
    -    should not be removed during an 'expire' operation.
    +    should not be removed during an 'expire' operation. The packs are
    +    created carefully to ensure they have a specific order when sorted
    +    by size. This will be important in a later test.

         Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx>

    @@ -95,6 +97,8 @@
     +    (
     +        cd dup &&
     +        git init &&
    ++        test-tool genrandom "data" 4096 >large_file.txt &&
    ++        git update-index --add large_file.txt &&
     +        for i in $(test_seq 1 20)
     +        do
     +            test_commit $i
    @@ -104,24 +108,24 @@
     +        git branch C HEAD~13 &&
     +        git branch D HEAD~16 &&
     +        git branch E HEAD~18 &&
    -+        git pack-objects --revs .git/objects/pack/pack-E <<-EOF &&
    -+        refs/heads/E
    ++        git pack-objects --revs .git/objects/pack/pack-A <<-EOF &&
    ++        refs/heads/A
    ++        ^refs/heads/B
     +        EOF
    -+        git pack-objects --revs .git/objects/pack/pack-D <<-EOF &&
    -+        refs/heads/D
    -+        ^refs/heads/E
    ++        git pack-objects --revs .git/objects/pack/pack-B <<-EOF &&
    ++        refs/heads/B
    ++        ^refs/heads/C
     +        EOF
     +        git pack-objects --revs .git/objects/pack/pack-C <<-EOF &&
     +        refs/heads/C
     +        ^refs/heads/D
     +        EOF
    -+        git pack-objects --revs .git/objects/pack/pack-B <<-EOF &&
    -+        refs/heads/B
    -+        ^refs/heads/C
    ++        git pack-objects --revs .git/objects/pack/pack-D <<-EOF &&
    ++        refs/heads/D
    ++        ^refs/heads/E
     +        EOF
    -+        git pack-objects --revs .git/objects/pack/pack-A <<-EOF &&
    -+        refs/heads/A
    -+        ^refs/heads/B
    ++        git pack-objects --revs .git/objects/pack/pack-E <<-EOF &&
    ++        refs/heads/E
     +        EOF
     +        git multi-pack-index write
     +    )
  4:  2a763990ae !  4:  d55c1d7ee7 midx: simplify computation of pack name lengths
    @@ -12,7 +12,7 @@
         dir not already covered by the multi-pack-index.

         In anticipation of this becoming more complicated with the 'expire'
    -    command, simplify the computation by centralizing it to a single
    +    subcommand, simplify the computation by centralizing it to a single
         loop before writing the file.

         Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx>

  5:  a0d4cc6cb3 !  5:  3950743b96 midx: refactor permutation logic and pack sorting
    @@ -282,6 +282,12 @@
     +    QSORT(packs.info, packs.nr, pack_info_compare);
     +
    ++    /*
    ++     * pack_perm stores a permutation between pack-int-ids from the
    ++     * previous multi-pack-index to the new one we are writing:
    ++     *
    ++     * pack_perm[old_id] = new_id
    ++     */
     +    ALLOC_ARRAY(pack_perm, packs.nr);
     +    for (i = 0; i < packs.nr; i++) {
     +        pack_perm[packs.info[i].orig_pack_int_id] = i;
  6:  4dbff40e7a !  6:  6691d97902 multi-pack-index: implement 'expire' verb
    @@ -1,8 +1,8 @@
     Author: Derrick Stolee <dstolee@xxxxxxxxxxxxx>

    -    multi-pack-index: implement 'expire' verb
    +    multi-pack-index: implement 'expire' subcommand

    -    The 'git multi-pack-index expire' command looks at the existing
    +    The 'git multi-pack-index expire' subcommand looks at the existing
         mult-pack-index, counts the number of objects referenced in each
         pack-file, deletes the pack-fils with no referenced objects, and
         rewrites the multi-pack-index to no longer reference those packs.
    @@ -18,7 +18,7 @@
         Test that a new pack-file that covers the contents of two other
         pack-files leads to those pack-files being deleted during the
    -    expire command. Be sure to read the multi-pack-index to ensure
    +    expire subcommand. Be sure to read the multi-pack-index to ensure
         it no longer references those packs.

         Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx>

    @@ -161,6 +161,11 @@
     +        }
     +    }
     +
    +     /*
    +      * pack_perm stores a permutation between pack-int-ids from the
    +      * previous multi-pack-index to the new one we are writing:
    +@@
    +      */
          ALLOC_ARRAY(pack_perm, packs.nr);
          for (i = 0; i < packs.nr; i++) {
     -        pack_perm[packs.info[i].orig_pack_int_id] = i;
    @@ -273,7 +278,9 @@
     +        test_cmp expect actual &&
     +        ls .git/objects/pack/ | grep idx >expect-idx &&
     +        test-tool read-midx .git/objects | grep idx >actual-midx &&
    -+        test_cmp expect-idx actual-midx
    ++        test_cmp expect-idx actual-midx &&
    ++        git multi-pack-index verify &&
    ++        git fsck
     +    )
     +'
     +
  7:  b39f90ad09 !  7:  f5a8ff21dd multi-pack-index: prepare 'repack' subcommand
    @@ -11,7 +11,7 @@
         operation does not interrupt concurrent git commands.

         Introduce a 'repack' subcommand to 'git multi-pack-index' that
    -    takes a '--batch-size' option. The verb will inspect the
    +    takes a '--batch-size' option. The subcommand will inspect the
         multi-pack-index for referenced pack-files whose size is smaller
         than the batch size, until collecting a list of pack-files whose
         sizes sum to larger than the batch size. Then, a new pack-file
    @@ -26,6 +26,11 @@
         we specify a small batch size, we will guarantee that future
         implementations do not change the list of pack-files.

    +    In addition, we hard-code the modified times of the packs in
    +    the pack directory to ensure the list of packs sorted by modified
    +    time matches the order if sorted by size (ascending). This will
    +    be important in a future test.
    +
         Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx>

         diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
    @@ -36,15 +41,15 @@
          afterward to remove all references to these pack-files.

     +repack::
    -+    Collect a batch of pack-files whose size are all at most the
    -+    size given by --batch-size, but whose sizes sum to larger
    -+    than --batch-size. The batch is selected by greedily adding
    -+    small pack-files starting with the oldest pack-files that fit
    -+    the size. Create a new pack-file containing the objects the
    -+    multi-pack-index indexes into those pack-files, and rewrite
    -+    the multi-pack-index to contain that pack-file. A later run
    -+    of 'git multi-pack-index expire' will delete the pack-files
    -+    that were part of this batch.
    ++    Create a new pack-file containing objects in small pack-files
    ++    referenced by the multi-pack-index. Select the pack-files by
    ++    examining packs from oldest-to-newest, adding a pack if its
    ++    size is below the batch size. Stop adding packs when the sum
    ++    of sizes of the added packs is above the batch size. If the
    ++    total size does not reach the batch size, then do nothing.
    ++    Rewrite the multi-pack-index to reference the new pack-file.
    ++    A later run of 'git multi-pack-index expire' will delete the
    ++    pack-files that were part of this batch.
     +

      EXAMPLES
    @@ -84,11 +89,18 @@
     +    if (!strcmp(argv[0], "repack"))
     +        return midx_repack(opts.object_dir, (size_t)opts.batch_size);
     +    if (opts.batch_size)
    -+        die(_("--batch-size option is only for 'repack' verb"));
    ++        die(_("--batch-size option is only for 'repack' subcommand"));
     +
          if (!strcmp(argv[0], "write"))
              return write_midx_file(opts.object_dir);
          if (!strcmp(argv[0], "verify"))
    +@@
    +     if (!strcmp(argv[0], "expire"))
    +         return expire_midx_packs(opts.object_dir);
    +
    +-    die(_("unrecognized verb: %s"), argv[0]);
    ++    die(_("unrecognized subcommand: %s"), argv[0]);
    + }

      diff --git a/midx.c b/midx.c
      --- a/midx.c
    @@ -125,6 +137,12 @@
     +test_expect_success 'repack with minimum size does not alter existing packs' '
     +    (
     +        cd dup &&
    ++        rm -rf .git/objects/pack &&
    ++        mv .git/objects/pack-backup .git/objects/pack &&
    ++        touch -m -t 201901010000 .git/objects/pack/pack-D* &&
    ++        touch -m -t 201901010001 .git/objects/pack/pack-C* &&
    ++        touch -m -t 201901010002 .git/objects/pack/pack-B* &&
    ++        touch -m -t 201901010003 .git/objects/pack/pack-A* &&
     +        ls .git/objects/pack >expect &&
     +        MINSIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 1) &&
     +        git multi-pack-index repack --batch-size=$MINSIZE &&
  8:  a4c2d5a8e1 !  8:  ba1a1c7bbb midx: implement midx_repack()
    @@ -149,6 +149,16 @@
      diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
      --- a/t/t5319-multi-pack-index.sh
      +++ b/t/t5319-multi-pack-index.sh
    +@@
    +         git pack-objects --revs .git/objects/pack/pack-E <<-EOF &&
    +         refs/heads/E
    +         EOF
    +-        git multi-pack-index write
    ++        git multi-pack-index write &&
    ++        cp -r .git/objects/pack .git/objects/pack-backup
    +     )
    + '
    +
      @@
          )
      '
    @@ -156,25 +166,28 @@
     +test_expect_success 'repack creates a new pack' '
     +    (
     +        cd dup &&
    -+        SECOND_SMALLEST_SIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 2 | tail -n 1) &&
    -+        BATCH_SIZE=$(($SECOND_SMALLEST_SIZE + 1)) &&
    -+        git multi-pack-index repack --batch-size=$BATCH_SIZE &&
     +        ls .git/objects/pack/*idx >idx-list &&
     +        test_line_count = 5 idx-list &&
    ++        THIRD_SMALLEST_SIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 3 | tail -n 1) &&
    ++        BATCH_SIZE=$(($THIRD_SMALLEST_SIZE + 1)) &&
    ++        git multi-pack-index repack --batch-size=$BATCH_SIZE &&
    ++        ls .git/objects/pack/*idx >idx-list &&
    ++        test_line_count = 6 idx-list &&
     +        test-tool read-midx .git/objects | grep idx >midx-list &&
    -+        test_line_count = 5 midx-list
    ++        test_line_count = 6 midx-list
     +    )
     +'
     +
     +test_expect_success 'expire removes repacked packs' '
     +    (
     +        cd dup &&
    -+        ls -S .git/objects/pack/*pack | head -n 3 >expect &&
    ++        ls -al .git/objects/pack/*pack &&
    ++        ls -S .git/objects/pack/*pack | head -n 4 >expect &&
     +        git multi-pack-index expire &&
     +        ls -S .git/objects/pack/*pack >actual &&
     +        test_cmp expect actual &&
     +        test-tool read-midx .git/objects | grep idx >midx-list &&
    -+        test_line_count = 3 midx-list
    ++        test_line_count = 4 midx-list
     +    )
     +'
     +
  9:  b97fb35ba9 =  9:  b1c6892417 multi-pack-index: test expire while adding packs
  -:  ---------- > 10:  481b08890f midx: add test that 'expire' respects .keep files

-- 
gitgitgadget
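
The batch selection that the cover letter and the new 'repack' documentation
describe can be sketched in plain shell. This is only an illustration under
stated assumptions, not the code in midx.c: select_batch is a hypothetical
helper (it is not a Git command), it reads "<size> <name>" lines listed
oldest first, it treats a pack as a candidate only when its size is strictly
below the batch size, and it considers the batch complete once the running
total reaches the batch size.

```shell
#!/bin/sh
# Hypothetical helper, not part of Git: a sketch of the greedy batch
# selection described for 'git multi-pack-index repack'.
#
# stdin:  one "<size-in-bytes> <pack-name>" line per pack, oldest first
#         (assumes pack names contain no whitespace)
# stdout: the names of the packs chosen for the batch, one per line, or
#         nothing if the candidates never add up to the batch size
select_batch () {
	batch_size=$1
	total=0
	selected=
	while read -r size name
	do
		# A pack at or above the batch size is never a candidate.
		if [ "$size" -ge "$batch_size" ]
		then
			continue
		fi
		selected="$selected $name"
		total=$((total + size))
		# Stop adding packs once the batch is large enough.
		if [ "$total" -ge "$batch_size" ]
		then
			break
		fi
	done
	# If the total never reaches the batch size, do nothing.
	if [ "$total" -ge "$batch_size" ]
	then
		for pack in $selected
		do
			printf '%s\n' "$pack"
		done
	fi
}

# Example: three packs fit under the size, but the first two already
# reach the batch size, so pack-B is left out.
printf '%s\n' '60 pack-D' '70 pack-C' '80 pack-B' '500 pack-A' |
select_batch 81
```

With the sample input, the sketch prints pack-D and pack-C. A batch size of
60 selects nothing, because no pack is strictly smaller than 60 bytes; that
matches the intent of the 'repack with minimum size does not alter existing
packs' test.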