On 2019.01.24 13:51, Derrick Stolee via GitGitGadget wrote: > The multi-pack-index provides a fast way to find an object among a large > list of pack-files. It stores a single pack-reference for each object id, so > duplicate objects are ignored. Among a list of pack-files storing the same > object, the most-recently modified one is used. > > Create new subcommands for the multi-pack-index builtin. > > * 'git multi-pack-index expire': If we have a pack-file indexed by the > multi-pack-index, but all objects in that pack are duplicated in > more-recently modified packs, then delete that pack (and any others like > it). Delete the reference to that pack in the multi-pack-index. > > > * 'git multi-pack-index repack --batch-size=': Starting from the oldest > pack-files covered by the multi-pack-index, find those whose on-disk size > is below the batch size until we have a collection of packs whose sizes > add up to the batch size. Create a new pack containing all objects that > the multi-pack-index references to those packs. > > > > This allows us to create a new pattern for repacking objects: run 'repack'. > After enough time has passed that all Git commands that started before the > last 'repack' are finished, run 'expire' again. This approach has some > advantages over the existing "repack everything" model: > > 1. Incremental. We can repack a small batch of objects at a time, instead > of repacking all reachable objects. We can also limit ourselves to the > objects that do not appear in newer pack-files. > > > 2. Highly Available. By adding a new pack-file (and not deleting the old > pack-files) we do not interrupt concurrent Git commands, and do not > suffer performance degradation. By expiring only pack-files that have no > referenced objects, we know that Git commands that are doing normal > object lookups* will not be interrupted. > > > 3. Note: if someone concurrently runs a Git command that uses > get_all_packs(), then that command could try to read the pack-files and > pack-indexes that we are deleting during an expire command. Such > commands are usually related to object maintenance (i.e. fsck, gc, > pack-objects) or are related to less-often-used features (i.e. > fast-import, http-backend, server-info). > > > > We plan to use this approach in VFS for Git to do background maintenance of > the "shared object cache" which is a Git alternate directory filled with > packfiles containing commits and trees. We currently download pack-files on > an hourly basis to keep up-to-date with the central server. The cache > servers supply packs on an hourly and daily basis, so most of the hourly > packs become useless after a new daily pack is downloaded. The 'expire' > command would clear out most of those packs, but many will still remain with > fewer than 100 objects remaining. The 'repack' command (with a batch size of > 1-3gb, probably) can condense the remaining packs in commands that run for > 1-3 min at a time. Since the daily packs range from 100-250mb, we will also > combine and condense those packs. > > Updates in V2: > > * Added a method, unlink_pack_path() to remove packfiles, but with the > additional check for a .keep file. This borrows logic from > builtin/repack.c. > > > * Modified documentation and commit messages to replace 'verb' with > 'subcommand'. Simplified the documentation. (I left 'verbs' in the title > of the cover letter for consistency.) > > > > Updates in V3: > > * There was a bug in the expire logic when simultaneously removing packs > and adding uncovered packs, specifically around the pack permutation. > This was hard to see during review because I was using the 'pack_perm' > array for multiple purposes. First, I was reducing its length, and then I > was adding to it and resorting. In V3, I significantly overhauled the > logic here, which required some extra commits before implementing > 'expire'. The final commit includes a test that would cover this case. > > Updates in V4: > > * More 'verb' and 'command' instances replaced with 'subcommand'. I grepped > the patch to check these should be fixed everywhere. > > > * Update the tests to check .keep files (in last patch). > > > * Modify the tests to show the terminating condition of --batch-size when > there are three packs that fit under the size, but the first two are > large enough to stop adding packs. This required rearranging the packs > slightly to get different sizes than we had before. Also, I added 'touch > -t' to set the modified times so we can fix the order in which the packs > are selected. > > > * Added a comment about the purpose of pack_perm. > > > > Thanks, -Stolee > > Derrick Stolee (10): > repack: refactor pack deletion for future use > Docs: rearrange subcommands for multi-pack-index > multi-pack-index: prepare for 'expire' subcommand > midx: simplify computation of pack name lengths > midx: refactor permutation logic and pack sorting > multi-pack-index: implement 'expire' subcommand > multi-pack-index: prepare 'repack' subcommand > midx: implement midx_repack() > multi-pack-index: test expire while adding packs > midx: add test that 'expire' respects .keep files > > Documentation/git-multi-pack-index.txt | 26 +- > builtin/multi-pack-index.c | 14 +- > builtin/repack.c | 14 +- > midx.c | 399 ++++++++++++++++++------- > midx.h | 2 + > packfile.c | 28 ++ > packfile.h | 7 + > t/t5319-multi-pack-index.sh | 165 ++++++++++ > 8 files changed, 536 insertions(+), 119 deletions(-) > > > base-commit: 26aa9fc81d4c7f6c3b456a29da0b7ec72e5c6595 > Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-92%2Fderrickstolee%2Fmidx-expire%2Fupstream-v4 > Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-92/derrickstolee/midx-expire/upstream-v4 > Pull-Request: https://github.com/gitgitgadget/git/pull/92 > > Range-diff vs v3: > > 1: 62b393b816 = 1: 62b393b816 repack: refactor pack deletion for future use > 2: 7886785904 = 2: 7886785904 Docs: rearrange subcommands for multi-pack-index > 3: f06382b4ae ! 3: 628ca46036 multi-pack-index: prepare for 'expire' subcommand > @@ -16,7 +16,9 @@ > Add a test that verifies the 'expire' subcommand is correctly wired, > but will still be valid when the verb is implemented. Specifically, > create a set of packs that should all have referenced objects and > - should not be removed during an 'expire' operation. > + should not be removed during an 'expire' operation. The packs are > + created carefully to ensure they have a specific order when sorted > + by size. This will be important in a later test. > > Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx> > > @@ -95,6 +97,8 @@ > + ( > + cd dup && > + git init && > ++ test-tool genrandom "data" 4096 >large_file.txt && > ++ git update-index --add large_file.txt && > + for i in $(test_seq 1 20) > + do > + test_commit $i > @@ -104,24 +108,24 @@ > + git branch C HEAD~13 && > + git branch D HEAD~16 && > + git branch E HEAD~18 && > -+ git pack-objects --revs .git/objects/pack/pack-E <<-EOF && > -+ refs/heads/E > ++ git pack-objects --revs .git/objects/pack/pack-A <<-EOF && > ++ refs/heads/A > ++ ^refs/heads/B > + EOF > -+ git pack-objects --revs .git/objects/pack/pack-D <<-EOF && > -+ refs/heads/D > -+ ^refs/heads/E > ++ git pack-objects --revs .git/objects/pack/pack-B <<-EOF && > ++ refs/heads/B > ++ ^refs/heads/C > + EOF > + git pack-objects --revs .git/objects/pack/pack-C <<-EOF && > + refs/heads/C > + ^refs/heads/D > + EOF > -+ git pack-objects --revs .git/objects/pack/pack-B <<-EOF && > -+ refs/heads/B > -+ ^refs/heads/C > ++ git pack-objects --revs .git/objects/pack/pack-D <<-EOF && > ++ refs/heads/D > ++ ^refs/heads/E > + EOF > -+ git pack-objects --revs .git/objects/pack/pack-A <<-EOF && > -+ refs/heads/A > -+ ^refs/heads/B > ++ git pack-objects --revs .git/objects/pack/pack-E <<-EOF && > ++ refs/heads/E > + EOF > + git multi-pack-index write > + ) > 4: 2a763990ae ! 4: d55c1d7ee7 midx: simplify computation of pack name lengths > @@ -12,7 +12,7 @@ > dir not already covered by the multi-pack-index. > > In anticipation of this becoming more complicated with the 'expire' > - command, simplify the computation by centralizing it to a single > + subcommand, simplify the computation by centralizing it to a single > loop before writing the file. > > Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx> > 5: a0d4cc6cb3 ! 5: 3950743b96 midx: refactor permutation logic and pack sorting > @@ -282,6 +282,12 @@ > > + QSORT(packs.info, packs.nr, pack_info_compare); > + > ++ /* > ++ * pack_perm stores a permutation between pack-int-ids from the > ++ * previous multi-pack-index to the new one we are writing: > ++ * > ++ * pack_perm[old_id] = new_id > ++ */ > + ALLOC_ARRAY(pack_perm, packs.nr); > + for (i = 0; i < packs.nr; i++) { > + pack_perm[packs.info[i].orig_pack_int_id] = i; > 6: 4dbff40e7a ! 6: 6691d97902 multi-pack-index: implement 'expire' verb > @@ -1,8 +1,8 @@ > Author: Derrick Stolee <dstolee@xxxxxxxxxxxxx> > > - multi-pack-index: implement 'expire' verb > + multi-pack-index: implement 'expire' subcommand > > - The 'git multi-pack-index expire' command looks at the existing > + The 'git multi-pack-index expire' subcommand looks at the existing > mult-pack-index, counts the number of objects referenced in each > pack-file, deletes the pack-fils with no referenced objects, and > rewrites the multi-pack-index to no longer reference those packs. > @@ -18,7 +18,7 @@ > > Test that a new pack-file that covers the contents of two other > pack-files leads to those pack-files being deleted during the > - expire command. Be sure to read the multi-pack-index to ensure > + expire subcommand. Be sure to read the multi-pack-index to ensure > it no longer references those packs. > > Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx> > @@ -161,6 +161,11 @@ > + } > + } > + > + /* > + * pack_perm stores a permutation between pack-int-ids from the > + * previous multi-pack-index to the new one we are writing: > +@@ > + */ > ALLOC_ARRAY(pack_perm, packs.nr); > for (i = 0; i < packs.nr; i++) { > - pack_perm[packs.info[i].orig_pack_int_id] = i; > @@ -273,7 +278,9 @@ > + test_cmp expect actual && > + ls .git/objects/pack/ | grep idx >expect-idx && > + test-tool read-midx .git/objects | grep idx >actual-midx && > -+ test_cmp expect-idx actual-midx > ++ test_cmp expect-idx actual-midx && > ++ git multi-pack-index verify && > ++ git fsck > + ) > +' > + > 7: b39f90ad09 ! 7: f5a8ff21dd multi-pack-index: prepare 'repack' subcommand > @@ -11,7 +11,7 @@ > operation does not interrupt concurrent git commands. > > Introduce a 'repack' subcommand to 'git multi-pack-index' that > - takes a '--batch-size' option. The verb will inspect the > + takes a '--batch-size' option. The subcommand will inspect the > multi-pack-index for referenced pack-files whose size is smaller > than the batch size, until collecting a list of pack-files whose > sizes sum to larger than the batch size. Then, a new pack-file > @@ -26,6 +26,11 @@ > we specify a small batch size, we will guarantee that future > implementations do not change the list of pack-files. > > + In addition, we hard-code the modified times of the packs in > + the pack directory to ensure the list of packs sorted by modified > + time matches the order if sorted by size (ascending). This will > + be important in a future test. > + > Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx> > > diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt > @@ -36,15 +41,15 @@ > afterward to remove all references to these pack-files. > > +repack:: > -+ Collect a batch of pack-files whose size are all at most the > -+ size given by --batch-size, but whose sizes sum to larger > -+ than --batch-size. The batch is selected by greedily adding > -+ small pack-files starting with the oldest pack-files that fit > -+ the size. Create a new pack-file containing the objects the > -+ multi-pack-index indexes into those pack-files, and rewrite > -+ the multi-pack-index to contain that pack-file. A later run > -+ of 'git multi-pack-index expire' will delete the pack-files > -+ that were part of this batch. > ++ Create a new pack-file containing objects in small pack-files > ++ referenced by the multi-pack-index. Select the pack-files by > ++ examining packs from oldest-to-newest, adding a pack if its > ++ size is below the batch size. Stop adding packs when the sum > ++ of sizes of the added packs is above the batch size. If the > ++ total size does not reach the batch size, then do nothing. > ++ Rewrite the multi-pack-index to reference the new pack-file. > ++ A later run of 'git multi-pack-index expire' will delete the > ++ pack-files that were part of this batch. > + > > EXAMPLES > @@ -84,11 +89,18 @@ > + if (!strcmp(argv[0], "repack")) > + return midx_repack(opts.object_dir, (size_t)opts.batch_size); > + if (opts.batch_size) > -+ die(_("--batch-size option is only for 'repack' verb")); > ++ die(_("--batch-size option is only for 'repack' subcommand")); > + > if (!strcmp(argv[0], "write")) > return write_midx_file(opts.object_dir); > if (!strcmp(argv[0], "verify")) > +@@ > + if (!strcmp(argv[0], "expire")) > + return expire_midx_packs(opts.object_dir); > + > +- die(_("unrecognized verb: %s"), argv[0]); > ++ die(_("unrecognized subcommand: %s"), argv[0]); > + } > > diff --git a/midx.c b/midx.c > --- a/midx.c > @@ -125,6 +137,12 @@ > +test_expect_success 'repack with minimum size does not alter existing packs' ' > + ( > + cd dup && > ++ rm -rf .git/objects/pack && > ++ mv .git/objects/pack-backup .git/objects/pack && > ++ touch -m -t 201901010000 .git/objects/pack/pack-D* && > ++ touch -m -t 201901010001 .git/objects/pack/pack-C* && > ++ touch -m -t 201901010002 .git/objects/pack/pack-B* && > ++ touch -m -t 201901010003 .git/objects/pack/pack-A* && > + ls .git/objects/pack >expect && > + MINSIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 1) && > + git multi-pack-index repack --batch-size=$MINSIZE && > 8: a4c2d5a8e1 ! 8: ba1a1c7bbb midx: implement midx_repack() > @@ -149,6 +149,16 @@ > diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh > --- a/t/t5319-multi-pack-index.sh > +++ b/t/t5319-multi-pack-index.sh > +@@ > + git pack-objects --revs .git/objects/pack/pack-E <<-EOF && > + refs/heads/E > + EOF > +- git multi-pack-index write > ++ git multi-pack-index write && > ++ cp -r .git/objects/pack .git/objects/pack-backup > + ) > + ' > + > @@ > ) > ' > @@ -156,25 +166,28 @@ > +test_expect_success 'repack creates a new pack' ' > + ( > + cd dup && > -+ SECOND_SMALLEST_SIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 2 | tail -n 1) && > -+ BATCH_SIZE=$(($SECOND_SMALLEST_SIZE + 1)) && > -+ git multi-pack-index repack --batch-size=$BATCH_SIZE && > + ls .git/objects/pack/*idx >idx-list && > + test_line_count = 5 idx-list && > ++ THIRD_SMALLEST_SIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 3 | tail -n 1) && > ++ BATCH_SIZE=$(($THIRD_SMALLEST_SIZE + 1)) && > ++ git multi-pack-index repack --batch-size=$BATCH_SIZE && > ++ ls .git/objects/pack/*idx >idx-list && > ++ test_line_count = 6 idx-list && > + test-tool read-midx .git/objects | grep idx >midx-list && > -+ test_line_count = 5 midx-list > ++ test_line_count = 6 midx-list > + ) > +' > + > +test_expect_success 'expire removes repacked packs' ' > + ( > + cd dup && > -+ ls -S .git/objects/pack/*pack | head -n 3 >expect && > ++ ls -al .git/objects/pack/*pack && > ++ ls -S .git/objects/pack/*pack | head -n 4 >expect && > + git multi-pack-index expire && > + ls -S .git/objects/pack/*pack >actual && > + test_cmp expect actual && > + test-tool read-midx .git/objects | grep idx >midx-list && > -+ test_line_count = 3 midx-list > ++ test_line_count = 4 midx-list > + ) > +' > + > 9: b97fb35ba9 = 9: b1c6892417 multi-pack-index: test expire while adding packs > -: ---------- > 10: 481b08890f midx: add test that 'expire' respects .keep files > > -- > gitgitgadget With the exception of the broken test in patch 7, and some minor style diffs, this all looks good to me.