Sorry, I forgot to include the link to Christian's demo. It is included below.

On 29 Jan 2022, at 14:14, John Cai wrote:

> Hi Stolee,
>
> Thanks for taking the time to review this patch! I added some points of
> clarification down below.
>
> On 27 Jan 2022, at 10:03, Derrick Stolee wrote:
>
>> On 1/26/2022 8:49 PM, John Cai via GitGitGadget wrote:
>>> From: John Cai <johncai86@xxxxxxxxx>
>>>
>>> Currently, repack does not work with partial clones. When repack is run
>>> on a partially cloned repository, it grabs all missing objects from
>>> promisor remotes. This also means that when gc is run for repository
>>> maintenance on a partially cloned repository, it will end up getting
>>> missing objects, which is not what we want.
>>
>> This shouldn't be what is happening. Do you have a demonstration of
>> this happening? repack_promisor_objects() should be avoiding following
>> links outside of promisor packs, so we can safely 'git gc' in a partial
>> clone without downloading all reachable blobs.
>
> You're right; sorry, I was mistaken about this detail of how partial clones work.
>
>>> In order to make repack work with partial clone, teach repack a new
>>> option --filter, which takes a <filter-spec> argument. repack will skip
>>> any objects that are matched by <filter-spec>, similar to how the clone
>>> command will skip fetching certain objects.
>>
>> This is a bit misleading, since 'git clone' doesn't "skip fetching",
>> but instead requests a filter and the server can choose to write a
>> pack-file using that filter. I'm not sure if it's worth how pedantic
>> I'm being here.
>
> Thanks for the more precise description of the mechanics of partial clone.
> I'll improve the wording in the next version of this patch series.
>
>> The thing that I find confusing here is that you are adding an option
>> that could be run on a _full_ repository.
>> If I have a set of packs
>> and none of them are promisor packs (I have every reachable object), then
>> what is the end result after 'git repack -adf --filter=blob:none'?
>> Those existing pack-files shouldn't be deleted, because they have
>> objects that are not in the newly created pack-file.
>>
>> I'd like to see some additional clarity on this before continuing
>> to review this series.
>
> Apologies for the lack of clarity. Indeed, I can see why this is the most
> important detail of this patch to provide enough context on, as it involves
> deleting objects from a full repository, as you said.
>
> To back up a little: the goal is to be able to offload large blobs to a
> separate http server. Christian Couder has a demo [1] that shows this in
> detail.
>
> Suppose we had the following:
>
> A. an http server to use as a generalized object store
> B. a server update hook that uploads large blobs to A
> C. a git server
> D. a regular job that runs `git repack --filter` to remove large blobs
>    from C
>
> Clients would need to configure both C and A as promisor remotes to be
> able to get everything. When they push new large blobs, they can still
> push them to C, as B will upload them to A, and D will regularly remove
> those large blobs from C.
>
> This way, with a little bit of client and server configuration, we have a
> native way to support offloading large files without Git LFS. It would
> also be more flexible, since you can easily tweak which blobs are
> considered large files by adjusting B and D.
>
> [1] https://gitlab.com/chriscool/partial-clone-demo/-/blob/master/http-promisor/server_demo.txt
>
>>> The final goal of this feature is to be able to store objects on a
>>> server other than the regular git server itself.
>>>
>>> There are several scripts added so we can test the process of using a
>>> remote helper to upload blobs to an http server:
>>>
>>> - t/lib-httpd/list.sh lists blobs uploaded to the http server.
>>> - t/lib-httpd/upload.sh uploads blobs to the http server.
>>> - t/t0410/git-remote-testhttpgit: a remote helper that can access blobs
>>>   from an http server. Copied over from t/t5801/git-remote-testhttpgit
>>>   and modified to upload blobs to an http server.
>>> - t/t0410/lib-http-promisor.sh: convenience functions for uploading
>>>   blobs.
>>
>> I think these changes to the tests should be extracted to a new
>> patch where they can be discussed in more detail. I didn't look
>> too closely at them because I want to focus on whether this
>> --filter option is a good direction for 'git repack'.
>>
>>>  		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
>>> @@ -819,6 +824,11 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>>>  			if (line.len != the_hash_algo->hexsz)
>>>  				die(_("repack: Expecting full hex object ID lines only from pack-objects."));
>>>  			string_list_append(&names, line.buf);
>>> +			if (po_args.filter) {
>>> +				char *promisor_name = mkpathdup("%s-%s.promisor", packtmp,
>>> +								line.buf);
>>> +				write_promisor_file(promisor_name, NULL, 0);
>>
>> This code is duplicated in repack_promisor_objects(), so it would be
>> good to extract that logic into a helper method called by both places.
>
> Thanks for pointing this out. I'll incorporate this into the next version.
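> As an aside, for anyone following along who wants to see the promisor
> marker being discussed: stock git (independent of this patch) already
> writes a `*.promisor` file next to each pack fetched from a promisor
> remote, and a local blob:none clone is enough to observe it. A minimal
> sketch, with made-up repository names:

```shell
# Set up a tiny "server" repo that allows filtered fetches.
git init --quiet pc-server
git -C pc-server config uploadpack.allowFilter true
git -C pc-server config uploadpack.allowAnySHA1InWant true
echo content >pc-server/file
git -C pc-server add file
git -C pc-server -c user.name=demo -c user.email=demo@example.com \
	commit --quiet -m initial

# A blob:none partial clone over the file:// (non-local) transport.
git clone --quiet --bare --filter=blob:none "file://$PWD/pc-server" pc-client.git

# The fetched pack is marked as a promisor pack ...
ls pc-client.git/objects/pack/*.promisor

# ... and the filtered-out blob is reported as missing ("?" prefix).
git -C pc-client.git rev-list --objects --all --missing=print | grep "^?"
```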
>>
>>> +			}
>>>  		}
>>>  	fclose(out);
>>>  	ret = finish_command(&cmd);
>>
>>> diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
>>> index e489869dd94..78cc1858cb6 100755
>>> --- a/t/t7700-repack.sh
>>> +++ b/t/t7700-repack.sh
>>> @@ -237,6 +237,26 @@ test_expect_success 'auto-bitmaps do not complain if unavailable' '
>>>  	test_must_be_empty actual
>>>  '
>>>
>>> +test_expect_success 'repack with filter does not fetch from remote' '
>>> +	rm -rf server client &&
>>> +	test_create_repo server &&
>>> +	git -C server config uploadpack.allowFilter true &&
>>> +	git -C server config uploadpack.allowAnySHA1InWant true &&
>>> +	echo content1 >server/file1 &&
>>> +	git -C server add file1 &&
>>> +	git -C server commit -m initial_commit &&
>>> +	expected="?$(git -C server rev-parse :file1)" &&
>>> +	git clone --bare --no-local server client &&
>>
>> You could use "file://$(pwd)/server" here instead of "server".
>
> Good point, thanks.
>
>>> +	git -C client config remote.origin.promisor true &&
>>> +	git -C client -c repack.writebitmaps=false repack -a -d --filter=blob:none &&
>>
>> This isn't testing what you want it to test, because your initial
>> clone doesn't use --filter=blob:none, so you already have all of
>> the objects in the client. You would never trigger a need for a
>> fetch from the remote.
>
> Right, so this test is actually testing that repack --filter sheds objects,
> to show that it can be used as D), a regular cleanup job for git servers
> that use another http server to host large blobs.
>
>>> +	git -C client rev-list --objects --all --missing=print >objects &&
>>> +	grep "$expected" objects &&
>>> +	git -C client repack -a -d &&
>>> +	expected="$(git -C server rev-parse :file1)" &&
>>
>> This is signalling to me that you are looking for a remote fetch
>> now that you are repacking everything, and that can only happen
>> if you deleted objects from the client during your first repack.
>> That seems incorrect.
>>
>>> +	git -C client rev-list --objects --all --missing=print >objects &&
>>> +	grep "$expected" objects
>>> +'
>>
>> Based on my current understanding, this patch seems unnecessary (repacks
>> should already be doing the right thing in the presence of a partial
>> clone) and incorrect (we should not delete existing reachable objects
>> when repacking with a filter).
>>
>> I look forward to hearing more about your intended use of this feature so
>> we can land on a better way to solve the problems you are having.
>
> Thanks for the callouts on the big picture of this proposed change. Looking
> forward to getting your thoughts on this!
>
>> Thanks,
>> -Stolee
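P.S. For anyone who wants to experiment with the <filter-spec> semantics
discussed above without setting up a server: `git rev-list` accepts the
same `--filter` syntax that clone and fetch do, so you can see exactly
which objects a given filter omits. A small sketch with a made-up
repository name:

```shell
# Throwaway repo with one commit touching one blob.
git init --quiet filter-demo
echo content >filter-demo/file
git -C filter-demo add file
git -C filter-demo -c user.name=demo -c user.email=demo@example.com \
	commit --quiet -m initial

blob=$(git -C filter-demo rev-parse HEAD:file)

# An unfiltered traversal lists the blob ...
git -C filter-demo rev-list --objects HEAD | grep "$blob"

# ... while blob:none omits it, mirroring what a blob:none fetch transfers.
! git -C filter-demo rev-list --objects --filter=blob:none HEAD | grep -q "$blob"
```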