Re: [PATCH 2/2] repack: add --filter=<filter-spec> option

Sorry, I forgot to include the link to Christian's demo; it is included below.

On 29 Jan 2022, at 14:14, John Cai wrote:

> Hi Stolee,
>
> Thanks for taking the time to review this patch! I added some points of clarification
> down below.
>
> On 27 Jan 2022, at 10:03, Derrick Stolee wrote:
>
>> On 1/26/2022 8:49 PM, John Cai via GitGitGadget wrote:
>>> From: John Cai <johncai86@xxxxxxxxx>
>>>
>>> Currently, repack does not work with partial clones. When repack is run
>>> on a partially cloned repository, it grabs all missing objects from
>>> promisor remotes. This also means that when gc is run for repository
>>> maintenance on a partially cloned repository, it will end up getting
>>> missing objects, which is not what we want.
>>
>> This shouldn't be what is happening. Do you have a demonstration of
>> this happening? repack_promisor_objects() should be avoiding following
>> links outside of promisor packs so we can safely 'git gc' in a partial
>> clone without downloading all reachable blobs.
>
> You're right, sorry I was mistaken about this detail of how partial clones work.
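
A quick way to see this for oneself, as an untested sketch, is to make a
partial clone, run gc, and check that the missing blobs stay missing (here
"server" stands for any repository with uploadpack.allowFilter enabled):

  # untested sketch: gc in a partial clone should not fetch filtered blobs
  git clone --bare --filter=blob:none "file://$(pwd)/server" pc.git
  git -C pc.git rev-list --objects --all --missing=print | grep "^?" >before
  git -C pc.git gc
  git -C pc.git rev-list --objects --all --missing=print | grep "^?" >after
  diff before after   # no difference: gc fetched nothing from the promisor
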
>>
>>> In order to make repack work with partial clone, teach repack a new
>>> option --filter, which takes a <filter-spec> argument. repack will skip
>>> any objects that are matched by <filter-spec> similar to how the clone
>>> command will skip fetching certain objects.
>>
>> This is a bit misleading, since 'git clone' doesn't "skip fetching",
>> but instead requests a filter and the server can choose to write a
>> pack-file using that filter. I'm not sure if it's worth how pedantic
>> I'm being here.
>
> Thanks for the more precise description of the mechanics of partial clone.
> I'll improve the wording in the next version of this patch series.
>
>>
>> The thing that I find confusing here is that you are adding an option
>> that could be run on a _full_ repository. If I have a set of packs
>> and none of them are promisor (I have every reachable object), then
>> what is the end result after 'git repack -adf --filter=blob:none'?
>> Those existing pack-files shouldn't be deleted because they have
>> objects that are not in the newly-created pack-file.
>>
>> I'd like to see some additional clarity on this before continuing
>> to review this series.
>
> Apologies for the lack of clarity. This is indeed the most important detail
> of the patch to provide context on, since, as you said, it involves deleting
> objects from a full repository.
>
> To back up a little, the goal is to be able to offload large
> blobs to a separate http server. Christian Couder has a demo [1] that shows
> this in detail.
>
> If we had the following:
> A. an http server to use as a generalized object store
> B. a server update hook that uploads large blobs to A.
> C. a git server
> D. a regular job that runs `git repack --filter` to remove large
> blobs from C.
>
> Clients would need to configure both C) and A) as promisor remotes to
> be able to get everything. When they push new large blobs, they can
> still push them to C), as B) will upload them to A), and D) will
> regularly remove those large blobs from C).
>
> This way, with a little bit of client and server configuration, we can have
> a native way to support offloading large files without Git LFS. It is also
> more flexible, since which blobs count as large files can be adjusted in
> B) and D).
>

[1] https://gitlab.com/chriscool/partial-clone-demo/-/blob/master/http-promisor/server_demo.txt
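
To make the setup above a bit more concrete, the pieces might be wired up
roughly as below. This is only a sketch: the remote name "httpstore", the
example URL, and the size threshold are made up, and repack's --filter is
what this series proposes.

  # on the client: both the git server (C) and the object store (A) are
  # configured as promisor remotes, so missing blobs can come from either
  git -C repo.git config remote.origin.promisor true
  git -C repo.git config remote.origin.partialclonefilter blob:limit=1m
  git -C repo.git remote add httpstore https://example.com/objects
  git -C repo.git config remote.httpstore.promisor true

  # on the git server (C): the regular job (D) drops the large blobs that
  # the update hook (B) has already uploaded to the object store (A)
  git -C repo.git repack -a -d --filter=blob:limit=1m
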

>>
>>> The final goal of this feature, is to be able to store objects on a
>>> server other than the regular git server itself.
>>>
>>> There are several scripts added so we can test the process of using a
>>> remote helper to upload blobs to an http server:
>>>
>>> - t/lib-httpd/list.sh lists blobs uploaded to the http server.
>>> - t/lib-httpd/upload.sh uploads blobs to the http server.
>>> - t/t0410/git-remote-testhttpgit a remote helper that can access blobs
>>>   on an http server. Copied over from t/t5801/git-remote-testhttpgit
>>>   and modified to upload blobs to an http server.
>>> - t/t0410/lib-http-promisor.sh convenience functions for uploading
>>>   blobs
>>
>> I think these changes to the tests should be extracted to a new
>> patch where this can be discussed in more detail. I didn't look
>> too closely at them because I want to focus on whether this
>> --filter option is a good direction for 'git repack'.
>>
>>>  		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
>>> @@ -819,6 +824,11 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>>>  		if (line.len != the_hash_algo->hexsz)
>>>  			die(_("repack: Expecting full hex object ID lines only from pack-objects."));
>>>  		string_list_append(&names, line.buf);
>>> +		if (po_args.filter) {
>>> +			char *promisor_name = mkpathdup("%s-%s.promisor", packtmp,
>>> +							line.buf);
>>> +			write_promisor_file(promisor_name, NULL, 0);
>>
>> This code is duplicated in repack_promisor_objects(), so it would be
>> good to extract that logic into a helper method called by both places.
>
> Thanks for pointing this out. I'll incorporate this into the next version.
>>
>>> +		}
>>>  	}
>>>  	fclose(out);
>>>  	ret = finish_command(&cmd);
>>
>>> diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
>>> index e489869dd94..78cc1858cb6 100755
>>> --- a/t/t7700-repack.sh
>>> +++ b/t/t7700-repack.sh
>>> @@ -237,6 +237,26 @@ test_expect_success 'auto-bitmaps do not complain if unavailable' '
>>>  	test_must_be_empty actual
>>>  '
>>>
>>> +test_expect_success 'repack with filter does not fetch from remote' '
>>> +	rm -rf server client &&
>>> +	test_create_repo server &&
>>> +	git -C server config uploadpack.allowFilter true &&
>>> +	git -C server config uploadpack.allowAnySHA1InWant true &&
>>> +	echo content1 >server/file1 &&
>>> +	git -C server add file1 &&
>>> +	git -C server commit -m initial_commit &&
>>> +	expected="?$(git -C server rev-parse :file1)" &&
>>> +	git clone --bare --no-local server client &&
>>
>> You could use "file://$(pwd)/server" here instead of "server".
>
> good point, thanks
>
>>
>>> +	git -C client config remote.origin.promisor true &&
>>> +	git -C client -c repack.writebitmaps=false repack -a -d --filter=blob:none &&
>>
>> This isn't testing what you want it to test, because your initial
>> clone doesn't use --filter=blob:none, so you already have all of
>> the objects in the client. You would never trigger a need for a
>> fetch from the remote.
>
> Right, so this test is actually checking that repack --filter sheds objects,
> showing that it can serve as D), the regular cleanup job for git servers that
> use another http server to host large blobs.
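
If we also wanted a test for the no-fetch behaviour you describe, one way,
as an untested sketch reusing the same server repo and file1 blob, would be
to start from a filtered clone so the client genuinely lacks the blob:

  git clone --bare --no-local --filter=blob:none "file://$(pwd)/server" client &&
  git -C client -c repack.writebitmaps=false repack -a -d --filter=blob:none &&
  git -C client rev-list --objects --all --missing=print >objects &&
  grep "?$(git -C server rev-parse :file1)" objects
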
>
>>
>>> +	git -C client rev-list --objects --all --missing=print >objects &&
>>> +	grep "$expected" objects &&
>>> +	git -C client repack -a -d &&
>>> +	expected="$(git -C server rev-parse :file1)" &&
>>
>> This is signalling to me that you are looking for a remote fetch
>> now that you are repacking everything, and that can only happen
>> if you deleted objects from the client during your first repack.
>> That seems incorrect.
>>
>>> +	git -C client rev-list --objects --all --missing=print >objects &&
>>> +	grep "$expected" objects
>>> +'
>>
>> Based on my current understanding, this patch seems unnecessary (repacks
>> should already be doing the right thing when in the presence of a partial
>> clone) and incorrect (we should not delete existing reachable objects
>> when repacking with a filter).
>>
>> I look forward to hearing more about your intended use of this feature so
>> we can land on a better way to solve the problems you are having.
>
> Thanks for the callouts on the big picture of this proposed change. Looking
> forward to getting your thoughts on this!
>>
>> Thanks,
>> -Stolee



