Re: [PATCH v2 0/4] [RFC] repack: add --filter=

Hi Taylor,

(resending this because my email client misbehaved and set the mime type to html)

On 26 Feb 2022, at 12:29, Taylor Blau wrote:

> On Sat, Feb 26, 2022 at 11:01:46AM -0500, John Cai wrote:
>> let me try to summarize (perhaps over simplify) the main concern folks
>> have with this feature, so please correct me if I'm wrong!
>>
>> As a user, if I apply a filter that ends up deleting objects that it
>> turns out do not exist anywhere else, then I have irrecoverably
>> corrupted my repository.
>>
>> Before git allows me to delete objects from my repository, it should
>> be pretty certain that I have path to recover those objects if I need
>> to.
>>
>> Is that correct? It seems to me that, put another way, we don't want
>> to give users too much rope to hang themselves.
>
> I wrote about my concerns in some more detail in [1], but the thing I
> was most unclear on was how your demo script[2] was supposed to work.
>
> Namely, I wasn't sure if you had intended to use two separate filters to
> "re-filter" a repository, one to filter objects to be uploaded to a
> content store, and another to filter objects to be expunged from the
> repository. I have major concerns with that approach, namely that if
> each of the filters is not exactly the inverse of the other, then we
> will either upload too few objects, or delete too many.

Thanks for bringing this up again. I meant to write back regarding what you raised
in the other part of this thread, and I think this is a valid concern. To attain
the goal of offloading certain blobs onto another server (B) and saving space on a
git server (A), there will essentially be two steps: one to upload objects to (B),
and one to remove objects from (A). As you said, these two need to be exact
inverses of each other, or else you might end up with missing objects.
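
Something like the following is roughly what I have in mind, as a sketch only
(the 1m threshold, the scp upload, and the host/path for (B) are made-up
placeholders; only the last step uses the --filter= option this series
proposes):

  # 1) list the blobs the filter would omit; --filter-print-omitted prefixes
  #    each omitted object with "~"
  git rev-list --objects --all --filter=blob:limit=1m --filter-print-omitted \
      | sed -n 's/^~//p' >large-blobs.txt

  # 2) pack those objects up and copy the pack to server (B)
  #    (the actual upload mechanism would be site-specific)
  git pack-objects --stdout <large-blobs.txt >large-blobs.pack
  scp large-blobs.pack b.example.com:/srv/git/blobstore/

  # 3) only once (2) is known to have succeeded, drop the objects from (A),
  #    using the *same* filter spec as in (1)
  git repack -a -d --filter=blob:limit=1m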

Thinking about it more, there is also an issue of timing. Even if the filters
are exact inverses of each other, let's say we have the following order of
events:

- (A)'s large blobs get uploaded to (B)
- a new large blob (C) gets added to (A)
- (A) gets repacked with a filter

In this case we could lose (C) forever. So it does seem like we need some built-in
guarantee that we only shed objects from the repo if we know we can retrieve them
later on.
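
A built-in guarantee is clearly the right fix, but just to illustrate the kind
of check I mean, here is a hand-rolled version (it assumes for the moment that
(B) is itself a plain git repository, which is not the case in the
http-promisor demo, and the host name and path are made up):

  # recompute the to-be-dropped set at repack time...
  git rev-list --objects --all --filter=blob:limit=1m --filter-print-omitted \
      | sed -n 's/^~//p' >to-drop.txt

  # ...and ask (B) whether it already has every one of those objects;
  # cat-file --batch-check prints "<oid> missing" for anything it lacks
  ssh b.example.com 'git -C /srv/git/blobstore.git cat-file --batch-check' \
      <to-drop.txt | grep missing

  # run 'git repack -a -d --filter=blob:limit=1m' only if nothing is missing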

>
> My other concern was around what guarantees we currently provide for a
> promisor remote. My understanding is that we expect an object which was
> received from the promisor remote to always be fetch-able later on. If
> that's the case, then I don't mind the idea of refiltering a repository,
> provided that you only need to specify a filter once.

Could you clarify what you mean by re-filtering a repository? I assumed it
meant specifying a filter (e.g. a 100MB blob limit) and then later narrowing
it by specifying a 50MB filter.
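
If so, with the proposed option that would look something like this (the
thresholds are just examples):

  git repack -a -d --filter=blob:limit=100m   # initial filter
  ...
  git repack -a -d --filter=blob:limit=50m    # later, a narrower filter
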
>
> So the suggestion about splitting a repository into two packs was a
> potential way to mediate the "two filter" problem, since the two packs
> you get exactly correspond to the set of objects that match the filter,
> and the set of objects that _don't_ match the filter.
>
> In either case, I tried to use the patches in [1] and was able to
> corrupt my local repository (even when fetching from a remote that held
> onto the objects I had pruned locally).
>
> Thanks,
> Taylor

Thanks!
John

>
> [1]: https://lore.kernel.org/git/YhUeUCIetu%2FaOu6k@nand.local/
> [2]: https://gitlab.com/chriscool/partial-clone-demo/-/blob/master/http-promisor/server_demo.txt#L47-52



