Re: [PATCH 09/24] repack: implement `--extend-disjoint` mode

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Dec 08, 2023 at 05:48:28PM -0500, Taylor Blau wrote:
> On Fri, Dec 08, 2023 at 09:19:25AM +0100, Patrick Steinhardt wrote:
> > > > One thing I wondered: do we need to consider the `-l` flag? When using
> > > > an alternate object directory it is totally feasible that the alternate
> > > > may be creating new disjoint packages without us knowing, and thus we
> > > > may not be able to guarantee the disjoint property anymore.
> > >
> > > I don't think so. We'd only care about one direction of this (that
> > > alternates do not create disjoint packs which overlap with ours, instead
> > > of the other way around), but since we don't put non-local packs in the
> > > MIDX, I think we're OK.
> > >
> > > I suppose that you might run into trouble if you use the chained MIDX
> > > thing (via its `->next` pointer). I haven't used that feature myself, so
> > > I'd have to play around with it.
> >
> > We do use this feature at GitLab for forks, where forks connect to a
> > common alternate object directory to deduplicate objects. As both the
> > fork repository and the alternate object directory use an MIDX I think
> > they would be set up exactly like that.
> 
> Yep, that's right. I wasn't sure whether or not this feature had been
> used extensively in production or not (we don't use it at GitHub, since
> objects only live in their fork repositories for a short while before
> moving to the fork network repository).
> 
> > I guess the only really viable solution here is to ignore disjoint packs
> > in the main repo that connects to the alternate in the case where the
> > alternate has any disjoint packs itself.
> 
> I think the behavior you'd get here is that we'd only look for disjoint
> packs in the first MIDX in the chain (which is always the local one),
> and we'd only recognizes packs from that MIDX as being potentially
> disjoint.
> 
> If you have the bulk of your repositories in the alternate, then I think
> you might want to consider how we combine the two. 

Yes, in the general case the bulk of objects is indeed contained in the
alternate.

> My sense is that you'd want to be disjoint with respect to anything
> downstream of you.

Ideally yes, but this is unfortunately not easily achievable in the
general case. It's one of the many painpoints that alternates bring with
them.

Suppose two forks A and B are connected to the same alternate. Both A
and B now gain the same set of objects via whatever means. At this point
these objects can be stored in disjoint packs in each of the repos as
they are not yet deduplicated via the alternate. But if you were to pull
objects from either A or B into the alternate then you cannot ensure
disjointedness at all anymore because you would first have to repack
objects in both A and B.

For two forks this might still seem manageable. But as soon as your fork
network grows larger it's clear that this becomes almost impossible to
do. So ultimately, I don't see an alternative to ignoring disjointedness
bits in either of the two linked-together repos.

> Whether or not this is a feature that you/others need, I definitely
> think we should leave it out of this series, since I am (a) fairly
> certain that this is possible to do, and (b) already feel like this
> series on its own is complicated enough.

I'm perfectly fine if we say that the benefits of your patch series
cannot yet be applied to repositories with alternates. But from my point
of view it's a requirement that this patch series does not silently
break this usecase due to Git starting to generate disjointed packs
where it cannot ensure that the disjointedness property holds.

As I haven't yet read through this whole patch series, the question is
boils down to whether the end result is opt-in or opt-out. If it was
opt-out then I could see the above usecase breaking silently. If it was
opt-in then things should be fine and we can address this ommission in a
follow up patch series. We at GitLab would definitely be interested in
helping out with this given that it directly affects us and that the
demonstrated savings seem very promising.

Patrick

Attachment: signature.asc
Description: PGP signature


[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux