Re: [PATCH 09/24] repack: implement `--extend-disjoint` mode

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Dec 07, 2023 at 03:28:18PM -0500, Taylor Blau wrote:
> On Thu, Dec 07, 2023 at 02:13:08PM +0100, Patrick Steinhardt wrote:
> > >   - cruft packs (which may necessarily need to include an object from a
> > >     disjoint pack in order to freshen it in certain circumstances)
> >
> > This one took me a while to figure out. If we'd mark crufts as disjoint,
> > then it would mean that new packfiles cannot be marked as disjoint if
> > objects which were previously unreachable do become reachable again.
> > So we'd be pessimizing packfiles for live objects in favor of others
> > which aren't.
> 
> Yeah, that's right, too. There are a couple of cases where more than one
> cruft pack may contain the same object, one of them being the
> flip-flopping between reachable and unreachable as you suggest above.
> Another is that you have a non-prunable unreachable object which is
> already in a cruft pack. If the object's mtime gets updated (and still
> cannot be pruned), we'll end up freshening the object loose, and then
> packing it again (with the more recent mtime) into a new cruft pack.
> 
> That aside, I actually think that there are ways to mark cruft packs
> disjoint. But they're complicated, and moreover, I don't think you'd
> ever *want* to mark a cruft pack as disjoint. Cruft packs usually
> contain garbage, which is unlikely to be useful to any fetches/clones.
> 
> If we did mark them as disjoint, it would mean that we could reuse
> verbatim sections of the cruft pack in our output, but we would likely
> end up with very few such sections.

Makes sense. It also doesn't feel worth it to introduce additional
complexity for objects that for most of the part wouldn't ever be served
on a fetch anyway.

[snip]
> > Okay. I had a bit of trouble to sift through the various different
> > flags like "--retain-disjoint", "--extend-disjoint", "--ignore-disjoint"
> > and so on. But well, they do different things and it's been a few days
> > since I've reviewed the preceding patches, so this is probably fine.
> 
> Yeah, I am definitely open to better naming conventions here? I figured
> that:
> 
>   - --retain-disjoint was a good name for the MIDX option, since it is
>     retaining existing disjoint packs in the new MIDX
>   - --extend-disjoint was a good name for the repack option, since it is
>     extending the set of disjoint packs
>   - --ignore-disjoint was a good name for the pack-objects option, since
>     it is ignoring objects in disjoint packs
> 
> Writing this out, I think that you could make an argument that
> `--exclude-disjoint` is a better name for the last option. So I'm
> definitely open to suggestions here, but I don't want to get too bogged
> down on command-line option naming (so long as we're all reasonably
> happy with the result).

Yeah, as said, I don't mind it too much. It's a complex area and the
flags all do different things, so it's expected that you may have to do
some research on what exactly they do. That being said, I do like your
proposed `--exclude-disjoint` a lot more than `--ignore-disjoint`.

> > One thing I wondered: do we need to consider the `-l` flag? When using
> > an alternate object directory it is totally feasible that the alternate
> > may be creating new disjoint packages without us knowing, and thus we
> > may not be able to guarantee the disjoint property anymore.
> 
> I don't think so. We'd only care about one direction of this (that
> alternates do not create disjoint packs which overlap with ours, instead
> of the other way around), but since we don't put non-local packs in the
> MIDX, I think we're OK.
> 
> I suppose that you might run into trouble if you use the chained MIDX
> thing (via its `->next` pointer). I haven't used that feature myself, so
> I'd have to play around with it.

We do use this feature at GitLab for forks, where forks connect to a
common alternate object directory to deduplicate objects. As both the
fork repository and the alternate object directory use an MIDX I think
they would be set up exactly like that.

I guess the only really viable solution here is to ignore disjoint packs
in the main repo that connects to the alternate in the case where the
alternate has any disjoint packs itself.

Patrick

Attachment: signature.asc
Description: PGP signature


[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux