Re: [PATCH 08/24] pack-objects: implement `--ignore-disjoint` mode

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Nov 30, 2023 at 02:32:24PM -0500, Taylor Blau wrote:
> On Thu, Nov 30, 2023 at 11:18:57AM +0100, Patrick Steinhardt wrote:
> > > Instead, teach `pack-objects` a special `--ignore-disjoint` which is the
> > > moral equivalent of marking the set of disjoint packs as kept, and
> > > ignoring their contents, even if it would have otherwise been packed. In
> > > fact, this similarity extends down to the implementation, where each
> > > disjoint pack is first loaded, then has its `pack_keep_in_core` bit set.
> > >
> > > With this in place, we can use the kept-pack cache from 20b031fede
> > > (packfile: add kept-pack cache for find_kept_pack_entry(), 2021-02-22),
> > > which looks up objects first in a cache containing just the set of kept
> > > (in this case, disjoint) packs. Assuming that the set of disjoint packs
> > > is a relatively small portion of the entire repository (which should be
> > > a safe assumption to make), each object lookup will be very inexpensive.
> >
> > This cought me by surprise a bit. I'd have expected that in the end,
> > most of the packfiles in a repository would be disjoint. Using for
> > example geometric repacks, my expectation was that all of the packs that
> > get written via geometric repacking would eventually become disjoint
> > whereas new packs added to the repository would initially not be.
> 
> Which part are you referring to here? If you're referring to the part
> where I say that the set of disjoint packs is relatively small in
> proposition to the rest of the packs, I think I know where the confusion
> is.

Yeah, that's what I was referring to.

> I'm not saying that the set of disjoint packs is small in comparison to
> the rest of the repository by object count, but rather by count of packs
> overall. You're right that packs from pushes will not be guaranteed to
> be disjoint upon entering the repository, but will become disjoint when
> geometrically repacked (assuming that the caller uses --ignore-disjoint
> when repacking).

I was actually thinking about it in the number of packfiles, not number
of objects. I'm mostly coming from the angle of geometric repacking
here, where it is totally expected that you have a comparatively large
number of packfiles when your repository is big. With a geometric factor
of 2, you'll have up to `log2($numobjects)` many packfiles in your repo
while keeping the geometric sequence intact.

In something like linux.git with almost 10M objects that boils down to
23 packfiles, and I'd assume that all of these would be disjoint in the
best case. So if you gain new packfiles by pushing into the repository
then I'd think that it's quite likely that the number of non-disjoint
packfiles is smaller than the number of disjoint ones.

I do realize though that in absolute numbers, this isn't all that many.
I was also thinking ahead though to a future where we have something
like geometric repacking with maximum packfile sizes working well
together so that we'll be able to merge packfiles together until they
reach a certain maximum size, and afterwards they are just left alone.
This would help to avoid those "surprise" repack cases where everything
is again packed into a single packfile for the biggest repositories out
there. But it would of course also lead to an increase in packfiles in
those huge repositories.

Anyway, I feel like I'm rambling. In the end it's probably going to be
fine, I was simply surprised by your assumption that the number of
disjoint packfiles should usually be much smaller than the number of
non-disjoint ones.

Patrick

Attachment: signature.asc
Description: PGP signature


[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux