On Thu, Nov 30, 2023 at 02:32:24PM -0500, Taylor Blau wrote:
> On Thu, Nov 30, 2023 at 11:18:57AM +0100, Patrick Steinhardt wrote:
> > > Instead, teach `pack-objects` a special `--ignore-disjoint` which is the
> > > moral equivalent of marking the set of disjoint packs as kept, and
> > > ignoring their contents, even if it would have otherwise been packed. In
> > > fact, this similarity extends down to the implementation, where each
> > > disjoint pack is first loaded, then has its `pack_keep_in_core` bit set.
> > >
> > > With this in place, we can use the kept-pack cache from 20b031fede
> > > (packfile: add kept-pack cache for find_kept_pack_entry(), 2021-02-22),
> > > which looks up objects first in a cache containing just the set of kept
> > > (in this case, disjoint) packs. Assuming that the set of disjoint packs
> > > is a relatively small portion of the entire repository (which should be
> > > a safe assumption to make), each object lookup will be very inexpensive.
> >
> > This caught me by surprise a bit. I'd have expected that in the end,
> > most of the packfiles in a repository would be disjoint. Using for
> > example geometric repacks, my expectation was that all of the packs that
> > get written via geometric repacking would eventually become disjoint
> > whereas new packs added to the repository would initially not be.
>
> Which part are you referring to here? If you're referring to the part
> where I say that the set of disjoint packs is relatively small in
> proportion to the rest of the packs, I think I know where the confusion
> is.

Yeah, that's what I was referring to.

> I'm not saying that the set of disjoint packs is small in comparison to
> the rest of the repository by object count, but rather by count of packs
> overall. You're right that packs from pushes will not be guaranteed to
> be disjoint upon entering the repository, but will become disjoint when
> geometrically repacked (assuming that the caller uses --ignore-disjoint
> when repacking).

I was actually thinking about it in terms of the number of packfiles,
not the number of objects. I'm mostly coming from the angle of geometric
repacking here, where it is totally expected that you have a
comparatively large number of packfiles when your repository is big.

With a geometric factor of 2, you'll have up to `log2($numobjects)` many
packfiles in your repo while keeping the geometric sequence intact. In
something like linux.git with almost 10M objects that boils down to 23
packfiles, and I'd assume that all of these would be disjoint in the
best case. So if you gain new packfiles by pushing into the repository
then I'd think that it's quite likely that the number of non-disjoint
packfiles is smaller than the number of disjoint ones.

I do realize though that in absolute numbers, this isn't all that many.
I was also thinking ahead though to a future where we have something
like geometric repacking with maximum packfile sizes working well
together, so that we'll be able to merge packfiles together until they
reach a certain maximum size, and afterwards they are just left alone.
This would help to avoid those "surprise" repack cases where everything
is again packed into a single packfile for the biggest repositories out
there. But it would of course also lead to an increase in packfiles in
those huge repositories.

Anyway, I feel like I'm rambling. In the end it's probably going to be
fine, I was simply surprised by your assumption that the number of
disjoint packfiles should usually be much smaller than the number of
non-disjoint ones.

Patrick
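
For reference, the back-of-the-envelope bound above can be reproduced
with a rough sketch like the one below. It only mirrors the
`log2($numobjects)` approximation from this mail and is not how
git-repack computes anything; the function name `max_geometric_packs` is
made up for illustration.

```python
def max_geometric_packs(num_objects: int, factor: int = 2) -> int:
    """Approximate upper bound on the number of packfiles in a geometric
    sequence, i.e. floor(log_factor(num_objects)).

    Hypothetical helper: integer arithmetic is used instead of math.log()
    to avoid floating-point rounding at exact powers of the factor.
    """
    if num_objects < 1:
        raise ValueError("num_objects must be at least 1")
    if factor < 2:
        raise ValueError("factor must be at least 2")

    count = 0
    size = 1
    # Count how often the pack size can grow by `factor` before it
    # exceeds the total number of objects.
    while size * factor <= num_objects:
        size *= factor
        count += 1

    # A non-empty repository always has at least one pack.
    return max(count, 1)


if __name__ == "__main__":
    # Roughly the linux.git case mentioned above: ~10M objects, factor 2.
    print(max_geometric_packs(10_000_000))
```

With about 10M objects and a factor of 2 this yields 23, matching the
number quoted above.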