On Wed, Mar 12, 2025 at 03:04:58PM -0400, Taylor Blau wrote:

> > > I think this is similar to the discussion earlier in the thread, but let
> > > me know if there is something here I'm missing.
> >
> > I think the only thing you are missing is that max specified is the
> > ceiling, and "you can bust it, hoping by a little but you do not
> > know how huge the error is" is unacceptable.
>
> I agree that in the general case it is unacceptable. I think I might see
> it slightly different than you, since for cruft packs the idea is to
> bound the working set of what you're repacking using the size of the
> resulting packs as a proxy for the former.
>
> Maybe we should call the option something else that makes the cruft pack
> use-case clearer. But in the other thread I came around to the idea that
> this case is too niche to address completely, so I think we can discard
> this round as a terrible idea.

Having read through the thread, I think the naming is really the issue.
Elijah's earlier response was the most clear to me: the new feature is
not a max size at all, but a --combine-cruft-below-size. Our goal is not
to have small packs, but to avoid rewriting bytes over and over again
(which would happen if we simply rolled all cruft packs into one on
each repack), while also avoiding a zillion cruft packs (which would
happen if we never rolled them up).

So I think all of the discussion about "would we bust the max limit" is
mostly a red herring. But also, I think there is no need to tell
pack-objects to limit the size of the resulting cruft pack at all.

Let's say you have three cruft packs with sizes 30MB, 20MB, and 10MB,
and I want to roll up so that each cruft pack is at least 40MB. We
could say (and this is what I think your series does, based on our
earlier off-list discussions, but please correct me if things have
changed):

  1. All of those packs are candidates for rolling up, because they're
     below our threshold.

  2. We feed those packs along with the "max" size to pack-objects,
     which then rolls it all up into a 40MB pack, plus a 20MB pack
     left over.

We'll have written all of the bytes once, but on the next repack we'd
only rewrite 20MB (plus whatever new cruft comes along).

But do we actually care about eventually having a series of 40MB packs?
Or do we care about having some cutoff so that we don't rewrite those
first 40MB on subsequent repacks?

If the latter, then what if we don't feed a max size in step 2? We'd
end up with one 60MB pack (again, having written all of the bytes
once), and on the next repack we'd leave it be, since it's over the
threshold. We'd then start forming new packs, which would eventually
aggregate to 40MB (or possibly larger).

If I understand the main purpose of the series, it is that we must
rescue objects out of cruft packs if they have become fresher (by
having loose copies made). But that is orthogonal to max pack sizes,
isn't it? We just need pack-objects to be fed those objects (which
should be happening already) and to decide _not_ to omit them based on
their presence in the kept cruft packs (taking the mtimes recorded in
those packs into account, of course). Which looks like what your
want_cruft_object_mtime() is doing.

-Peff
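
Just to make the distinction concrete, here is a rough sketch (shell,
purely illustrative) of the candidate-selection step when we don't cap
the result: every cruft pack below the threshold (cruft packs being the
ones with a matching *.mtimes file) is a candidate, and all of the
candidates would get rolled into a single new pack, however large that
ends up. The 40MB number is just the one from the example above, and
the loop only prints what it would combine; it's not meant to match
what the series actually does.

    # Illustrative only: list the cruft packs that would be candidates
    # for rolling up under a 40MB threshold.
    threshold=$((40 * 1024 * 1024))
    for mtimes in .git/objects/pack/pack-*.mtimes
    do
        test -e "$mtimes" || continue   # no cruft packs at all
        pack=${mtimes%.mtimes}.pack
        size=$(wc -c <"$pack")
        if test "$size" -lt "$threshold"
        then
            echo "candidate for roll-up: $pack ($size bytes)"
        fi
    done

With a size cap handed to pack-objects, the combined result would
instead be split at the 40MB boundary, which is where the 40MB+20MB
outcome in step 2 comes from.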