Re: [PATCH v3] builtin/pack-objects.c: introduce `pack.extraCruftTips`

Jeff King <peff@xxxxxxxx> · Fri, 5 May 2023 17:39:46 -0400

On Wed, May 03, 2023 at 04:18:44PM -0700, Junio C Hamano wrote:

> > +	When generating a cruft pack, use the shell to execute the
> > +	specified command(s), and interpret their output as additional
> > +	tips of objects to keep in the cruft pack, regardless of their
> 
> What is a "tip of an object"?  The first byte ;-)?
> 
> A "tip of history" would only imply commit objects, but presumably
> you would want to specify a tree and protect all the blobs and trees
> it recursively contains, so that is not a good name for it.

"tips of the object graph" perhaps?

> > +		if (parse_oid_hex(buf.buf, &oid, &rest) || *rest) {
> > +			ret = error(_("invalid extra cruft tip: '%s'"), buf.buf);
> > +			goto done;
> > +		}
> > +
> > +		type = oid_object_info(the_repository, &oid, NULL);
> > +		if (type < 0)
> > +			continue;
> > +
> > +		obj = lookup_object_by_type(the_repository, &oid, type);
> > +		if (!obj)
> > +			continue;
> 
> Hmph, we may want to have an interface that lets us avoid looking up
> the same oid twice in the same set of tables.  Given an object
> unseen so far, oid_object_info() should have done most of the work
> necessary for lookup_object_by_type() to get to and start parsing
> the data of the object in the good case (i.e. object exists and in a
> pack---just we haven't needed it yet), but in the above sequence
> there is not enough information passed between two calls to take
> advantage of it.

This code was my suggestion, but it may have actually been a bad
direction.

I don't think communicating between oid_object_info() and
lookup_object_by_type() is important. The latter is only doing a lookup
in the internal hash with lookup_object(), and then auto-vivifying using
the type if necessary (which we provide to it).

The bigger inefficiency is that we call oid_object_info() before seeing
if we have already instantiated an object struct via lookup_object().
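
To make the ordering concrete, here is a toy sketch in plain C. It is not Git code: `toy_lookup_object()` and `toy_object_info()` are made-up stand-ins for lookup_object() and oid_object_info(), and objects are keyed by a small integer rather than an oid. The point is only that the in-memory check comes first, so the expensive query runs at most once per object.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; none of these are Git's real types or functions. */
enum toy_type { TOY_NONE = 0, TOY_COMMIT, TOY_TREE, TOY_BLOB };

struct toy_object {
	int in_use;
	enum toy_type type;
};

#define TOY_MAX_ID 16

static struct toy_object toy_table[TOY_MAX_ID]; /* in-memory object structs */
static int toy_odb_queries;                     /* counts "expensive" lookups */

/* Models lookup_object(): in-memory table only, never touches storage. */
static struct toy_object *toy_lookup_object(unsigned id)
{
	assert(id < TOY_MAX_ID);
	return toy_table[id].in_use ? &toy_table[id] : NULL;
}

/* Models oid_object_info(): pretend every call has to consult storage. */
static enum toy_type toy_object_info(unsigned id)
{
	toy_odb_queries++;
	/* arbitrary rule for the demo: ids divisible by 3 are blobs */
	return id % 3 ? TOY_COMMIT : TOY_BLOB;
}

/* Cheap check first; only ask storage for the type on a miss. */
static struct toy_object *toy_get_object(unsigned id)
{
	struct toy_object *obj = toy_lookup_object(id);
	if (obj)
		return obj;

	obj = &toy_table[id];
	obj->type = toy_object_info(id);
	obj->in_use = 1;
	return obj;
}
```

Calling `toy_get_object()` twice with the same id bumps `toy_odb_queries` only once; the second call is satisfied from the table.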

Obviously we could do that first. But let's take a step back. My
original suggestion was thinking that we don't want to call
parse_object() because it's expensive, especially for a blob. But in the
long run, most of these objects (except blobs!) will end up parsed
anyway, because we are going to see which other objects they reach.

So it's OK to parse anything except blobs. And indeed, we have a better
tool for that these days:

  obj = parse_object_with_flags(r, oid, PARSE_OBJECT_SKIP_HASH_CHECK);

That does exactly what we want. If we already saw and parsed the object,
it's a quick noop after a hash lookup. If we didn't, then it already has
optimizations to avoid reading object contents if possible (checking the
commit graph, checking the type for blobs).
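
As a rough illustration of that behavior, here is a self-contained toy model in C. It is a sketch under assumptions, not the real function: `demo_parse()`, `DEMO_SKIP_HASH_CHECK` and the struct are hypothetical names, and the commit-graph shortcut is left out entirely. It only models the two cases above: an already-parsed object is a noop, and a blob with the hash check skipped never has its contents read.

```c
#include <assert.h>

/* Toy model only; the names below are made up and are not Git's API. */
enum demo_type { DEMO_COMMIT = 1, DEMO_BLOB };

#define DEMO_SKIP_HASH_CHECK (1u << 0)

struct demo_object {
	enum demo_type type;
	int parsed;
};

static int demo_content_reads; /* times we had to read object contents */

static struct demo_object *demo_parse(struct demo_object *obj, unsigned flags)
{
	assert(obj);

	/* Already parsed: nothing to do beyond finding the struct. */
	if (obj->parsed)
		return obj;

	/*
	 * A blob has no outgoing links, so with the hash check skipped
	 * knowing its type is enough and the contents stay untouched.
	 */
	if (obj->type == DEMO_BLOB && (flags & DEMO_SKIP_HASH_CHECK)) {
		obj->parsed = 1;
		return obj;
	}

	/* Everything else in this toy: read the contents once. */
	demo_content_reads++;
	obj->parsed = 1;
	return obj;
}
```

In this model, parsing a blob with the flag set leaves `demo_content_reads` at zero, and parsing the same commit twice reads its contents only the first time.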

Skipping the hash check might seem like a bad idea for a repack, but
it's what we already do for blobs found via traversing. A disk repack
uses the much cheaper pack idx CRC for exactly this purpose: to avoid
expanding objects unnecessarily.

-Peff