Re: [PATCH 08/17] builtin/pack-objects.c: --cruft without expiration

On Tue, Dec 07, 2021 at 10:17:28AM -0500, Derrick Stolee wrote:
> On 11/29/2021 5:25 PM, Taylor Blau wrote:
> > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> > +static int add_cruft_object_entry(const struct object_id *oid, enum object_type type,
> > +				  struct packed_git *pack, off_t offset,
> > +				  const char *name, uint32_t mtime)
> > +{
> > +	struct object_entry *entry;
> > +
> > +	display_progress(progress_state, ++nr_seen);
>
> I don't love the global nr_seen here, but it is pervasive through the
> file. OK.

Yeah; this is how all of the existing progress code works in
pack-objects.

> > +	entry = packlist_find(&to_pack, oid);
> > +	if (entry) {
> > +		if (name) {
> > +			entry->hash = pack_name_hash(name);
> > +			entry->no_try_delta = name && no_try_delta(name);
>
> This is already in an "if (name)" block, so "name &&" isn't needed.

Thanks; this is a copy-and-paste from add_object_entry(), where we
aren't in a conditional on "name". We could also fold the conditional on
whether or not name is NULL into no_try_delta itself, since all existing
calls look like "name && no_try_delta(name)".

So adding something like:

    if (!name)
      return 0;

to the beginning of no_try_delta()'s implementation would allow us to
get rid of the handful of "name &&"s. But I'm trying to avoid touching
other parts of pack-objects as much as I can, so I'll hold off for now.

> > +		}
> > +	} else {
> > +		if (!want_object_in_pack(oid, 0, &pack, &offset))
> > +			return 0;
> > +		if (!pack && type == OBJ_BLOB && !has_loose_object(oid)) {
> > +			/*
> > +			 * If a traversed tree has a missing blob then we want
> > +			 * to avoid adding that missing object to our pack.
> > +			 *
> > +			 * This only applies to missing blobs, not trees,
> > +			 * because the traversal needs to parse sub-trees but
> > +			 * not blobs.
> > +			 *
> > +			 * Note we only perform this check when we couldn't
> > +			 * already find the object in a pack, so we're really
> > +			 * limited to "ensure non-tip blobs which don't exist in
> > +			 * packs do exist via loose objects". Confused?
> > +			 */
> > +			return 0;
> > +		}
> > +
> > +		entry = create_object_entry(oid, type, pack_name_hash(name),
> > +					    0, name && no_try_delta(name),
> > +					    pack, offset);
> > +	}
> > +
> > +	if (mtime > oe_cruft_mtime(&to_pack, entry))
> > +		oe_set_cruft_mtime(&to_pack, entry, mtime);
> > +	return 1;
>
> I was confused at this "return 1" here, while other cases return 0.
>
> It turns out that there are multiple methods in this file that have
> different semantics: add_loose_object() and add_object_entry_from_pack()
> are both called from iterators where "return 1" means "stop iterating"
> so they return 0 always. add_object_entry_from_bitmap() is used to
> iterate over a bitmap and "return 1" means "include this object".
>
> However, the return code for add_cruft_object_entry() is never used,
> so it should probably return void or swap the meanings to have nonzero
> mean an error occurred.

Yes, exactly. And thanks for tracing out both of the different
meanings/interpretations of these add_xyz_entry() functions. As you can
imagine, this implementation is copy-and-pasted from add_object_entry(),
which was specialized for this use here. At the time, I gave some effort
towards trying to share more code with add_object_entry() for this
special case, but it ended up being pretty awkward, hence the separate
implementation.

Ironically, add_object_entry()'s return code is also unused, so we could
probably clean that up, too. But like the above, I'll avoid it for now
in an effort to touch as little of pack-objects in this patch as I can.
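
To make the two conventions from the quoted review concrete, here is a
standalone sketch. None of these names exist in pack-objects; they are
hypothetical stand-ins that only contrast "nonzero stops the iteration"
with "1 means include this item".

```c
#include <stddef.h>

/* Convention 1: iterator callback; nonzero means "stop iterating". */
typedef int (*each_fn)(int value, void *data);

static int for_each_value(const int *values, size_t nr, each_fn fn, void *data)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		int ret = fn(values[i], data);
		if (ret)
			return ret;	/* callback asked us to stop */
	}
	return 0;
}

/* Always returns 0 so that the iteration visits every value. */
static int count_value(int value, void *data)
{
	(void)value;
	(*(size_t *)data)++;
	return 0;
}

/* Convention 2: predicate; 1 means "include this value". */
static int want_value(int value)
{
	return value % 2 == 0;
}
```

A function whose return value no caller reads fits neither convention,
which is the argument for making it return void.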

> > +static void mark_pack_kept_in_core(struct string_list *packs, unsigned keep)
> > +{
> > +	struct string_list_item *item = NULL;
> > +	for_each_string_list_item(item, packs) {
> > +		struct packed_git *p = item->util;
> > +		if (!p)
> > +			die(_("could not find pack '%s'"), item->string);
>
> Interesting that this is a potential issue. We are expecting the pack
> to be loaded before we get here. Is this more because some packs might
> not actually load, but it's fine as long as we don't mark them as kept?

Not quite "loaded" (though any pack structures that we look at by this
point will be fully "loaded"). Instead, we're making sure that all of
the pack names we read from stdin could be matched to packs that we
found in the repository (i.e., that we produce an appropriate error
message if we found "pack-does-not-exist.pack" on stdin).

This is all because we process input from stdin in two phases:

  - First, read all of the input into two string_lists, one for the
    packs we're about to discard (anything that starts with '-'), and
    another for all of the "fresh" packs (i.e., anything that we're not
    going to discard).

  - Then, loop through all of the packed_git structs we have, querying
    both of the aforementioned string lists for input that matches each
    pack's `pack_name` field, and setting the `->util` pointer of the
    matching string_list_item appropriately.

Following those two steps, any list entries that have a NULL util
pointer correspond with bogus input, so we want to call die() there.
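
A rough standalone sketch of the second phase and the NULL-util check
follows. The struct and function names are hypothetical, simplified
stand-ins for packed_git and string_list_item, not the code in the
patch, and phase one (reading names from stdin) is assumed to have
already filled in the items.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for packed_git and string_list_item. */
struct pack_sketch {
	const char *pack_name;
	struct pack_sketch *next;
};

struct item_sketch {
	const char *string;
	void *util;	/* set to the matching pack in phase two */
};

/*
 * Phase two: walk every known pack and point the matching item's
 * util at it. Items start out with util == NULL.
 */
static void match_packs(struct item_sketch *items, size_t nr,
			struct pack_sketch *packs)
{
	struct pack_sketch *p;
	size_t i;

	for (p = packs; p; p = p->next)
		for (i = 0; i < nr; i++)
			if (!strcmp(items[i].string, p->pack_name))
				items[i].util = p;
}

/* Return the first name that matched no pack, or NULL if all matched. */
static const char *first_bogus_name(const struct item_sketch *items, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (!items[i].util)
			return items[i].string;
	return NULL;
}
```

In the patch, the equivalent of a non-NULL result from
first_bogus_name() is the die() call in mark_pack_kept_in_core().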

> > +		p->pack_keep_in_core = keep;
> > +	}
> > +}
> ...
> > +static void read_cruft_objects(void)
> > +{
> > +	struct strbuf buf = STRBUF_INIT;
> > +	struct string_list discard_packs = STRING_LIST_INIT_DUP;
> > +	struct string_list fresh_packs = STRING_LIST_INIT_DUP;
> > +	struct packed_git *p;
> > +
> > +	ignore_packed_keep_in_core = 1;
>
> Here is a global that we are suddenly changing. Should we not be
> returning it to its initial state when this method is complete?

We could, although it won't matter in practice, because we'll want to
keep that setting around for our traversal, after which point
pack-objects will exit.

> > +static int option_parse_cruft_expiration(const struct option *opt,
> > +					 const char *arg, int unset)
> > +{
> > +	if (unset) {
> > +		cruft = 0;
>
> This unassignment of 'cruft' when cruft-expiration is unset with
> --no-cruft-expiration seems odd. I would expect
>
> 	git pack-objects --cruft --no-cruft-expiration
>
> to still make a cruft pack, but not expire anything. It seems that
> your code here makes --no-cruft-expiration disable the --cruft option.

Hmm. I could see compelling reasoning that goes both ways. On the one
hand, `--no-cruft-expiration` (to me, at least) seems to imply "set
`--cruft-expiration` to 'never'". On the other hand, having it disable
`--cruft` matches our convention that `--no`-prefixed options unset some
value. This implementation takes the latter approach, though we could
easily change it to set the cruft expiration to "never".

I don't have a strong opinion about which is better, so I'm happy to do
either if you have a better sense of which behavior is less surprising.
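
For comparison, the "never" alternative might look roughly like the
sketch below. This is a hypothetical variant, not what the patch does:
it drops the `opt` parameter, uses stand-in globals, substitutes
strtoul() for approxidate(), and assumes that an expiration of 0 means
"never expire".

```c
#include <stddef.h>
#include <stdlib.h>

/* Stand-ins for the globals in the patch. */
static int cruft;
static unsigned long cruft_expiration;

/*
 * Hypothetical variant of the callback: --no-cruft-expiration leaves
 * "cruft" untouched and only clears the expiration. Assumes that an
 * expiration of 0 means "never expire". strtoul() stands in for
 * approxidate(), which is not reproduced here.
 */
static int parse_cruft_expiration_sketch(const char *arg, int unset)
{
	if (unset) {
		cruft_expiration = 0;
		return 0;
	}
	cruft = 1;
	if (arg)
		cruft_expiration = strtoul(arg, NULL, 10);
	return 0;
}
```

With this variant, `--cruft --no-cruft-expiration` would still produce a
cruft pack, since only the expiration is reset.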

> > +		cruft_expiration = 0;
> > +	} else {
> > +		cruft = 1;
> > +		if (arg)
> > +			cruft_expiration = approxidate(arg);
> > +	}
> > +	return 0;
> > +}
> ..
> > +		OPT_BOOL(0, "cruft", &cruft, N_("create a cruft pack")),
> > +		OPT_CALLBACK_F(0, "cruft-expiration", NULL, N_("time"),
> > +		  N_("expire cruft objects older than <time>"),
> > +		  PARSE_OPT_OPTARG, option_parse_cruft_expiration),
>
> > -static int has_loose_object(const struct object_id *oid)
> > +int has_loose_object(const struct object_id *oid)
> >  {
> >  	return check_and_freshen(oid, 0);
> >  }
>
> I'm surprised this hasn't been modified to use a repository pointer.
> Adding another caller here isn't too much debt, though.

Yeah, check_and_freshen() doesn't have a variant that takes a
repository pointer. Good #leftoverbits, I guess!

> > +int has_loose_object(const struct object_id *);
> > +
> >  void assert_oid_type(const struct object_id *oid, enum object_type expect);
>
> ...
>
> > +	test_expect_success "unreachable packed objects are packed (expire $expire)" '
> > +		git init repo &&
> > +		test_when_finished "rm -fr repo" &&
> > +		(
> > +			cd repo &&
> > +
> > +			test_commit packed &&
> > +			git repack -Ad &&
> > +			test_commit other &&
> > +
> > +			git rev-list --objects --no-object-names packed.. >objects &&
> > +			keep="$(basename "$(ls $packdir/pack-*.pack)")" &&
> > +			other="$(git pack-objects --delta-base-offset \
> > +				$packdir/pack <objects)" &&
> > +			git prune-packed &&
> > +
> > +			test-tool chmtime --get -100 "$packdir/pack-$other.pack" >expect &&
>
> I am missing how this test creates _unreachable_ objects. I would expect removal of
> some refs or a 'git reset --hard' somewhere. What am I missing?

For this and the other tests the so-called "unreachable" objects are
technically reachable, but we can treat them as unreachable by putting
them in the "discard" packs list (or by not mentioning them at all to
`git pack-objects --cruft`).

> > +			# remove the unreachable tree, but leave the commit
> > +			# which has it as its root tree in-tact
>
> nit: "intact" is one word.

Thanks; fixed here and in the other test which was added by this commit.

Thanks,
Taylor


