Re: [PATCH v2 3/8] builtin/pack-objects.c: add '--stdin-packs' option

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Feb 16, 2021 at 06:46:59PM -0500, Jeff King wrote:
> Do we use this partial traversal to impact the write order at all? That
> would be a nice-to-have, but I suspect that just concatenating the packs
> (presumably by descending mtime) ends up with a similar result.

We don't; the objects are written in pack order. In the version of the
patch you reviewed, the order of packs was determined by their hash (due
to the string_list_sort()), but the version I just prepared re-sorts by
mtime.

It's kind of gross, since we need to use QSORT directly on the
string_list internals in order to have access to the ->util field of the
string_list_items (string_list_sort() only lets you compare strings
directly for obvious reasons).

I added a comment describing this hack.

> > +--stdin-packs::
> > +	Read the basenames of packfiles from the standard input, instead
> > +	of object names or revision arguments. The resulting pack
> > +	contains all objects listed in the included packs (those not
> > +	beginning with `^`), excluding any objects listed in the
> > +	excluded packs (beginning with `^`).
> > ++
> > +Incompatible with `--revs`, or options that imply `--revs` (such as
> > +`--all`), with the exception of `--unpacked`, which is compatible.
>
> I know you say "basename" here, but I wonder if it is worth giving an
> example (`pack-1234abcd.pack`) to make it clear in what form we expect
> it. Or possibly something in the `EXAMPLES` section.

Good idea, thanks.

> > --- a/builtin/pack-objects.c
> > +++ b/builtin/pack-objects.c
> > @@ -2979,6 +2979,164 @@ static int git_pack_config(const char *k, const char *v, void *cb)
> >  	return git_default_config(k, v, cb);
> >  }
> >
> > +static int stdin_packs_found_nr;
> > +static int stdin_packs_hints_nr;
>
> I scratched my head at these until I looked further in the code. They're
> the counters for the trace output. Might be worth a brief comment above
> them. (I do approve of adding this kind of trace debugging info; I'm
> pretty accustomed to using gdb or adding one-off debug statements, but
> we really could do a better job in general of making these kinds of
> internals visible to mere mortal admins).

Good call.

> > +static int add_object_entry_from_pack(const struct object_id *oid,
> > +				      struct packed_git *p,
> > +				      uint32_t pos,
> > +				      void *_data)
> > +{
> > +	struct rev_info *revs = _data;
> > +	struct object_info oi = OBJECT_INFO_INIT;
> > +	off_t ofs;
> > +	enum object_type type;
> > +
> > +	display_progress(progress_state, ++nr_seen);
> > +
> > +	ofs = nth_packed_object_offset(p, pos);
> > +
> > +	oi.typep = &type;
> > +	if (packed_object_info(the_repository, p, ofs, &oi) < 0)
> > +		die(_("could not get type of object %s in pack %s"),
> > +		    oid_to_hex(oid), p->pack_name);
>
> Calling out for other reviewers: the oi.typep field will be filled in
> the with _real_ type of the object, even if it's a delta. This is as
> opposed to the return value of packed_object_info(), which may be
> OFS_DELTA or REF_DELTA.
>
> And that real type is what we want here:
>
> > +	else if (type == OBJ_COMMIT) {
> > +		/*
> > +		 * commits in included packs are used as starting points for the
> > +		 * subsequent revision walk
> > +		 */
> > +		add_pending_oid(revs, NULL, oid, 0);
> > +	}
>
> And later when we call create_object_entry().

:-). Yes indeed. As I'm sure that you will recall, the pack-objects
code _does not_ behave well when you give it the packed type of an
object (which is not entirely unexpected, since the pack-objects code
only operates on the true type, so passing the packed type--as I did
when originally writing this patch--is a bug).

> I wondered whether it would be worth adding other objects we might find,
> like trees, in order to increase our traversal. But that doesn't make
> any sense. The whole point is to find the paths, which come from
> traversing from the root trees. And we can only find the root trees by
> starting at commits. Adding any random tree we found would defeat the
> purpose (most of them are sub-trees and would give us a useless partial
> path).

Right.

> Should we avoid adding the commit as a tip for walking if it won't end
> up in the resulting pack? I.e., should we check these:
>
> > +	if (have_duplicate_entry(oid, 0))
> > +		return 0;
> > +
> > +	if (!want_object_in_pack(oid, 0, &p, &ofs))
> > +		return 0;
>
> ...first? I guess it probably doesn't matter too much since we'd
> truncate the traversal as soon as we saw it was in a kept pack anyway.

I agree it doesn't make a difference, but I think placing the extra
guards first makes it easier to read (since the reader doesn't have to
consider how the subsequent traversal would treat it).

> > +static void show_commit_pack_hint(struct commit *commit, void *_data)
> > +{
> > +}
>
> Nothing to do here, since commits don't have a name field. Makes sense.

Yeah. I added a comment to say the same thing, just for extra clarity.

>
> > +static void show_object_pack_hint(struct object *object, const char *name,
> > +				  void *_data)
> > +{
> > +	struct object_entry *oe = packlist_find(&to_pack, &object->oid);
> > +	if (!oe)
> > +		return;
> > +
> > +	/*
> > +	 * Our 'to_pack' list was constructed by iterating all objects packed in
> > +	 * included packs, and so doesn't have a non-zero hash field that you
> > +	 * would typically pick up during a reachability traversal.
> > +	 *
> > +	 * Make a best-effort attempt to fill in the ->hash and ->no_try_delta
> > +	 * here using a now in order to perhaps improve the delta selection
> > +	 * process.
> > +	 */
> > +	oe->hash = pack_name_hash(name);
> > +	oe->no_try_delta = name && no_try_delta(name);
> > +
> > +	stdin_packs_hints_nr++;
> > +}
>
> But for actual objects, we do fill in the hash. I wonder if it's
> possible for oe->hash to have been already filled. I don't think it
> really matters, though. Any value we get is equally valid, so
> overwriting is OK in that case.

Right.

> > +	trace2_data_intmax("pack-objects", the_repository, "stdin_packs_found",
> > +			   stdin_packs_found_nr);
>
> I wonder if it makes sense to report the actual set of packs via trace
> (obviously not as an int, but as a list). That's less helpful for
> debugging pack-objects, if you just fed it the input anyway, but if you
> were debugging "git repack --geometric" it might be useful to see which
> packs it thought were which (though arguably that would be a useful
> trace in builtin/repack.c instead).

I could see an argument in both ways. I'd rather pass for now until we
have a clearer need for it.

> [passing --unpacked to the namehash traversal]
>
> I'm OK to consider that an implementation detail for now, though. We can
> change it later without impacting the interface.

Agreed.

> > +		if (rev_list_unpacked)
> > +			add_unreachable_loose_objects();
>
> Despite the name, that function is adding both reachable and unreachable
> ones. So it is doing what you want. It might be worth renaming, but it's
> not too big a deal since it's local to this file.

Yeah, I tend to err on the side of "it's fine as-is" since this isn't
exposed outside of pack-objects internals. If you feel strongly I'm
happy to change it, but I suspect you don't.

Thanks,
Taylor



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux