Re: [PATCH 4/9] get_short_oid: sort ambiguous objects by type, then SHA-1

Derrick Stolee <stolee@xxxxxxxxx> · Tue, 1 May 2018 08:26:47 -0400




On 5/1/2018 7:27 AM, Ævar Arnfjörð Bjarmason wrote:
On Tue, May 01 2018, Derrick Stolee wrote:

On 4/30/2018 6:07 PM, Ævar Arnfjörð Bjarmason wrote:
Since we show the commit data in the output that's nicely aligned once
we sort by object type. The decision to show tags before commits is
pretty arbitrary, but it's much less likely that we'll display a tag,
so if there is one it makes sense to show it first.
Here's a non-arbitrary reason: the object types are ordered
topologically (ignoring self-references):

tag -> commit, tree, blob
commit -> tree
tree -> blob
Thanks. I'll add a patch with that comment to v2.

@@ -421,7 +451,12 @@ static int get_short_oid(const char *name, int len, struct object_id *oid,
   			ds.fn = NULL;
     		advise(_("The candidates are:"));
-		for_each_abbrev(ds.hex_pfx, show_ambiguous_object, &ds);
+		for_each_abbrev(ds.hex_pfx, collect_ambiguous, &collect);
+		QSORT(collect.oid, collect.nr, sort_ambiguous);
I was wondering how the old code sorted by SHA even when the ambiguous
objects were loaded from different sources (multiple pack-files, loose
objects). Turns out that for_each_abbrev() does its own sort after
collecting the SHAs and then calls the given function pointer only
once per distinct object. This avoids multiple instances of the same
object, which may appear multiple times across pack-files.

I only ask because now we are doing two sorts. I wonder if it would be
more elegant to provide your sorting algorithm to for_each_abbrev()
and let it call show_ambiguous_object as before.

Another question is if we should use this sort generally for all calls
to for_each_abbrev(). The only other case I see is in
builtin/revparse.c.
When preparing v2 I realized how confusing this was, so I'd added this
to the commit message of my WIP re-roll which should explain this:

     A note on the implementation: I started out with something much
     simpler which just replaced oid_array_sort() in sha1-array.c with a
     custom sort function before calling oid_array_for_each_unique(). But
     then dumbly noticed that it doesn't work because the output function
     was tangled up with the code added in fad6b9e590 ("for_each_abbrev:
     drop duplicate objects", 2016-09-26) to ensure we don't display
     duplicate objects.
     
     That's why we're doing two passes here, first we need to sort the list
     and de-duplicate the objects, then sort them in our custom order, and
     finally output them without re-sorting them. I suppose we could also
     make oid_array_for_each_unique() maintain a hashmap of emitted
     objects, but that would increase its memory profile and wouldn't be
     worth the complexity for this one-off use-case,
     oid_array_for_each_unique() is used in many other places.

How would sorting in our custom order before de-duplicating fail the 
de-duplication? We will still pair identical OIDs as consecutive 
elements and oid_array_for_each_unique only cares about consecutive 
elements having distinct OIDs, not lex-ordered OIDs.

Perhaps the noise is because we rely on oid_array_sort() to mark the 
array as sorted inside oid_array_for_each_unique(), but that could be 
remedied by calling our QSORT() inside for_each_abbrev() and marking the 
array as sorted before calling oid_array_for_each_unique().

(Again, my comments are not meant to block this series.)

Thanks,
-Stolee