Re: [PATCH v2 06/11] get_short_oid: sort ambiguous objects by type, then SHA-1

Derrick Stolee <stolee@xxxxxxxxx> · Tue, 1 May 2018 09:44:52 -0400






On 5/1/2018 9:39 AM, Ævar Arnfjörð Bjarmason wrote:
On Tue, May 01 2018, Derrick Stolee wrote:

From: Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx>

Here is what I mean by sorting during for_each_abbrev(). This seems to work for
me, so I don't know what the issue is with this one-pass approach.
[...]
+static int sort_ambiguous(const void *a, const void *b)
+{
+	int a_type = oid_object_info(a, NULL);
+	int b_type = oid_object_info(b, NULL);
+	int a_type_sort;
+	int b_type_sort;
+
+	/*
+	 * Sorts by hash within the same object type, just as
+	 * oid_array_for_each_unique() would do.
+	 */
+	if (a_type == b_type)
+		return oidcmp(a, b);
+
+	/*
+	 * Between object types show tags, then commits, and finally
+	 * trees and blobs.
+	 *
+	 * The object_type enum is commit, tree, blob, tag, but we
+	 * want tag, commit, tree blob. Cleverly (perhaps too
+	 * cleverly) do that with modulus, since the enum assigns 1 to
+	 * commit, so tag becomes 0.
+	 */
+	a_type_sort = a_type % 4;
+	b_type_sort = b_type % 4;
+	return a_type_sort > b_type_sort ? 1 : -1;
+}
+
  static int get_short_oid(const char *name, int len, struct object_id *oid,
  			  unsigned flags)
  {
@@ -451,6 +479,9 @@ int for_each_abbrev(const char *prefix, each_abbrev_fn fn, void *cb_data)
  	find_short_object_filename(&ds);
  	find_short_packed_object(&ds);

+	QSORT(collect.oid, collect.nr, sort_ambiguous);
+	collect.sorted = 1;
+
Yes this works. You're right. I wasn't trying to intentionally omit
stuff in my recent 878t93zh60.fsf@xxxxxxxxxxxxxxxxxxx, I'd just written
this code some days ago and forgotten why I did what I was doing (and
this is hard to test for), but it's all coming back to me now.

The actual requirement for oid_array_for_each_unique() working properly
is that you've got to feed it in hash order,

To work properly, duplicate entries must be consecutive. Since duplicate 
entries have the same type, our sort satisfies this condition.

but my new sort_ambiguous()
still does that (barring any SHA-1 collisions, at which point we have
bigger problems), so two passes aren't needed. So yes, this apporoach
works and is one-pass.

But that's just an implementation detail of the current sort method,
when I wrote this I was initially playing with other sort orders,
e.g. sorting SHAs regardless of type by the mtime of the file I found
them in. With this approach I'd start printing duplicates if I changed
the internals of sort_ambiguous() like that.

That makes sense.

But I think it's extremely implausible that we'll start sorting things
like that, so I'll just take this method of doing it and add some
comment saying we must hashcmp() the entries in our own sort function
for the de-duplication to work, I don't see us ever changing that.

Sounds good.

Thanks,
-Stolee