Re: [PATCH v2 02/30] oid-array: teach oid-array to handle multiple kinds of oids

Linus Arver <linusa@xxxxxxxxxx> · Thu, 15 Feb 2024 16:16:57 -0800

"Eric W. Biederman" <ebiederm@xxxxxxxxx> writes:

> Linus Arver <linusa@xxxxxxxxxx> writes:
>
>> "Eric W. Biederman" <ebiederm@xxxxxxxxx> writes:
>>
>>> From: "Eric W. Biederman" <ebiederm@xxxxxxxxxxxx>
>>>
>>> While looking at how to handle input of both SHA-1 and SHA-256 oids in
>>> get_oid_with_context, I realized that the oid_array in
>>> repo_for_each_abbrev might have more than one kind of oid stored in it
>>> simultaneously.
>>>
>>> Update to oid_array_append to ensure that oids added to an oid array
>>
>> s/Update to/Update
>>
>>> always have an algorithm set.
>>>
>>> Update void_hashcmp to first verify two oids use the same hash algorithm
>>> before comparing them to each other.
>>>
>>> With that oid-array should be safe to use with different kinds of
>>
>> s/oid-array/oid_array
>>
>>> oids simultaneously.
>>>
>>> Signed-off-by: "Eric W. Biederman" <ebiederm@xxxxxxxxxxxx>
>>> ---
>>>  oid-array.c | 12 ++++++++++--
>>>  1 file changed, 10 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/oid-array.c b/oid-array.c
>>> index 8e4717746c31..1f36651754ed 100644
>>> --- a/oid-array.c
>>> +++ b/oid-array.c
>>> @@ -6,12 +6,20 @@ void oid_array_append(struct oid_array *array, const struct object_id *oid)
>>>  {
>>>  	ALLOC_GROW(array->oid, array->nr + 1, array->alloc);
>>>  	oidcpy(&array->oid[array->nr++], oid);
>>> +	if (!oid->algo)
>>> +		oid_set_algo(&array->oid[array->nr - 1], the_hash_algo);
>>
>> How come we can't set oid->algo _before_ we call oidcpy()? It seems odd
>> that we do the copy first and then modify what we just copied after the
>> fact, instead of making sure that the thing we want to copy is correct
>> before doing the copy.
>>
>> But also, if we are going to make the oid object "correct" before
>> invoking oidcpy(), we might as well do it when the oid is first
>> created/used (in the caller(s) of this function). I don't demand that
>> you find/demonstrate where all these places are in this series (maybe
>> that's a hairy problem to tackle?), but it seems cleaner in principle to
>> fix the creation of oid objects instead of having to make oid users
>> clean up their act like this after using them.
>
> There is a hairy problem here.
>
> I believe for reasons of simplicity when the algo field was added to
> struct object_id it was allowed to be zero for users that don't
> particularly care about the hash algorithm, and are happy to use the git
> default hash algorithm.
>
> Me experience working on this set of change set showed that there
> are oids without their algo set in all kinds of places in the tree.

Ah, I see. Thanks for the clarification.

> I could not think of any sure way to go through the entire tree
> and find those users, so I just made certain that oid array handled
> that case.
>
> I need algo to be set properly in the oids in the oid array so I
> could extend oid_array to hold multiple kinds of oids at the same
> time.  To allow multiple kinds of oids at the same time void_hashcmp
> needs a simple and reliable way to tell what the algorithm is of
> any given oid.

Makes sense.

>>
>>>  	array->sorted = 0;
>>>  }
>>>  
>>> -static int void_hashcmp(const void *a, const void *b)
>>> +static int void_hashcmp(const void *va, const void *vb)
>>>  {
>>> -	return oidcmp(a, b);
>>> +	const struct object_id *a = va, *b = vb;
>>> +	int ret;
>>> +	if (a->algo == b->algo)
>>> +		ret = oidcmp(a, b);
>>
>> This makes sense (per the commit message description) ...
>>
>>> +	else
>>> +		ret = a->algo > b->algo ? 1 : -1;
>>
>> ... but this seems to go against it? I thought you wanted to only ever
>> compare hashes if they were of the same algo? It would be good to add a
>> comment explaining why this is OK (we are no longer doing a byte-by-byte
>> comparison of these oids any more here like we do for oidcmp() above
>> which boils down to calling memcmp()).
>
> So the goal of this change is for oid_array to be able to hold hashes
> from multiple algorithms at the same time.
>
> A key part of oid_array is oid_array_sort that allows functions such
> as oid_array_lookup and oid_array_for_each_unique.
>
> To that end there needs to be a total ordering of oids.
>
> The function oidcmp is only defined when two oids are of the same
> algorithm, it does not even test to detect the case of comparing
> mismatched algorithms.
>
> Therefore to get a total ordering of oids.  I must use oidcmp
> when the algorithm is the same (the common case) or simply order
> the oids by algorithm when the algorithms are different.
>
>
>
> All of this is relevant to get_oid_with_context as get_oid_with_context
> and it's helper functions contain the logic that determines what
> we do when a hex string that is ambiguous is specified.
>
> In the ambiguous case all of the possible candidates are placed in
> an oid_array, sorted and then displayed.
>
>
> With a repository that can knows both the sha1 and the sha256 oid
> of it's objects it is possible for a short oid to match both
> some sha1 oids and some sha256 oids.

Thanks for the additional clarification. I think a lot of this could
have been added as comments or perhaps in the commit message. The "short
id can match both sha1 or sha256" is a very real scenario we need to
consider in the sha1+sha256 world, indeed.

>>> +	return ret;
>>
>> Also, in terms of style I think the "early return for errors" style
>> would be simpler to read. I.e.
>>
>>     if (a->algo > b->algo)
>>         return 1;
>>
>>     if (a->algo < b->algo)
>>         return -1;
>>
>>     return oidcmd(a, b);
>>
>
> I can see doing:
> 	if (a->algo == b->algo)
>         	return oidcmp(a,b);
>
> 	if (a->algo > b->algo)
>         	return 1;
>         else
>         	return -1;
>
> Or even:
> 	if (a->algo == b->algo)
>         	return oidcmp(a,b);
>
> 	return a->algo - b->algo;
>
> Although I suspect using subtraction is a bit too clever.

Agreed.

> Comparing for less than, and greater than, and then assuming
> the values are equal hides what is important before calling
> oidcmp which is that the algo values are equal.

I would still prefer the "early return for errors" style even in this
case. This is because I much prefer to have the question "how can things
go wrong?" answered first, and dealt with, such that as I read
top-to-bottom I am left with less things I have to consider to
understand the "happy path". WRT emphasizing the "algos equal each
other" concern, a simple comment like

     /* Only compare equal algorithms. */
     return oidcmp(a, b);

seems sufficient.

But, of course it is possible (perhaps even likely) that my preferred
style is in the minority. Up to you. Thanks.