Re: [PATCH] ref-filter: sort numerically when ":size" is used

Junio C Hamano <gitster@xxxxxxxxx> · Fri, 01 Sep 2023 09:43:07 -0700

Kousik Sanagavarapu <five231003@xxxxxxxxx> writes:

> Atoms like "raw" and "contents" have a ":size" option which can be used
> to know the size of the data. Since these atoms have the cmp_type
> FIELD_STR, they are sorted alphabetically from 'a' to 'z' and '0' to
> '9'. Meaning, even when the ":size" option is used and what we
> ultimatlely have is numbers, we still sort alphabetically.

There are other cmp_types already defined, like ULONG and TIME.  How
are they used and affect the comparison?  Naively, :size sounds like
a good candidate to compare as ULONG, as it cannot be negative even
though 0 is a valid size.

I understand and agree with the motivation, but the implementation
looks puzzling.

> diff --git a/ref-filter.c b/ref-filter.c
> index 1bfaf20fbf..5d7bea5f23 100644
> --- a/ref-filter.c
> +++ b/ref-filter.c
> @@ -932,7 +932,13 @@ struct atom_value {
>  	ssize_t s_size;
>  	int (*handler)(struct atom_value *atomv, struct ref_formatting_state *state,
>  		       struct strbuf *err);
> -	uintmax_t value; /* used for sorting when not FIELD_STR */
> +
> +	/*
> +	 * Used for sorting when not FIELD_STR or when FIELD_STR but the
> +	 * sort should be numeric and not alphabetic.
> +	 */
> +	uintmax_t value;

This does not explain why we cannot make <anything>:size FIELD_ULONG
for <anything> that is of FIELD_STR type, though.  IOW, why such a
strange "when FIELD_STR but the sort should be numeric" is needed?
If you have a <size>, shouldn't it always be numeric?

IOW, when you notice that you need to set, say, u.contents.option of
an atom to C_LENGTH, shouldn't you set cmp_type of the atom to
FIELD_ULONG, somewhere in contents_atom_parser() and friends, and
everything should naturally follow, no?

It seems that support for other cmp_types are incomplete in the
current code.  There are FIELD_ULONG and FIELD_TIME defined, but
they do not appear to be used in any way, so the cmp_ref_sorting()
would need to be updated to make it actually pay attention to the
cmp_type and perform numeric comparison.

> @@ -1883,8 +1890,10 @@ static void grab_sub_body_contents(struct atom_value *val, int deref, struct exp
>  			v->s = strbuf_detach(&sb, NULL);
>  		} else if (atom->u.contents.option == C_BODY_DEP)
>  			v->s = xmemdupz(bodypos, bodylen);
> -		else if (atom->u.contents.option == C_LENGTH)
> -			v->s = xstrfmt("%"PRIuMAX, (uintmax_t)strlen(subpos));
> +		else if (atom->u.contents.option == C_LENGTH) {
> +			v->value = (uintmax_t)strlen(subpos);
> +			v->s = xstrfmt("%"PRIuMAX, v->value);
> +		}

We should take a note that all of these v->value are *per* *item*
that will be sorted.

> @@ -2986,7 +2996,7 @@ static int cmp_ref_sorting(struct ref_sorting *s, struct ref_array_item *a, stru
>  		cmp_detached_head = 1;
>  	} else if (s->sort_flags & REF_SORTING_VERSION) {
>  		cmp = versioncmp(va->s, vb->s);
> -	} else if (cmp_type == FIELD_STR) {
> +	} else if (cmp_type == FIELD_STR && !va->value && !vb->value) {

Two refs may point at an empty object with zero length, i.e.  for
them, !v->value is true, and another ref may point at a non-empty
object.  The two empty refs are compared with an algorithm different
from the algorithm used to compare the empty ref and the non-empty
ref.  Isn't this broken as a comparison function to be given to
QSORT(), which must be transitive (e.g. if A < B and B < C, then it
should be guaranteed that A < C and you do not have to compare A and
C)?

IOW, the choice of the comparison algorithm should not depend on an
attribute (like value or s) that is specific to the item being
compared.  Things like cmp_type that is defined at the used_atom
lefvel to make the sorting function stable, I would think.