On Fri, Dec 10 2021, Jeff King wrote:

> On Fri, Dec 10, 2021 at 11:22:59AM +0100, Ævar Arnfjörð Bjarmason wrote:
>
>> > Dropping the st_mult() does nothing to fix the actual problem (which is
>> > that this function should use a more appropriate type), but introduces
>> > new failure modes.
>>
>> Yes you're entirely right. I had some stupid blinders on while writing
>> this. FWIW I think I was experimenting with some local macros and
>> conflated a testing of the overflow of n*n in gdb with the caste'd
>> version, which you rightly point out here won't have the overflow issue
>> at all. Sorry.
>
> I'm not sure if this is helpful or not, but this is the minimal fix I
> came up with that runs the testcase I showed earlier. It's basically
> just swapping out "int" for "ssize_t" for any variables we use to index
> the arrays (though note a few are themselves held in arrays, and we have
> to cross some function boundaries).
>
> I won't be surprised if it doesn't hit all cases, or if it even hits a
> few it doesn't need to (e.g., should "phase" be dragged along with "i"
> and "j" in the first hunk?). I mostly did guess-and-check on the
> test-case, fixing whatever segfaulted and then running again until it
> worked. I didn't even really read the code very carefully.
>
> I think you _did_ do more of that careful reading, and broke down the
> refactorings into separate patches in your series. Which is good. So I
> think what we'd want is to pick out those parts of your series that end
> up switching the variable type. My goal in sharing this here is just to
> show that the end result of the fix can (and IMHO should) be around this
> same order of magnitude.
>
> [...]
>  void compute_assignment(int column_count, int row_count, int *cost,
> -			int *column2row, int *row2column);
> +			ssize_t *column2row, ssize_t *row2column);
>  
>  /* The maximal cost in the cost matrix (to prevent integer overflows). */
>  #define COST_MAX (1<<16)
> diff --git a/range-diff.c b/range-diff.c
> index cac89a2f4f..f1e1e27bf9 100644
> --- a/range-diff.c
> +++ b/range-diff.c
> @@ -308,9 +308,10 @@ static int diffsize(const char *a, const char *b)
>  static void get_correspondences(struct string_list *a, struct string_list *b,
>  				int creation_factor)
>  {
> -	int n = a->nr + b->nr;
> -	int *cost, c, *a2b, *b2a;
> -	int i, j;
> +	size_t n = a->nr + b->nr;
> +	int *cost, c;
> +	ssize_t *a2b, *b2a;
> +	size_t i, j;
>  
>  	ALLOC_ARRAY(cost, st_mult(n, n));
>  	ALLOC_ARRAY(a2b, n);

I think I was just chasing butterflies making this intmax_t at all.

I just submitted a v2, and explained that case in a bit more detail in
https://lore.kernel.org/git/RFC-cover-v2-0.5-00000000000-20211210T122901Z-avarab@xxxxxxxxx

I *think* it fixes all the cases we plausibly run into, i.e. storing the
"cost" in an "int" was enough, we just needed a size_t as an offset. It
passes the regression test you noted[3].
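
To make that concrete, here is a tiny standalone illustration (made-up
numbers, not code from the series or from range-diff.c): the offset into
the n*n cost matrix overflows an "int" long before the "cost" values
themselves do, which is roughly how the segfault comes about:

/*
 * Not from the git tree, just an illustration: with ~50k commits on
 * each side of the range the "int" offset into the n*n cost matrix
 * overflows (undefined behaviour, in practice a negative index and a
 * segfault), while a size_t offset is fine on a platform with a
 * 64-bit size_t.
 */
#include <stdio.h>
#include <stddef.h>

int main(void)
{
	int n = 50000 + 50000;		/* i.e. a->nr + b->nr */
	int i = n - 1, j = n - 1;	/* last cell of the matrix */

	int bad = i * n + j;		/* 100000 * 100000 overflows int */
	size_t ok = (size_t)i * n + j;	/* 9999999999 */

	printf("int offset:    %d\n", bad);
	printf("size_t offset: %zu\n", ok);
	return 0;
}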

The first thing I tried when hacking on this some months ago (I picked
these patches up again after running into the segfault again) was this
s/int/ssize_t/ change. I don't think using ssize_t like that is
portable; we'd need something like intmax_t if we needed this in
another context.

Firstly it's not standard C, it's just in POSIX; intmax_t is standard C
as of C99, and we have in-tree code that already depends on it (and
uintmax_t). But more importantly ssize_t is not "as big as size_t, just
signed" in POSIX: size_t is "no greater than the width of type
long"[1], and LONG_MAX is at least 2^31-1[2].

Whereas ssize_t is not a "signed size_t", but a type that stores
-1..SSIZE_MAX, and SSIZE_MAX has a minimum value of just 2^15-1. I.e. I
think on that basis some implementations could make it the same as a
"short int" under the hood. On my Linux system it's just mapped to the
longest available signed integer, but that doesn't seem to be a
portable assumption.

1. https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/sys_types.h.html
2. https://pubs.opengroup.org/onlinepubs/009696899/basedefs/limits.h.html
3. B.t.w., a thing I ended up ejecting out of this was a
   "test_commit_bulkier" helper which is N times faster than
   "test_commit_bulk": it just makes the same commit N times with the
   printf-repeating feature and feeds that to fast-import. But the test
   took so long in any case that I couldn't find a plausible way to get
   it in-tree.
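
To put rough numbers on that SSIZE_MAX point, here's a throwaway
program (again not something I'd propose for the tree) that prints what
a given platform picked; the portability argument is of course about
the guaranteed minimums noted in the output, not about what e.g. glibc
happens to do:

/*
 * Illustration only: print the local values of the limits discussed
 * above, next to the minimums the standards guarantee.
 */
#include <stdio.h>
#include <stdint.h>
#include <limits.h>	/* LONG_MAX, and SSIZE_MAX via POSIX */

int main(void)
{
	printf("SIZE_MAX   = %zu\n", (size_t)SIZE_MAX);
	printf("SSIZE_MAX  = %jd (POSIX minimum: 2^15-1)\n", (intmax_t)SSIZE_MAX);
	printf("LONG_MAX   = %ld (C minimum: 2^31-1)\n", LONG_MAX);
	printf("INTMAX_MAX = %jd (at least 2^63-1 since C99)\n", INTMAX_MAX);
	return 0;
}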