Re: [PATCH v2 1/1] diff.c: When appropriate, use utf8_strwidth()

Torsten Bögershausen <tboegi@xxxxxx> · Mon, 29 Aug 2022 19:54:25 +0200

On Mon, Aug 29, 2022 at 02:04:42PM +0200, Johannes Schindelin wrote:
> Hi Torsten,
> >
> > The choosen solution is to split code in diff.c like this
> >
> > strbuf_addf(&out, "%-*s", len, name);
> >
> > into something like this:
> >
> > size_t num_padding_spaces = 0;
> > // [snip]
> > if (len > utf8_strwidth(name))
> >     num_padding_spaces = len - utf8_strwidth(name);
> > strbuf_addf(&out, "%s", name);
> > if (num_padding_spaces)
> >     strbuf_addchars(&out, ' ', num_padding_spaces);
>
> ... this sounds like it would benefit from beinv refactored into a
> separate function, e.g. `strbuf_add_padded(buf, utf8string)`, both for
> readability as well as for self-documentation.

Yes, but:
All (tm) strbuf() functions use an unsigned size_t, and are not
tolerant against passing 0 as "do nothing".
A nicer solution (for this patch) could be a change like this:
Instead of

void strbuf_addchars(struct strbuf *sb, int c, size_t n)
{
        strbuf_grow(sb, n);
	memset(sb->buf + sb->len, c, n);
	strbuf_setlen(sb, sb->len + n);
}

We would find:
void strbuf_addchars(struct strbuf *sb, int c, ssize_t n)
{
        if (n <= 0)
	       return;
        strbuf_grow(sb, (size_t)n);
	memset(sb->buf + sb->len, c, (size_t)n);
	strbuf_setlen(sb, sb->len + (size_t)n);
}

I couldn't convince myself to do so.
Since it is mainly diff.c that needs this adjustment/padding of strings,
I coulnd't convince myself to write another function in strbuf.c

>
> Also, it is unclear to me why we have to evaluate `utf8_strwidth()`
> _twice_ and why we do not assign the result to a variable called `width`
> and then have a conditional like
>
> 	if (width < len) /* pad to `len` columns */
> 		strbuf_addchars(&out, ' ' , len - width);
>
> instead. That would sound more logical to me.

This is caused by the logic in diff.c:
  /*
   * Find the longest filename and max number of changes
   */
   for (i = 0; (i < count) && (i < data->nr); i++) {
       struct diffstat_file *file = data->files[i];
       [snip]
       len = utf8_strwidth(file->print_name);
       if (max_width < len)
          max_width = len;
// and later
    /*
     * From here name_width is the width of the name area,
     * and graph_width is the width of the graph area.
     * max_change is used to scale graph properly.
     */
    for (i = 0; i < count; i++) {
    /*
     * "scale" the filename
     */
     // TB: Which means either shortening it with ...
     // Or padding it, if needed, and here we need
     // another
     name_len = utf8_strwidth(name);

>
> Besides, since the simple change from `strlen()` to `utf8_strwidth()` is
> so different from changing `strbuf_addf(...)`, I would prefer to see them
> split into two patches.

Hm, that is a possiblity. Seems to ease the burden for reviewers.

>
> >
> > Tests:
> > Two things need to be tested:
> > - The calculation of the maximum width
> > - The calculation of num_padding_spaces
> >
> > The name "textfile" is changed into "textfilë", both have a width of 8.
> > If strlen() was used, to get the maximum width, the shorter "binfile" would
> > have been mis-aligned:
> >  binfile   |  [snip]
> >  textfilë | [snip]
> >
> > If only "binfile" would be renamed into "binfilë":
> >  binfilë |  [snip]
> >  textfile | [snip]
> >
> > In order to verify that the width is calculated correctly everywhere,
> > "binfile" is renamed into "binfïlë", giving 2 bytes more in strlen()
> > "textfile" is renamed into "textfilë", 1 byte more in strlen(),
> > and the updated t4012-diff-binary.sh checks the correct aligment:
> >  binfïlë  | [snip]
> >  textfilë | [snip]
>
> I wonder whether you can change only _one_ name and still verify the
> correctness. When you make two changes at the same time, it is always
> possible for one change to "cancel out" the other one, and therefore it is
> harder to reason about the correctness of your patch.

Nee, I have a hard time to see how a +/- 1 can "cancel out" a +/- 2.
But I may improve the commit message, to make that more clear.

>
> Better keep it simple and change only one instance (personally,
> I would have changed two letters in the longer one).

That is certainly doable.

>
> >
> > Reported-by: Alexander Meshcheryakov <alexander.s.m@xxxxxxxxx>
> > Signed-off-by: Torsten Bögershausen <tboegi@xxxxxx>
> > ---
> >  diff.c                 | 37 +++++++++++++++++++++++--------------
> >  t/t4012-diff-binary.sh | 14 +++++++-------
> >  2 files changed, 30 insertions(+), 21 deletions(-)
> >
> > diff --git a/diff.c b/diff.c
> > index 974626a621..cf38e1dc88 100644
> > --- a/diff.c
> > +++ b/diff.c
> > @@ -2591,7 +2591,7 @@ void print_stat_summary(FILE *fp, int files,
> >  static void show_stats(struct diffstat_t *data, struct diff_options *options)
> >  {
> >  	int i, len, add, del, adds = 0, dels = 0;
> > -	uintmax_t max_change = 0, max_len = 0;
> > +	uintmax_t max_change = 0, max_width = 0;
>
> Why rename `max_len`, but not `len`? I would have expected (and agreed
> with seeing) `len` to be renamed to `width`, too.

That is a valid point.
There is, however, already a variable called "width".
And renaming this one into a new one as well ?

>
> >  	int total_files = data->nr, count;
> >  	int width, name_width, graph_width, number_width = 0, bin_width = 0;
> >  	const char *reset, *add_c, *del_c;
> > @@ -2620,9 +2620,9 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
> >  			continue;
> >  		}
> >  		fill_print_name(file);
> > -		len = strlen(file->print_name);
> > -		if (max_len < len)
> > -			max_len = len;
> > +		len = utf8_strwidth(file->print_name);
> > +		if (max_width < len)
> > +			max_width = len;
> >
> >  		if (file->is_unmerged) {
> >  			/* "Unmerged" is 8 characters */
> > @@ -2646,7 +2646,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
> >
> >  	/*
> >  	 * We have width = stat_width or term_columns() columns total.
> > -	 * We want a maximum of min(max_len, stat_name_width) for the name part.
> > +	 * We want a maximum of min(max_width, stat_name_width) for the name part.
> >  	 * We want a maximum of min(max_change, stat_graph_width) for the +- part.
> >  	 * We also need 1 for " " and 4 + decimal_width(max_change)
> >  	 * for " | NNNN " and one the empty column at the end, altogether
> > @@ -2701,8 +2701,8 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
> >  		graph_width = options->stat_graph_width;
> >
> >  	name_width = (options->stat_name_width > 0 &&
> > -		      options->stat_name_width < max_len) ?
> > -		options->stat_name_width : max_len;
> > +		      options->stat_name_width < max_width) ?
> > +		options->stat_name_width : max_width;
>
> It is a bit sad that the diff lines regarding the renamed variable drown
> out the actual change (`strlen()` -> `utf8_strwidth()`). But the end
> result is nicer.
>
> Thank you for working on this!
> Dscho

Thanks so much for the review - let's see if I can make a better patch
the next days (better say weeks)