Re: [PATCH 1/2] utf8: change type from int to size_t

Patrick Steinhardt <ps@xxxxxx> · Thu, 21 Mar 2024 11:30:19 +0100

On Sun, Mar 17, 2024 at 08:27:50AM +0000, Mohit Marathe via GitGitGadget wrote:
> From: Mohit Marathe <mohitmarathe@xxxxxxxxx>
> 
> This update modifies the variable types that are used in calls to the
> `utf8_strnwidth` and `utf8_strwidth` functions. This modification is
> a proactive measure in anticipation of an upcoming interface change
> to these functions in the subsequent patch.

It would help to mention `utf8_str{,n}width()` in the commit subject so
that it's immediately obvious which functions get touched by this
commit.

> Signed-off-by: Mohit Marathe <mohitmarathe@xxxxxxxxx>
> ---
>  builtin/blame.c    | 6 +++---
>  builtin/branch.c   | 2 +-
>  builtin/fetch.c    | 2 +-
>  builtin/worktree.c | 4 ++--
>  column.c           | 2 +-
>  diff.c             | 8 +++++---
>  gettext.c          | 2 +-
>  gettext.h          | 2 +-
>  pretty.c           | 4 ++--
>  utf8.c             | 4 ++--
>  wt-status.c        | 4 ++--
>  11 files changed, 21 insertions(+), 19 deletions(-)
> 
> diff --git a/builtin/blame.c b/builtin/blame.c
> index db1f56de61a..a72e2552305 100644
> --- a/builtin/blame.c
> +++ b/builtin/blame.c
> @@ -53,7 +53,7 @@ static const char *annotate_opt_usage[] = {
>  };
>  
>  static int longest_file;

We also end up assigning `longest_file = num`, where `num` became a
`size_t` now. We should likely change this variable's type to `size_t`
as well.

> -static int longest_author;
> +static size_t longest_author;
>  static int max_orig_digits;
>  static int max_digits;
>  static int max_score_digits;
> @@ -529,7 +529,7 @@ static void emit_other(struct blame_scoreboard *sb, struct blame_entry *ent, int
>  					name = ci.author_mail.buf;
>  				else
>  					name = ci.author.buf;
> -				pad = longest_author - utf8_strwidth(name);
> +				pad = cast_size_t_to_int(longest_author - utf8_strwidth(name));

I feel like this cast here is somewhat dubious. Both `longest_author`
and the return value of `utf8_strwidth()` are `size_t` now. So an
alternative way to handle this would be to assign the return value to a
temporary variable and then `if (longest_author > name_len) BUG(...)`.

>  				printf(" (%s%*s %10s",
>  				       name, pad, "",
>  				       format_time(ci.author_time,
> @@ -631,7 +631,7 @@ static void find_alignment(struct blame_scoreboard *sb, int *option)

A few lines before this hunk we declare `longest_src_lines` and
`longest_dst_lines` as `int`, but we assign `num` to them. We should fix
those to become `size_t`, as well. This will bubble up so that we also
have to fix `max_orig_digits` and `max_digits`.

>  	for (e = sb->ent; e; e = e->next) {
>  		struct blame_origin *suspect = e->suspect;
> -		int num;
> +		size_t num;
>  
>  		if (compute_auto_abbrev)
>  			auto_abbrev = update_auto_abbrev(auto_abbrev, suspect);
> diff --git a/builtin/branch.c b/builtin/branch.c
> index 8c2305ad2c8..321c3558f2d 100644
> --- a/builtin/branch.c
> +++ b/builtin/branch.c
> @@ -349,7 +349,7 @@ static int calc_maxwidth(struct ref_array *refs, int remote_bonus)
>  	for (i = 0; i < refs->nr; i++) {
>  		struct ref_array_item *it = refs->items[i];
>  		const char *desc = it->refname;
> -		int w;
> +		size_t w;

This isn't sufficient as we end up assignign `max = w`, where `max` is
declared as `int`. We would thus also have to change that variable and
then change the return value of `calc_maxwidth()` itself to be `size_t`.

Furthermore, we do `w += remote_bonus` in this function, where
`remote_bonus` is of type `int`. Is it guaranteed that this variable is
always positive? If so, it should likely become a `size_t`, too.
Otherwise we'll need to have some safety checks that the result does not
wrap.

>  		skip_prefix(it->refname, "refs/heads/", &desc);
>  		skip_prefix(it->refname, "refs/remotes/", &desc);
> diff --git a/builtin/fetch.c b/builtin/fetch.c
> index 46a793411a4..fee992c3c14 100644
> --- a/builtin/fetch.c
> +++ b/builtin/fetch.c
> @@ -705,7 +705,7 @@ static int refcol_width(const struct ref *ref_map, int compact_format)
>  		max = max * 2 / 3;
>  
>  	for (ref = ref_map; ref; ref = ref->next) {
> -		int rlen, llen = 0, len;
> +		size_t rlen, llen = 0, len;

We assign `width = rlen` further down, but `width` is an `int`. It
should likely be changed to become a `size_t` as well, which will
require us to change the return value of this function.

>  		if (ref->status == REF_STATUS_REJECT_SHALLOW ||
>  		    !ref->peer_ref ||
> diff --git a/builtin/worktree.c b/builtin/worktree.c
> index 9c76b62b02d..bdbf46fb658 100644
> --- a/builtin/worktree.c
> +++ b/builtin/worktree.c
> @@ -961,8 +961,8 @@ static void show_worktree_porcelain(struct worktree *wt, int line_terminator)
>  static void show_worktree(struct worktree *wt, int path_maxlen, int abbrev_len)
>  {
>  	struct strbuf sb = STRBUF_INIT;
> -	int cur_path_len = strlen(wt->path);
> -	int path_adj = cur_path_len - utf8_strwidth(wt->path);
> +	size_t cur_path_len = strlen(wt->path);
> +	int path_adj = cast_size_t_to_int(cur_path_len - utf8_strwidth(wt->path));
>  	const char *reason;

I was wondering whether there should be a sanity check here for whether
`cur_path_len >= path_adj`. But given that both measure the length of
`wt->path`, and given that `utf8_strwidth()` should only ever return a
value smaller than the original length, I think this should be fine.

>  	strbuf_addf(&sb, "%-*s ", 1 + path_maxlen + path_adj, wt->path);
> diff --git a/column.c b/column.c
> index 50bbccc92ee..ec874036de6 100644
> --- a/column.c
> +++ b/column.c
> @@ -22,7 +22,7 @@ struct column_data {
>  };
>  
>  /* return length of 's' in letters, ANSI escapes stripped */
> -static int item_length(const char *s)
> +static size_t item_length(const char *s)
>  {
>  	return utf8_strnwidth(s, strlen(s), 1);
>  }

The value of this function is assigned to `struct column_data::len`,
which is an array of `int`. That type needs to change.

> diff --git a/diff.c b/diff.c
> index e50def45383..4faf151345a 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -2629,7 +2629,8 @@ void print_stat_summary(FILE *fp, int files,
>  
>  static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  {
> -	int i, len, add, del, adds = 0, dels = 0;
> +	int i, add, del, adds = 0, dels = 0;
> +	size_t len = 0;
>  	uintmax_t max_change = 0, max_len = 0;

It's somewhat weird that `max_len` is declared as `uintmax_t`. Should it
become a `size_t`, as well?

>  	int total_files = data->nr, count;
>  	int width, name_width, graph_width, number_width = 0, bin_width = 0;
> @@ -2780,7 +2781,8 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  		char *name = file->print_name;
>  		uintmax_t added = file->added;
>  		uintmax_t deleted = file->deleted;
> -		int name_len, padding;
> +		size_t name_len;

We compare `name_len` with `name_width`. The latter should likely become
a `size_t`, as well.

> +		int padding;
>  
>  		if (!file->is_interesting && (added + deleted == 0))
>  			continue;
> @@ -2809,7 +2811,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  			if (slash)
>  				name = slash;
>  		}
> -		padding = len - utf8_strwidth(name);
> +		padding = cast_size_t_to_int(len - utf8_strwidth(name));

I think we shouldn't abuse `cast_size_t_to_int()` like this but instead
explicitly check whether `len >= len(name)`.

>  		if (padding < 0)
>  			padding = 0;
>  
> diff --git a/gettext.c b/gettext.c
> index 57facbc21ec..5a77b4f7202 100644
> --- a/gettext.c
> +++ b/gettext.c
> @@ -127,7 +127,7 @@ void git_setup_gettext(void)
>  }
>  
>  /* return the number of columns of string 's' in current locale */
> -int gettext_width(const char *s)
> +size_t gettext_width(const char *s)

We also need to adjust the callsite of this function. There's only a
single one in "builtin/fetch.c::display_ref_update()".

>  {
>  	static int is_utf8 = -1;
>  	if (is_utf8 == -1)
> diff --git a/gettext.h b/gettext.h
> index 484cafa5628..f161a21b45c 100644
> --- a/gettext.h
> +++ b/gettext.h
> @@ -31,7 +31,7 @@
>  #ifndef NO_GETTEXT
>  extern int git_gettext_enabled;
>  void git_setup_gettext(void);
> -int gettext_width(const char *s);
> +size_t gettext_width(const char *s);
>  #else
>  #define git_gettext_enabled (0)
>  static inline void git_setup_gettext(void)
> diff --git a/pretty.c b/pretty.c
> index bdbed4295aa..f03493c74b1 100644
> --- a/pretty.c
> +++ b/pretty.c
> @@ -1781,7 +1781,7 @@ static size_t format_and_pad_commit(struct strbuf *sb, /* in UTF-8 */
>  
>  	if (padding < 0) {
>  		const char *start = strrchr(sb->buf, '\n');
> -		int occupied;
> +		size_t occupied;
>  		if (!start)
>  			start = sb->buf;
>  		occupied = utf8_strnwidth(start, strlen(start), 1);

We're assigning `padding = (-padding) - occupied` next, where `padding`
is of type `int`. We should add some safety checks here.

> @@ -1802,7 +1802,7 @@ static size_t format_and_pad_commit(struct strbuf *sb, /* in UTF-8 */
>  		placeholder++;
>  		total_consumed++;
>  	}
> -	len = utf8_strnwidth(local_sb.buf, local_sb.len, 1);
> +	len = cast_size_t_to_int(utf8_strnwidth(local_sb.buf, local_sb.len, 1));

Why did you decide to use `cast_size_t_to_int()` instead of converting
`len` to be of type `size_t`?

>  	if (c->flush_type == flush_left_and_steal) {
>  		const char *ch = sb->buf + sb->len - 1;
> diff --git a/utf8.c b/utf8.c
> index 6bfaefa28eb..8ccdf684e07 100644
> --- a/utf8.c
> +++ b/utf8.c
> @@ -466,7 +466,7 @@ int utf8_fprintf(FILE *stream, const char *format, ...)
>  
>  	columns = fputs(buf.buf, stream);
>  	if (0 <= columns) /* keep the error from the I/O */
> -		columns = utf8_strwidth(buf.buf);
> +		columns = cast_size_t_to_int(utf8_strwidth(buf.buf));
>  	strbuf_release(&buf);
>  	return columns;
>  }
> @@ -806,7 +806,7 @@ void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int wid
>  		       const char *s)
>  {
>  	size_t slen = strlen(s);
> -	int display_len = utf8_strnwidth(s, slen, 0);
> +	size_t display_len = utf8_strnwidth(s, slen, 0);
>  	int utf8_compensation = slen - display_len;
>  
>  	if (display_len >= width) {

A few lines further down we have:

    int left = (width - dispay_len) / 2;

This should be changed to become a `size_t`.

> diff --git a/wt-status.c b/wt-status.c
> index 7108a92b52c..c847b4bb5ed 100644
> --- a/wt-status.c
> +++ b/wt-status.c
> @@ -325,7 +325,7 @@ static int maxwidth(const char *(*label)(int), int minval, int maxval)
>  
>  	for (i = minval; i <= maxval; i++) {
>  		const char *s = label(i);
> -		int len = s ? utf8_strwidth(s) : 0;
> +		size_t len = s ? utf8_strwidth(s) : 0;
>  		if (len > result)
>  			result = len;

`result` is of type `int`, so we'd have to change both its type as well
as the return type of `maxwidth()`.

>  	}
> @@ -341,7 +341,7 @@ static void wt_longstatus_print_unmerged_data(struct wt_status *s,
>  	static char *padding;
>  	static int label_width;
>  	const char *one, *how;
> -	int len;
> +	size_t len;
>  
>  	if (!padding) {
>  		label_width = maxwidth(wt_status_unmerged_status_string, 1, 7);

`len` gets passed to the formatter "%.*s". Quoting fprintf(3p):

    A field width, or precision, or both, may be indicated by an
    <asterisk> ('*'). In this case an argument of type int supplies the
    field width or precision.

So this must be an `int` or otherwise bad things will happen depending
on the size of those types on your platform.

Patrick
Attachment:
signature.asc

Description: PGP signature