Re: [PATCH v4 1/2] diff.c: When appropriate, use utf8_strwidth(), part1

tboegi@xxxxxx writes:

> From: Torsten Bögershausen <tboegi@xxxxxx>
> Subject: Re: [PATCH v4 1/2] diff.c: When appropriate, use utf8_strwidth(), part1

Given that 2/2 does not share a similar title, "part1" sounds somewhat
strange.  In any case, 'when appropriate,' is probably best left unsaid,
as it is almost a given; we won't deliberately use something that
is not appropriate.  Even if we were to keep that phrase, downcase
"When".


> When unicode filenames (encoded in UTF-8) are used, the visible width
> on the screen is not the same as strlen(filename).
>
> For example, `git log --stat` may produce an output like this:
>
> [snip the header]
>
>  Arger.txt  | 1 +
>  Ärger.txt | 1 +
>  2 files changed, 2 insertions(+)
>
> A side note: the original report was about Cyrillic filenames.
> After some investigation it turned out that
> a) This is not a problem with "ambiguous characters" in Unicode
> b) The same problem exists for all Unicode code points (so we
>   can use Latin-based umlauts for demonstrations below)
>
> The 'Ä' takes the same space on the screen as the 'A',
> but needs one more byte in memory, so the `git log --stat` output
> for "Arger.txt" (!) gets misaligned:
> The maximum length is derived from "Ärger.txt", 10 bytes in memory,
> 9 positions on the screen. That is why "Arger.txt" gets one extra ' '
> for alignment; it needs 9 bytes in memory.
> If there was a file "Ö", it would be correctly aligned by chance,
> but "Öhö" would not.
>
> The solution is of course, to use utf8_strwidth() instead of strlen()
> when dealing with the width on screen.
>
> Side note 1:
> Needed changes for this fix are split into 2 commits:
> This commit only changes strlen() into utf8_strwidth() in diff.c:
> The next commit will add tests and further needed changes.

I am not sure if it makes sense to split them into two.  It is hard
for us to demonstrate the need for this step if it does not come
with its own test.
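
For what it's worth, the mismatch described in the log message is
easy to show outside Git.  The following is only a standalone sketch,
not code from diff.c: count_columns() is a hypothetical stand-in for
utf8_strwidth() that assumes valid UTF-8 and one column per code
point, so unlike the real function it ignores wide and zero-width
characters.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for utf8_strwidth(), for illustration only.
 * Counts code points by skipping UTF-8 continuation bytes (10xxxxxx),
 * assuming valid UTF-8 and one screen column per code point.
 */
static size_t count_columns(const char *s)
{
	size_t cols = 0;

	for (; *s; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			cols++;
	return cols;
}

/* Extra bytes a name occupies beyond its on-screen columns. */
static size_t byte_surplus(const char *s)
{
	return strlen(s) - count_columns(s);
}
```

For "Arger.txt" both counts are 9.  For "Ärger.txt" (with 'Ä' encoded
as the two bytes 0xC3 0x84) strlen() gives 10 while count_columns()
gives 9, which is the one-position surplus the log message describes.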

> Side note 2:
> Junio C Hamano suspects that there is probably more work to be done,
> in a separate commit:
> Code in diff.c::pprint_rename() that "abbreviates" overly long pathnames
> and "transforms" rename lines like
> "a/b/c -> a/B/c" into the shorter
> "a/{b->B}/c" form, and IIRC this is all byte based.

I already said that I suspect the {b->B} conversion is OK, so the side
note is probably more noise than it is useful.

>
> Reported-by: Alexander Meshcheryakov <alexander.s.m@xxxxxxxxx>
> Signed-off-by: Torsten Bögershausen <tboegi@xxxxxx>
> ---
>  diff.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/diff.c b/diff.c
> index 974626a621..b5df464de5 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -2620,7 +2620,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  			continue;
>  		}
>  		fill_print_name(file);
> -		len = strlen(file->print_name);
> +		len = utf8_strwidth(file->print_name);
>  		if (max_len < len)
>  			max_len = len;
>
> @@ -2743,7 +2743,7 @@ static void show_stats(struct diffstat_t *data, struct diff_options *options)
>  		 * "scale" the filename
>  		 */
>  		len = name_width;
> -		name_len = strlen(name);
> +		name_len = utf8_strwidth(name);
>  		if (name_width < name_len) {
>  			char *slash;
>  			prefix = "...";
> --
> 2.34.0
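
One thing to keep in mind once the measured width feeds the output:
a printf() field width for %s counts bytes, not columns, so padding
a multi-byte name with "%-*s" would still come out short on screen.
A minimal sketch of padding by columns instead, using a hypothetical
count_columns() helper in place of utf8_strwidth(); format_padded()
is likewise made up for illustration and is not how diff.c is
necessarily structured:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for utf8_strwidth(): one column per code point. */
static size_t count_columns(const char *s)
{
	size_t cols = 0;

	for (; *s; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			cols++;
	return cols;
}

/*
 * Write "name" followed by enough spaces to fill "width" columns,
 * then a '|' separator.  The padding is computed from columns and
 * emitted as an empty, space-padded field, because "%-*s" applied to
 * the name itself would pad by bytes.
 */
static int format_padded(char *buf, size_t size, const char *name, int width)
{
	int cols = (int)count_columns(name);
	int pad = width > cols ? width - cols : 0;

	return snprintf(buf, size, "%s%*s|", name, pad, "");
}
```

Called with a width of 10 for both "Arger.txt" and "Ärger.txt", this
puts the '|' in the same screen column for each name, even though the
two results differ in byte length.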


