Re: [PATCH] builtin-blame.c: Use utf8_strwidth for author's names

Johannes Schindelin <Johannes.Schindelin@xxxxxx> · Mon, 2 Feb 2009 13:40:11 +0100 (CET)

Hi,

On Sun, 1 Feb 2009, Junio C Hamano wrote:

> Johannes Schindelin <Johannes.Schindelin@xxxxxx> writes:
> 
> > On Fri, 30 Jan 2009, Geoffrey Thomas wrote:
> >
> >> Currently, however, printf("%*.*s", width, width, author) is simply 
> >> wrong, because printf only cares about bytes, not screen columns. Do 
> >> you think I should fall back on the old behavior if 
> >> i18n.commitencoding is set, or if at least one of the author names 
> >> isn't parseable as UTF-8, or something? Or should I be doing this 
> >> with iconv and assuming all commits are encoded in the current 
> >> encoding specified via $LANG or $LC_whatever?
> >
> > I do not know what encoding the author is at that point, but if you 
> > cannot be sure that it is UTF-8, using utf8_strwidth() is just as 
> > wrong as the current code, IMHO.
> 
> That is true, but then we are not losing anything.
> 
> This codepath is not about the payload (the contents of the files) but 
> the author name part of the commit log message, and UTF-8 would probably 
> be the only sensible encoding to standardize on.

Almost agree, except for shops where you have an enforced encoding that 
you cannot easily change.

And last time I checked, many more encodings used 1 character/byte (or for 
that matter, 1 column / byte) than not; utf8_width would be "more wrong" 
than strlen() here, because strlen() would "happen to work" here.

> If your project uses UTF-8 for everybody, great, we will align them 
> better than we did before.  If not, sorry, you will get a different 
> misaligned names.

There _has_ to be a way to check if the current author string is encoded 
in UTF-8.  All I am asking is that the original poster would put just a 
_little_ more effort into the issue and make the thing dependent on the 
knowledge -- as opposed to the assumption -- that the author is encoded in 
UTF-8.

> That assumes utf8_width() does not barf when fed an invalid byte 
> sequence, but I did not think it is that fragile (I didn't actually 
> audit the codepath, though).

That is the code that barfs in wcwidth:

        if (ch < 32 || (ch >= 0x7f && ch < 0xa0))
                return -1;

That is not a big problem, but Geoff's code does not handle that case 
correctly.  For example, in Code-page 437, a name like "£ïñûç" would 
result in a negative width.

But hey, it is definitely not my itch, I will never suffer from the 
fallout of this patch, as I am safely within US-ASCII.  I just thought I 
saw a potential problem and a possible way out.  That's it.

Ciao,
Dscho