Re: [PATCH 2/2] Make it possible to update git_wcwidth()

Junio C Hamano <gitster@xxxxxxxxx> · Mon, 12 May 2014 10:54:41 -0700

Torsten Bögershausen <tboegi@xxxxxx> writes:

> The function git_wcwidth() returns for a given unicode code point the
> width on the display:
> -1 for control characters,
>  0 for combining or other non-visible code points
>  1 for e.g. ASCII
>  2 for double-width code points.
>
> This table had been originally been extracted for one Unicode version,
> probably 3.2.
>
> Make it possible to update the to a later version of Unicode:
>
> Factor out the table from utf8.c into unicode_width.h and
> add the script update_unicode.sh to update the table based on the latest
> Unicode specification files.
>
> Thanks to
> Peter Krefting <peter@xxxxxxxxxxxxxxxx> and
> Kevin Bracey <kevin@xxxxxxxxx>
> for helping with their Unicode knowledge
>
> Signed-off-by: Torsten Bögershausen <tboegi@xxxxxx>
> ---

I would say this is a good idea, but a few nitpicks.

> diff --git a/unicode_width.h b/unicode_width.h
> new file mode 100644
> index 0000000..4db7803
> --- /dev/null
> +++ b/unicode_width.h
> @@ -0,0 +1,288 @@

Please update the script (and the resulting file) to caution against
misuse/mismanagement of this file by adding a comment to at least
state:

 - This is a generated file and you are not supposed to modify it; and

 - This is to be included only once from one place in the code and
   that is why this does not use the usual #ifndef X/#define X/#endif
   double-inclusion guards.

An obvious and viable alternative to the second would be to do the
usual double-inclusion guard.  I do not have much preference either
way.

> +static const struct interval zero_width[] = {
> ...
> +};
> +static const struct interval double_width[] = {
> ...
> +};
> diff --git a/update_unicode.sh b/update_unicode.sh
> new file mode 100755
> index 0000000..000b937
> --- /dev/null
> +++ b/update_unicode.sh
> @@ -0,0 +1,37 @@
> +#!/bin/sh
> +#See http://www.unicode.org/reports/tr44/
> +#
> +#Me Enclosing_Mark  an enclosing combining mark
> +#Mn Nonspacing_Mark a nonspacing combining mark (zero advance width)
> +#Cf Format          a format control character

Please have a SP after # in these comments to make them readable?

> +#
> +UNICODEWIDTH_H=../unicode_width.h
> +if ! test -d unicode; then
> +	mkdir unicode
> +fi &&

Style:

	if ! test -d unicode
        then
        	...
	fi

> +( cd unicode &&
> +	if ! test -f UnicodeData.txt; then
> +		wget http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
> +	fi &&
> +	if ! test -f EastAsianWidth.txt; then
> +		wget http://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt
> +	fi &&
> +	if ! test -d uniset; then
> +		git clone https://github.com/depp/uniset.git
> +	fi &&
> +	(
> +		cd uniset &&
> +		if ! test -x uniset; then
> +			autoreconf -i &&
> +			./configure --enable-warnings=-Werror CFLAGS='-O0 -ggdb'

What are these "-O0 -ggdb" about?

> +		fi &&
> +		make
> +	) &&
> +	echo "static const struct interval zero_width[] = {" >$UNICODEWIDTH_H &&
> +	UNICODE_DIR=. ./uniset/uniset --32 cat:Me,Mn,Cf + U+1160..U+11FF - U+00AD |
> +	grep -v plane >>$UNICODEWIDTH_H &&
> +	echo "};" >>$UNICODEWIDTH_H &&
> +	echo "static const struct interval double_width[] = {" >>$UNICODEWIDTH_H &&
> +	UNICODE_DIR=. ./uniset/uniset --32 eaw:F,W >>$UNICODEWIDTH_H &&
> +	echo "};" >>$UNICODEWIDTH_H
> +)
> @@ -147,7 +90,7 @@ static int git_wcwidth(ucs_char_t ch)
>  		return -1;
>  
>  	/* binary search in table of non-spacing characters */
> -	if (bisearch(ch, combining, sizeof(combining)
> +	if (bisearch(ch, zero_width, sizeof(zero_width)
>  				/ sizeof(struct interval) - 1))

To my eyes, that line looks folded at a funny place.  I think it is
more conventional to fold after the operator, i.e.

	if (bisearch(ch, zero_width, sizeof(zero_width)	/
				sizeof(struct interval) - 1))

but

	if (bisearch(ch, zero_width,
  		     sizeof(zero_width)	/ sizeof(struct interval) - 1))

may probably be a lot more logical and readable.  Maybe it is just
me.

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html