Torsten Bögershausen <tboegi@xxxxxx> writes: > The function git_wcwidth() returns for a given unicode code point the > width on the display: > -1 for control characters, > 0 for combining or other non-visible code points > 1 for e.g. ASCII > 2 for double-width code points. > > This table had been originally been extracted for one Unicode version, > probably 3.2. > > Make it possible to update the to a later version of Unicode: > > Factor out the table from utf8.c into unicode_width.h and > add the script update_unicode.sh to update the table based on the latest > Unicode specification files. > > Thanks to > Peter Krefting <peter@xxxxxxxxxxxxxxxx> and > Kevin Bracey <kevin@xxxxxxxxx> > for helping with their Unicode knowledge > > Signed-off-by: Torsten Bögershausen <tboegi@xxxxxx> > --- I would say this is a good idea, but a few nitpicks. > diff --git a/unicode_width.h b/unicode_width.h > new file mode 100644 > index 0000000..4db7803 > --- /dev/null > +++ b/unicode_width.h > @@ -0,0 +1,288 @@ Please update the script (and the resulting file) to caution against misuse/mismanagement of this file by adding a comment to at least state: - This is a generated file and you are not supposed to modify it; and - This is to be included only once from one place in the code and that is why this does not use the usual #ifndef X/#define X/#endif double-inclusion guards. An obvious and viable alternative to the second would be to do the usual double-inclusion guard. I do not have much preference either way. > +static const struct interval zero_width[] = { > ... > +}; > +static const struct interval double_width[] = { > ... > +}; > diff --git a/update_unicode.sh b/update_unicode.sh > new file mode 100755 > index 0000000..000b937 > --- /dev/null > +++ b/update_unicode.sh > @@ -0,0 +1,37 @@ > +#!/bin/sh > +#See http://www.unicode.org/reports/tr44/ > +# > +#Me Enclosing_Mark an enclosing combining mark > +#Mn Nonspacing_Mark a nonspacing combining mark (zero advance width) > +#Cf Format a format control character Please have a SP after # in these comments to make them readable? > +# > +UNICODEWIDTH_H=../unicode_width.h > +if ! test -d unicode; then > + mkdir unicode > +fi && Style: if ! test -d unicode then ... fi > +( cd unicode && > + if ! test -f UnicodeData.txt; then > + wget http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt > + fi && > + if ! test -f EastAsianWidth.txt; then > + wget http://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt > + fi && > + if ! test -d uniset; then > + git clone https://github.com/depp/uniset.git > + fi && > + ( > + cd uniset && > + if ! test -x uniset; then > + autoreconf -i && > + ./configure --enable-warnings=-Werror CFLAGS='-O0 -ggdb' What are these "-O0 -ggdb" about? > + fi && > + make > + ) && > + echo "static const struct interval zero_width[] = {" >$UNICODEWIDTH_H && > + UNICODE_DIR=. ./uniset/uniset --32 cat:Me,Mn,Cf + U+1160..U+11FF - U+00AD | > + grep -v plane >>$UNICODEWIDTH_H && > + echo "};" >>$UNICODEWIDTH_H && > + echo "static const struct interval double_width[] = {" >>$UNICODEWIDTH_H && > + UNICODE_DIR=. ./uniset/uniset --32 eaw:F,W >>$UNICODEWIDTH_H && > + echo "};" >>$UNICODEWIDTH_H > +) > @@ -147,7 +90,7 @@ static int git_wcwidth(ucs_char_t ch) > return -1; > > /* binary search in table of non-spacing characters */ > - if (bisearch(ch, combining, sizeof(combining) > + if (bisearch(ch, zero_width, sizeof(zero_width) > / sizeof(struct interval) - 1)) To my eyes, that line looks folded at a funny place. I think it is more conventional to fold after the operator, i.e. if (bisearch(ch, zero_width, sizeof(zero_width) / sizeof(struct interval) - 1)) but if (bisearch(ch, zero_width, sizeof(zero_width) / sizeof(struct interval) - 1)) may probably be a lot more logical and readable. Maybe it is just me. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html