Re: [PATCH] Unicode: update of combining code points

Kevin Bracey <kevin@xxxxxxxxx> · Wed, 16 Apr 2014 13:51:43 +0300

On 16/04/2014 07:48, Torsten Bögershausen wrote:
> On 15.04.14 21:10, Peter Krefting wrote:
>> Torsten Bögershausen:
>>
>>> diff --git a/utf8.c b/utf8.c
>>> index a831d50..77c28d4 100644
>>> --- a/utf8.c
>>> +++ b/utf8.c
>>
>> Is there a script that generates this code from the Unicode database
>> files, or did you hand-update it?
>
> Some of the code points which have "0 length on the display" are called
> "combining", others are called "vowels" or "accents".
> E.g. 5BF is not marked any of them, but if you look at the glyph, it should
> be combining (please correct me if that is wrong).

Indeed it is combining (more specifically it has General Category 
"Nonspacing_Mark" = "Mn").
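
A quick way to double-check a single code point is Python's standard
unicodedata module, which exposes the same General_Category values. This
is only a sketch: the helper name general_category is made up for
illustration, and the answer reflects whichever Unicode version the
interpreter bundles, not necessarily the latest UCD files.

```python
import unicodedata


def general_category(code_point):
    """Hypothetical helper: return the two-letter General_Category."""
    return unicodedata.category(chr(code_point))


# U+05BF is the code point discussed above.
char = chr(0x05BF)
print(f"U+05BF {unicodedata.name(char)}: {general_category(0x05BF)}")
```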

> If I could have found a file which indicates for each code point, what it
> is, I could write a script.

The most complete and machine-readable data are in these files:

http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
http://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt

The general categories can also be seen more legibly in:

http://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedGeneralCategory.txt

For docs, see:

http://www.unicode.org/reports/tr44/
http://www.unicode.org/reports/tr11/
http://www.unicode.org/ucd/

The existing utf8.c comments describe the attributes being selected from
the tables (general categories "Cf", "Mn" and "Me"; East Asian Width "W"
and "F"). They also suggest that the combining character table was
originally auto-generated from UnicodeData.txt with a "uniset" tool.
Presumably this one?

https://github.com/depp/uniset
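
Independently of uniset (I am not assuming anything about its command
line here), the selection could be sketched in a few lines of Python.
UnicodeData.txt is semicolon-separated, with the code point in field 0,
the name in field 1 and the general category in field 2. The function
names and the output layout below are illustrative, not what utf8.c
actually uses, and the real table may apply extra exceptions on top of
the three categories. The sample holds a handful of lines in the
UnicodeData.txt format rather than the whole file.

```python
ZERO_WIDTH = {"Cf", "Mn", "Me"}  # categories named in the utf8.c comments


def zero_width_code_points(lines):
    """Yield code points whose general category is in ZERO_WIDTH."""
    range_start = None
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 3:
            continue  # blank or malformed line
        cp, name, category = int(fields[0], 16), fields[1], fields[2]
        if name.endswith(", First>"):
            range_start = cp  # large blocks are given as First/Last pairs
            continue
        first = cp
        if name.endswith(", Last>") and range_start is not None:
            first = range_start
        range_start = None
        if category in ZERO_WIDTH:
            yield from range(first, cp + 1)


def merge_ranges(code_points):
    """Collapse code points into sorted (first, last) inclusive ranges."""
    ranges = []
    for cp in sorted(code_points):
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1][1] = cp
        else:
            ranges.append([cp, cp])
    return [tuple(r) for r in ranges]


def format_c_table(ranges):
    """Render the ranges as C initializer rows (illustrative layout)."""
    return "\n".join(
        f"\t{{ 0x{first:04X}, 0x{last:04X} }}," for first, last in ranges
    )


SAMPLE = """\
0300;COMBINING GRAVE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING GRAVE;;;;
0301;COMBINING ACUTE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING ACUTE;;;;
05BE;HEBREW PUNCTUATION MAQAF;Pd;0;R;;;;;N;;;;;
05BF;HEBREW POINT RAFE;Mn;23;NSM;;;;;N;;;;;
200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;
""".splitlines()

print(format_c_table(merge_ranges(zero_width_code_points(SAMPLE))))
```

To run it against the real data, pass an open file object for
UnicodeData.txt instead of SAMPLE.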

The fullwidth-checking code looks like it was done by hand, although 
apparently uniset can process EastAsianWidth.txt.
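
The wide/fullwidth ranges could be derived in the same spirit. A rough
sketch, assuming the EastAsianWidth.txt layout of "code point or
first..last range; width class; # comment", and assuming the entries
are sorted by code point as they are in the published file. The
function name wide_ranges is made up, and the sample is a few lines in
that format rather than the real file.

```python
def wide_ranges(lines):
    """Return merged (first, last) ranges with East Asian Width W or F."""
    ranges = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop trailing comment
        if ";" not in data:
            continue  # comment-only or blank line
        span, width = (part.strip() for part in data.split(";", 1))
        if width not in ("W", "F"):
            continue
        first, _, last = span.partition("..")
        first_cp = int(first, 16)
        last_cp = int(last, 16) if last else first_cp
        if ranges and ranges[-1][1] + 1 == first_cp:
            # Adjacent to the previous range: extend it.
            ranges[-1] = (ranges[-1][0], last_cp)
        else:
            ranges.append((first_cp, last_cp))
    return ranges


SAMPLE_EAW = """\
# abbreviated sample in EastAsianWidth.txt format
0020;Na # Zs SPACE
1100..115F;W # Lo [96] HANGUL CHOSEONG KIYEOK..HANGUL CHOSEONG FILLER
3000;F # Zs IDEOGRAPHIC SPACE
3001..3003;W # Po [3] IDEOGRAPHIC COMMA..DITTO MARK
FF01..FF03;F # Po [3] FULLWIDTH EXCLAMATION MARK..FULLWIDTH NUMBER SIGN
""".splitlines()

for first, last in wide_ranges(SAMPLE_EAW):
    print(f"0x{first:04X}..0x{last:04X}")
```

W and F entries are merged into one list because the width check
treats both as double-width; keep them separate if the distinction
matters.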

Kevin
