On 16/04/2014 07:48, Torsten Bögershausen wrote:
On 15.04.14 21:10, Peter Krefting wrote:
Torsten Bögershausen:
diff --git a/utf8.c b/utf8.c
index a831d50..77c28d4 100644
--- a/utf8.c
+++ b/utf8.c
Is there a script that generates this code from the Unicode database files, or did you hand-update it?
Some of the code points which have "0 length on the display" are called
"combining", others are called "vowels" or "accents".
E.g. U+05BF is not marked as any of them, but if you look at the glyph, it
should be combining (please correct me if that is wrong).
Indeed it is combining (more specifically it has General Category
"Nonspacing_Mark" = "Mn").
If I could find a file that indicates, for each code point, what it is,
I could write a script.
The most complete and machine-readable data are in these files:
http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
http://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt
The general categories can also be seen more legibly in:
http://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedGeneralCategory.txt
For docs, see:
http://www.unicode.org/reports/tr44/
http://www.unicode.org/reports/tr11/
http://www.unicode.org/ucd/
The existing utf8.c comments describe the attributes being selected from
the tables (general categories "Cf","Mn","Me", East Asian Width "W",
"F"). And they suggest that the combining character table was originally
auto-generated from UnicodeData.txt with a "uniset" tool. Presumably this?
https://github.com/depp/uniset
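
To illustrate how the combining table could be derived from UnicodeData.txt,
here is a minimal sketch in Python. It is not how uniset works and not
necessarily how the existing table was made. The function name
zero_width_ranges is hypothetical, and the sample lines are truncated to the
first three semicolon-separated fields of the file (code point, name,
General_Category), which are the only ones the sketch reads.

```python
# Sketch only: collect code points whose General_Category is Cf, Mn or Me
# and merge consecutive ones into inclusive (first, last) ranges.
# Assumption: input lines follow the UnicodeData.txt layout, where field 0
# is the hex code point and field 2 is the General_Category.
# Limitation: "<..., First>" / "<..., Last>" range pairs are not expanded.

ZERO_WIDTH_CATEGORIES = {"Cf", "Mn", "Me"}


def zero_width_ranges(lines):
    ranges = []
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 3:
            continue  # blank or malformed line
        code_point = int(fields[0], 16)
        if fields[2] not in ZERO_WIDTH_CATEGORIES:
            continue
        if ranges and ranges[-1][1] + 1 == code_point:
            # Extend the previous range by one code point.
            ranges[-1] = (ranges[-1][0], code_point)
        else:
            ranges.append((code_point, code_point))
    return ranges


# Truncated sample around U+05BF (first three fields only).
UNICODE_DATA_SAMPLE = """\
05BD;HEBREW POINT METEG;Mn
05BE;HEBREW PUNCTUATION MAQAF;Pd
05BF;HEBREW POINT RAFE;Mn
05C0;HEBREW PUNCTUATION PASEQ;Po
05C1;HEBREW POINT SHIN DOT;Mn
05C2;HEBREW POINT SIN DOT;Mn
""".splitlines()

if __name__ == "__main__":
    # Print the ranges as C-style initializer rows.
    for first, last in zero_width_ranges(UNICODE_DATA_SAMPLE):
        print("{ 0x%04X, 0x%04X }," % (first, last))
```

With the real file, the same function could be fed open("UnicodeData.txt")
directly, since it only iterates over lines.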
The fullwidth-checking code looks like it was done by hand, although
apparently uniset can process EastAsianWidth.txt.
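
The fullwidth ranges could in principle be generated the same way. The
following Python sketch assumes the EastAsianWidth.txt layout of
"code point or first..last", a semicolon, the width property, and an optional
"#" comment. The function name wide_ranges is hypothetical and the sample
lines are abbreviated (comments dropped).

```python
# Sketch only: collect code points whose East_Asian_Width is W or F and
# merge adjacent entries into inclusive (first, last) ranges.
# Assumption: each data line is "XXXX;P" or "XXXX..YYYY;P", optionally
# followed by a "#" comment, with optional spaces around the semicolon.

WIDE_PROPERTIES = {"W", "F"}


def wide_ranges(lines):
    ranges = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop trailing comment
        if ";" not in data:
            continue  # blank or comment-only line
        span, prop = (part.strip() for part in data.split(";", 1))
        if prop not in WIDE_PROPERTIES:
            continue
        first, _, last = span.partition("..")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo
        if ranges and ranges[-1][1] + 1 == lo:
            # Adjacent to the previous range: merge them.
            ranges[-1] = (ranges[-1][0], hi)
        else:
            ranges.append((lo, hi))
    return ranges


# Abbreviated sample lines (comments removed).
EAST_ASIAN_WIDTH_SAMPLE = """\
# comment-only line
0020;Na
1100..115F;W
3000;F
3001..3003;W
""".splitlines()

if __name__ == "__main__":
    for first, last in wide_ranges(EAST_ASIAN_WIDTH_SAMPLE):
        print("{ 0x%04X, 0x%04X }," % (first, last))
```

Note that W and F entries are merged into one range when they are adjacent,
since both are treated as double-width here.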
Kevin