Re: [PATCH] Unicode: update of combining code points

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 2014-04-16 12.51, Kevin Bracey wrote:
> On 16/04/2014 07:48, Torsten Bögershausen wrote:
>> On 15.04.14 21:10, Peter Krefting wrote:
>>> Torsten Bögershausen:
>>>
>>>> diff --git a/utf8.c b/utf8.c
>>>> index a831d50..77c28d4 100644
>>>> --- a/utf8.c
>>>> +++ b/utf8.c
>>> Is there a script that generates this code from the Unicode database files, or did you hand-update it?
>>>
>> Some of the code points which have "0 length on the display" are called
>> "combining", others are called "vowels" or "accents".
>> E.g. 5BF is not marked any of them, but if you look at the glyph, it should
>> be combining (please correct me if that is wrong).
> 
> Indeed it is combining (more specifically it has General Category "Nonspacing_Mark" = "Mn").
> 
>>
>> If I could have found a file which indicates for each code point, what it
>> is, I could write a script.
>>
> 
> The most complete and machine-readable data are in these files:
> 
> http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
> http://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt
> 
> The general categories can also be seen more legibly in:
> 
> http://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedGeneralCategory.txt
> 
> For docs, see:
> 
> http://www.unicode.org/reports/tr44/
> http://www.unicode.org/reports/tr11/
> http://www.unicode.org/ucd/
> 
> The existing utf8.c comments describe the attributes being selected from the tables (general categories "Cf","Mn","Me", East Asian Width "W", "F"). And they suggest that the combining character table was originally auto-generated from UnicodeData.txt with a "uniset" tool. Presumably this?
> 
> https://github.com/depp/uniset
> 
> The fullwidth-checking code looks like it was done by hand, although apparently uniset can process EastAsianWidth.txt.
> 
> Kevin
Excellent, thanks for the pointers.
Running the script below shows that 
"0X00AD SOFT HYPHEN" should have zero length (and some others too).
I wonder if that is really the case, and which one of the last 2 lines 
in the script is the right one.

What does this mean for us:
"Cf 	Format 	a format control character"


#!/bin/sh

if ! test -f UnicodeData.txt; then
  wget http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
fi &&
if ! test -f EastAsianWidth.txt; then
  wget http://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt
fi
if ! test -f DerivedGeneralCategory.txt; then
  wget http://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedGeneralCategory.txt
fi &&
if ! test -d uniset; then
  git clone https://github.com/tboegi/uniset.git
fi &&
(
  cd uniset &&
  if ! test -x uniset; then 
    autoreconf -i &&
    ./configure --enable-warnings=-Werror CFLAGS='-O0 -ggdb'
  fi &&
  make
) &&
UNICODE_DIR=. ./uniset/uniset --32 cat:Me,Mn,Cf
#UNICODE_DIR=. ./uniset/uniset --32 cat:Me,Mn










> 
> -- 
> To unsubscribe from this list: send the line "unsubscribe git" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]