Re: [PATCH] Unicode: update of combining code points

On 16/04/2014 22:58, Torsten Bögershausen wrote:
> Excellent, thanks for the pointers.
> Running the script below shows that
> "0X00AD SOFT HYPHEN" should have zero length (and some others too).
> I wonder if that is really the case, and which one of the last 2 lines
> in the script is the right one.
>
> What does this mean for us:
> "Cf / Format / a format control character"

It may be worth digging back through the Git logs to check the original logic, but the comments suggest that "Cf" characters have been treated as zero-width. That makes sense - they're usually markers indicating things like bidirectional text flow, so they won't take up any space themselves. (Although they may cause even more extreme layout effects...)

Soft-hyphen is noted as an explicit exception to the rule in the utf8.c comments. As of Unicode 4.0, it's supposed to be a character indicating a point where a hyphen could be placed if a line-wrap occurs, and if that wrap happens, then it can actually take up 1 space, otherwise not. So its width could be either 0 or 1, depending. Or, quite likely, the terminal doesn't treat it specially, and it always just looks like a hyphen... Thus we err on the safe side and give it width 1.
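
To make that rule concrete, here is a rough Python sketch (not the C code in utf8.c). The helper names approx_char_width and approx_width are made up for illustration, and I'm assuming that only the Mn, Me and Cf categories count as zero-width. It deliberately ignores double-width characters and control characters.

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"

def approx_char_width(ch):
    """Hypothetical helper: rough column width of a single code point."""
    if ch == SOFT_HYPHEN:
        # Explicit exception: category is Cf, but err on the safe side.
        return 1
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        # Combining marks and format controls take no column of their own.
        return 0
    # Assumption: everything else is one column (wide characters ignored).
    return 1

def approx_width(s):
    """Hypothetical helper: sum of the per-character widths."""
    return sum(approx_char_width(ch) for ch in s)
```

With this, approx_width("co\u00adop") comes out as 5, because the soft hyphen is counted as one column, while a LEFT-TO-RIGHT MARK (U+200E) or a combining acute accent (U+0301) adds nothing.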

See http://en.wikipedia.org/wiki/Soft_hyphen for background.

The comments suggest adding "-00AD +1160-11FF" to the uniset command line for that tweak and for composing Hangul. (The +200B tweak isn't necessary any more - Zero-Width Space U+200B became Cf officially in Unicode 4.0.1:

http://en.wikipedia.org/wiki/Zero-width_space
http://www.unicode.org/review/resolved-pri.html#pri21
)
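
As a sanity check on those tweaks, the following Python sketch builds an interval table along the same lines: categories Mn, Me and Cf, minus U+00AD, plus U+1160-U+11FF. It is only an approximation of what uniset produces, not a replacement for it. The function names are invented, and the result depends on the Unicode version shipped with your Python.

```python
import unicodedata

def is_zero_width(cp):
    """Hypothetical predicate mirroring the "-00AD +1160-11FF" tweaks."""
    if cp == 0x00AD:
        return False          # "-00AD": soft hyphen keeps width 1
    if 0x1160 <= cp <= 0x11FF:
        return True           # "+1160-11FF": composing Hangul Jamo
    return unicodedata.category(chr(cp)) in ("Mn", "Me", "Cf")

def zero_width_ranges(limit=0x110000):
    """Collapse the zero-width code points below limit into (first, last) pairs."""
    ranges = []
    start = None
    for cp in range(limit):
        if is_zero_width(cp):
            if start is None:
                start = cp
        elif start is not None:
            ranges.append((start, cp - 1))
            start = None
    if start is not None:
        ranges.append((start, limit - 1))
    return ranges
```

Calling zero_width_ranges(0x3000) should give a table that covers U+200B without any explicit "+200B", since unicodedata.category("\u200b") already reports Cf.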

All of this is only really an approximation - a best-effort attempt to figure out the width of a string without any actual communication with the display device. So it'll never be perfect. The choice between double and single width in particular will often be unpredictable, unless you have deeper locale knowledge.
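
The double-versus-single problem can be illustrated with the East Asian Width property. This is a Python sketch, not what utf8.c does. The column_width name and the ambiguous_is_wide flag are made up, and the flag stands in for the locale knowledge we don't have.

```python
import unicodedata

def column_width(ch, ambiguous_is_wide=False):
    """Hypothetical helper: 1 or 2 columns based on East Asian Width."""
    eaw = unicodedata.east_asian_width(ch)
    if eaw in ("W", "F"):
        return 2              # Wide and Fullwidth: always two columns
    if eaw == "A":
        # Ambiguous: depends on the terminal's font and locale,
        # which cannot be determined from the string alone.
        return 2 if ambiguous_is_wide else 1
    return 1                  # Narrow, Halfwidth, Neutral
```

A CJK ideograph such as U+4E2D is always two columns here, but Greek alpha (U+03B1) is classed as ambiguous, so its width flips between 1 and 2 depending on the flag. Without asking the terminal, either answer is a guess.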

Actually, while doing this, I've realised that this was originally Markus Kuhn's implementation, and that is acknowledged at the top of the file:

http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c

Good, because he knows what he's doing.

Kevin



