Re: git on MacOSX and files with decomposed utf-8 file names

On Fri, 18 Jan 2008, JM Ibanez wrote:
>
> With the exception of Unicode. If you check the standard, two Unicode
> codepoints (i.e. the numeric value that gets stored on disk) *can* map
> to the same character, hence they are the same.

But if you want to make it clear, you can use "encoded character" or yes, 
"code point". 

But the thing is, even the Unicode standard tends to just say "character", 
and a Unicode string (for example) is defined to be a sequence of "code 
units" which in turn is about those *encoded* characters, which is all 
about the code points.
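
To make the "code unit" versus "code point" distinction concrete, here is a minimal Python sketch. The helper names (`code_points`, `utf8_units`, `utf16_units`) are made up for illustration and come from neither the standard nor git; the sketch only assumes Python 3's built-in codecs.

```python
def code_points(s):
    """Return the code points of s as integers."""
    return [ord(c) for c in s]


def utf8_units(s):
    """Return the UTF-8 code units (8-bit values) that encode s."""
    return list(s.encode("utf-8"))


def utf16_units(s):
    """Return the UTF-16 code units (16-bit values) that encode s."""
    raw = s.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


if __name__ == "__main__":
    # U+00E9 is inside the BMP, U+1F600 is outside it.
    for sample in ("\u00e9", "\U0001F600"):
        print([f"U+{cp:04X}" for cp in code_points(sample)],
              [f"{u:02X}" for u in utf8_units(sample)],
              [f"{u:04X}" for u in utf16_units(sample)])
```

In both cases there is exactly one code point, but the number of code units depends on the encoding form: U+00E9 takes two UTF-8 units and one UTF-16 unit, while U+1F600 takes four UTF-8 units and two UTF-16 units (a surrogate pair).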

So you'll find that they are very careful in some technical definition 
parts to talk about "code points", but then in other sections they talk 
about "character" even though they are referring to the actual code point 
(i.e. the figure literally has the Unicode number in it!)

In fact, they sometimes even talk about "characters" in the totally 
non-encoding meaning of "glyph".

So yes, "character" is often ambiguous. It would be good to never use the 
word at all, and only talk about "code point" and "glyph" and one of the 
well-defined special terms like "combining character" or "replacement 
character".
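
As a rough illustration of why the word is slippery, the Python sketch below shows one glyph ("é") spelled as two different code point sequences, one of which uses a combining character. It relies only on the standard `unicodedata` module; `describe` and `same_text` are hypothetical helper names chosen for this example, and comparing after NFC normalization is just one possible policy, not something this thread prescribes.

```python
import unicodedata

precomposed = "\u00e9"   # a single code point
decomposed = "e\u0301"   # base letter followed by a combining character


def describe(s):
    """List (code point, Unicode name, combining class) for each code point in s."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c), unicodedata.combining(c))
            for c in s]


def same_text(a, b):
    """Compare two strings after normalizing both to NFC (an assumed policy)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    print(describe(precomposed))
    print(describe(decomposed))
    print(precomposed == decomposed)          # False: different code points
    print(same_text(precomposed, decomposed)) # True: canonically equivalent
    # The stored bytes differ too, which is what a file name comparison sees.
    print(precomposed.encode("utf-8").hex(), decomposed.encode("utf-8").hex())
```

The two strings render as the same glyph, yet a plain comparison of code points (or of the UTF-8 bytes) treats them as different, which is the situation the decomposed file names in the subject line run into.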

But to take a representative example from The Unicode Standard, Chapter 2: 
"Unicode Design Principles":

  Characters are represented by code points that reside only in a memory 
  representation, as strings in memory, on disk, or in data transmission. 
  The Unicode Standard deals only with character codes.

(any speling mistakes mine). In other words, from the very beginning of 
the standard, very basic design principles chapter, it starts talking 
about characters being represented by code points and explicitly says that 
it really only deals with CHARACTER CODES.

Yes, I'm sure you can argue ad infinitum that all the "equivalences" and 
other crap means that a "character" can sometimes mean just about 
anything, but I'd say that it's pretty damn reasonable to equate "Unicode 
character" with "code point" or "character code".

			Linus