Re: git on MacOSX and files with decomposed utf-8 file names

On Tue, 22 Jan 2008, Martin Langhoff wrote:
> 
> (slightly offtopic) are you praising UTF-8 as storage format (for disk
> and network) or in general? UTF-8-aware string ops like counting
> characters seem to me a horrendous thing at the ASM level.

I'm praising UTF-8 (without normalization) as a wonderful format where you 
can do 99.9% of everything without ever caring about all the expensive 
stuff.

But in order to do that, you really need to avoid normalization, and you 
also need to accept mis-formed UTF-8 strings (because even if it is real 
UTF-8, the string may actually be just a fragment of some larger string).
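
To illustrate the fragment point, here is a minimal sketch in Python (not from the original mail; the sample string and the helper name is_valid_utf8 are made up for the example). It cuts a valid UTF-8 buffer in the middle of a two-byte character: neither half is well-formed on its own, yet nothing is lost when the halves are put back together.

```python
# Sketch: a fragment of valid UTF-8 need not be valid UTF-8 itself.
text = "na\u00efve caf\u00e9"      # sample string, chosen only for illustration
data = text.encode("utf-8")        # U+00EF and U+00E9 each take two bytes

# Cut between the two bytes that encode U+00EF (offsets 2 and 3).
head, tail = data[:3], data[3:]


def is_valid_utf8(buf: bytes) -> bool:
    """Return True if buf decodes as strict UTF-8 (hypothetical helper)."""
    try:
        buf.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# head ends with a lead byte, tail starts with a continuation byte.
fragments_valid = (is_valid_utf8(head), is_valid_utf8(tail))

# Code that just passes the bytes along never notices the problem.
rejoined = head + tail
```

A routine that rejected or "repaired" either fragment would have corrupted the data; one that treats them as opaque bytes gets the original string back.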

Once you do that (and _only_ if you do that), then UTF-8 is actually a 
wonderful thing. You can consider it to be a traditional "everything is a 
stream of bytes", and everything that only cares about a stream of bytes 
will work wonderfully well.
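
As a rough sketch of what "only cares about a stream of bytes" can look like (Python, with a made-up path as the sample data), the code below splits, tests and rejoins a UTF-8 path without ever decoding it. This is safe because every byte of a multi-byte UTF-8 sequence has its high bit set, so an ASCII byte such as "/" can never show up inside another character.

```python
# Sketch: path handling on raw UTF-8 bytes, no decoding needed.
# The path is an invented example containing 2-byte and 3-byte characters.
path = "docs/r\u00e9sum\u00e9/\u65e5\u672c\u8a9e.txt".encode("utf-8")

# Split, test a suffix, and rejoin purely at the byte level.
parts = path.split(b"/")
has_txt = path.endswith(b".txt")
rebuilt = b"/".join(parts)

# Decoding happens only here, to show that the byte-level split
# landed on real character boundaries.
text_parts = [p.decode("utf-8") for p in parts]
```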

And then, the (actually relatively few) things that want to do things like 
show things on the screen, or check for equivalence, or worry about width 
of the characters, *those* can still do so. 
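
A sketch of how an equivalence check can live at the edge without touching the stored bytes (Python, using the standard unicodedata module; the function name equivalent and the sample strings are assumptions for the example). The two names below differ as bytes, one precomposed and one decomposed, which is the situation in the subject line of this thread.

```python
import unicodedata

# The same visible name in two forms (sample data for illustration).
composed = "M\u00e4rchen".encode("utf-8")      # U+00E4, precomposed
decomposed = "Ma\u0308rchen".encode("utf-8")   # 'a' + combining diaeresis

same_bytes = composed == decomposed


def equivalent(a: bytes, b: bytes) -> bool:
    """Canonical-equivalence check done only at comparison time.

    Inputs are never rewritten. If either side is not well-formed
    UTF-8, fall back to a plain byte comparison.
    """
    try:
        ta, tb = a.decode("utf-8"), b.decode("utf-8")
    except UnicodeDecodeError:
        return a == b
    return unicodedata.normalize("NFC", ta) == unicodedata.normalize("NFC", tb)


same_text = equivalent(composed, decomposed)
```

The normalized forms are temporaries that are thrown away after the comparison; what is stored stays exactly as it came in.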

So the beauty of UTF-8 is that you can switch between thinking of it like 
just a binary blob and thinking of it like text, and everything works 
(including the traditional C null-termination).
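
The null-termination claim can be checked by brute force. The sketch below (Python; both function names are invented for the example) encodes every code point other than U+0000 and confirms that none of them produces a zero byte, then mimics what C's strlen() would see on a NUL-terminated UTF-8 buffer.

```python
def utf8_has_no_stray_nul() -> bool:
    """Check that no code point except U+0000 encodes to a 0x00 byte."""
    for cp in range(1, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        if 0 in chr(cp).encode("utf-8"):
            return False
    return True


def c_strlen(buf: bytes) -> int:
    """Mimic C strlen(): number of bytes before the first NUL."""
    return buf.index(b"\x00")


# Three 3-byte characters followed by the terminator.
buf = "\u65e5\u672c\u8a9e".encode("utf-8") + b"\x00"
length = c_strlen(buf)
```

The length is in bytes, not characters, which is all a C string routine needs to know.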

And yes, that was obviously the explicit design goal. It's a good thing.

> More on topic, I suspect Kevin's experience is more on end-user apps,
> where input sanitization and even canonicalisation are common
> practice.

Sure. And I'm not arguing against them. Knowing the rules for combining 
characters is really important for input and output. 

> At least in Moodle we store *exactly*  what the user POSTed and
> cleanup^Wcorrupt it when displaying it, so that if it does happen that
> the cleanup was buggy, we never corrupted the data.

Absolutely. It's what the kernel does, and I think that's what perl does 
too for their "strings". It works really well. It also allows you to 
handle binary data (ie data that *really* isn't text) with shared routines 
etc etc.
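
A minimal sketch of the "store exactly, clean up on display" pattern (Python; the dict named store and the functions save and render are hypothetical stand-ins, not Moodle's or the kernel's actual code). The stored bytes are never modified, so a bug in the display cleanup cannot damage them.

```python
import html

store = {}  # hypothetical stand-in for a database table


def save(key: str, posted: bytes) -> None:
    """Store the bytes exactly as received, malformed or not."""
    store[key] = posted


def render(key: str) -> str:
    """Clean up only on the way out; the stored copy is untouched."""
    raw = store[key]
    # Malformed sequences become U+FFFD in the displayed copy only.
    text = raw.decode("utf-8", errors="replace")
    return html.escape(text)


# Sample input: markup, a valid two-byte character, and a stray 0xFF byte.
save("post1", b"<b>caf\xc3\xa9</b> \xff")
shown = render("post1")
```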

And that's the beauty of non-normalized (and possibly badly formed) UTF-8.

		Linus