Re: git on MacOSX and files with decomposed utf-8 file names

"Martin Langhoff" <martin.langhoff@xxxxxxxxx> · Tue, 22 Jan 2008 10:06:05 +1300

On Jan 22, 2008 7:12 AM, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> Now, git _also_ heavily depends on the actual encoding of those
> codepoints, since we create hashes etc, so in fact, as far ass git is
> concerned, names have to be in some particular encoding to be hashed, and
> UTF-8 is the only sane encoding for Unicode. People can blather about
> UCS-2 and UTF-16 and UTF-32 all they want, but the fact is, UTF-8 is
> simply technically superior in so many ways that I don't even understand
> why anybody ever uses anything else.
>
> So I would not disagree with using UTF-8 at all.

Linus,

(slightly offtopic) are you praising UTF-8 as storage format (for disk
and network) or in general? UTF-8-aware string ops like counting
characters seem to me a horrendous thing at the ASM level.

More on topic, I suspect Kevin's experience is more on end-user apps,
where input sanitization and even canonicalisation are common
practice. From a kernel and filesystems POV, a filename is data as
sacred as file data. On the webapp world, we "corrupt" user input
liberally to avoid XSS attacks and the like. In some cases, these
practices are stupid and can be replaced with escaping data properly,
but in other cases, the web platform is so broken that there's no
option.

At least in Moodle we store *exactly*  what the user POSTed and
cleanup^Wcorrupt it when displaying it, so that if it does happen that
the cleanup was buggy, we never corrupted the data.

So no point in calling eachother stupid this much. Once is enough ;-)
And no point in arguing that something that is ok for an end-user app
is a good design decision for an OS.

martin
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html