On Oct 22, 2010, at 9:18 AM, Jonathan Nieder wrote:
Drew Northup wrote:
This is part of my ongoing work to treat UTF-16 as text (in
other words, the crlf options will work and .gitattributes hacks won't
be required to display diffs, etc).
I would like to see the same thing for MacRoman-encoded text.[1] This
is the encoding used by classic Mac development tools such as
Metrowerks C/C++ (packaged as CodeWarrior) and Apple's Rez resource
compiler (even the version in OS X). Clearly, UTF-8 checkouts are not
an option here.
What's wrong with .gitattributes for this use case? I would think a
clean/smudge filter would produce very good behavior from most git
commands.
I wrote a Mac<->UTF-8 converter in C++ and set it as the clean/smudge
filter for .r (Rez) files. Checkouts were noticeably slower (on a
real machine, not one of my antiques). This would be much worse if I
also applied it to C and C++ source files (most, but not all, of which
are ASCII anyway).
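
For readers who have not set up such a filter, here is a minimal sketch of
the idea in Python rather than the C++ converter described above. The
function names and the "clean"/"smudge" command-line argument are
assumptions made for illustration; only Python's standard "mac_roman"
codec is relied on.

```python
import sys

def clean(worktree_bytes: bytes) -> bytes:
    """Working tree (MacRoman) -> repository form (UTF-8)."""
    return worktree_bytes.decode("mac_roman").encode("utf-8")

def smudge(repo_bytes: bytes) -> bytes:
    """Repository form (UTF-8) -> working tree (MacRoman)."""
    return repo_bytes.decode("utf-8").encode("mac_roman")

def run(mode: str, data: bytes) -> bytes:
    # Hypothetical dispatcher: the mode name is passed as the first argument.
    if mode == "clean":
        return clean(data)
    if mode == "smudge":
        return smudge(data)
    raise ValueError("mode must be 'clean' or 'smudge'")

if __name__ == "__main__" and sys.argv[1:2] in (["clean"], ["smudge"]):
    # A filter reads the file content on stdin and writes the result to stdout.
    sys.stdout.buffer.write(run(sys.argv[1], sys.stdin.buffer.read()))
```

Because a filter like this is started once per matching file, process
startup alone could plausibly account for part of the slowdown, whatever
language the converter is written in.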
If speed is the issue, maybe a built-in clean/smudge filter would
address that?
While the performance cost could be overlooked, a worse problem
occurred when I checked out a branch into which the conversion of
files from MacRoman to UTF-8 hadn't occurred. It automatically
dirtied my working tree, requiring me to temporarily disable the
filter attribute and reset --hard. I also resorted to checkout -f a
number of times -- a bad habit, I'm sure.
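
One way to picture why the tree ends up dirty, as a simplified model and
not Git's actual implementation: a file counts as modified when the clean
filter's output differs from the stored blob. The helper names below are
hypothetical, and the sample bytes are made up for the illustration.

```python
def clean(worktree_bytes: bytes) -> bytes:
    """Working tree (MacRoman) -> repository form (UTF-8)."""
    return worktree_bytes.decode("mac_roman").encode("utf-8")

def looks_modified(stored_blob: bytes, worktree_bytes: bytes) -> bool:
    # Simplified model: compare the cleaned working-tree content
    # against what the branch has stored.
    return clean(worktree_bytes) != stored_blob

worktree = b"caf\x8e\n"                        # MacRoman on disk
legacy_blob = b"caf\x8e\n"                     # branch never converted
converted_blob = "caf\u00e9\n".encode("utf-8")  # branch converted to UTF-8

print(looks_modified(legacy_blob, worktree))     # unconverted branch
print(looks_modified(converted_blob, worktree))  # converted branch
```

Under this model only files containing non-ASCII bytes are affected, since
ASCII content is identical in both encodings.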
In the end I concluded that (a) these files are definitely text, and
(b) they are natively MacRoman and should be stored that way. There
is no advantage to using UTF-8 since the tools can't handle it, and
even were one to write a UTF-8-capable Rez compiler, the resources it
outputs are still MacRoman-encoded, so no Unicode support is possible.
Finally, (c) the end-to-end principle applies. Don't convert data en
route, but wait until it's necessary. Premature conversion was the
curse of FTP; let's not repeat it. But Git should definitely convert
data to match the encoding of the display device; writing anything but
valid UTF-8 to a UTF-8 terminal is an error. The same applies in gitk.
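
To make the display-time conversion concrete, here is a small sketch,
assuming the stored encoding is already known to be MacRoman and the
terminal expects UTF-8. The function name and its default arguments are
hypothetical and do not describe anything Git provides.

```python
def to_display(stored: bytes,
               stored_encoding: str = "mac_roman",
               display_encoding: str = "utf-8") -> bytes:
    """Transcode stored bytes only at the point of output."""
    text = stored.decode(stored_encoding)
    # Characters the display encoding cannot represent are replaced
    # rather than written out as invalid bytes.
    return text.encode(display_encoding, errors="replace")

def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

raw = b"\xa5 item\n"   # 0xA5 is the bullet character in MacRoman
print(is_valid_utf8(raw))              # raw bytes are not valid UTF-8
print(is_valid_utf8(to_display(raw)))  # transcoded output is
```

The stored bytes are left untouched; only the copy sent to the terminal
is converted.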
Another issue (on which my thoughts are less clear) is the use of CR
newlines. CR is also native to classic Mac OS, but in contrast to
UTF-8, Mac developer tools are generally newline-agnostic, whereas
typical Unix programs are less forgiving in expecting LF -- so I've
been using linefeeds in my source code. However, it might be useful
to retain platform-customary newlines for the purpose of guessing
non-UTF character encodings: carriage returns would almost certainly
indicate MacRoman rather than ISO-8859-1. But a more complete and
robust solution would be to store the encoding somewhere, possibly in
the blob itself, or in the tree storing the filename.
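
The newline-based guess could be sketched roughly as follows. This is a
heuristic under stated assumptions (bare CR line endings suggest classic
Mac OS; anything else non-UTF-8 falls back to ISO-8859-1), and the
function name is hypothetical. It is exactly the kind of guesswork that
recording the encoding explicitly would avoid.

```python
def guess_encoding(data: bytes) -> str:
    """Guess a text encoding from byte content and newline style."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # Not UTF-8: fall back on the newline convention as a hint.
    if b"\r" in data and b"\n" not in data:
        return "mac_roman"   # bare CR newlines: classic Mac OS
    return "iso-8859-1"

print(guess_encoding(b"caf\x8e\rsecond line\r"))
print(guess_encoding(b"caf\xe9\nsecond line\n"))
```

Note that the hint disappears for MacRoman files saved with linefeeds,
which is the situation described above.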
Josh
[1] MacRoman is an extended ASCII character set native to classic Mac
OS. <http://en.wikipedia.org/wiki/Mac_OS_Roman>