Re: [RFC] Print diffs of UTF-16 to console / patches to email as UTF-8...?

On Oct 22, 2010, at 9:18 AM, Jonathan Nieder wrote:

Drew Northup wrote:

       This is part of my ongoing work to treat UTF-16 as text (in
other words, the crlf options will work and .gitattributes hacks won't
be required to display diffs, etc).

I would like to see the same thing for MacRoman-encoded text.[1] This is the encoding used by classic Mac development tools such as Metrowerks C/C++ (packaged as CodeWarrior) and Apple's Rez resource compiler (even the version in OS X). Clearly, UTF-8 checkouts are not an option here.

What's wrong with .gitattributes for this use case?  I would think a
clean/smudge filter would produce very good behavior from most git
commands.

I wrote a Mac<->UTF-8 converter in C++ and set it as the clean/smudge filter for .r (Rez) files. Checkouts were noticeably slower (on a real machine, not one of my antiques). This would be much worse if I also applied it to C and C++ source files (most, but not all, of which are ASCII anyway).
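For illustration, a filter of this kind can be sketched in a few lines of Python. This is not the C++ converter mentioned above. The script name, the filter name "macroman", and the wiring shown in the comments are assumptions made for the example. The sketch assumes the repository holds UTF-8 and the working tree holds MacRoman.

```python
# Sketch of a MacRoman <-> UTF-8 clean/smudge filter (all names hypothetical).
# Assumed wiring, not taken from the original message:
#   .gitattributes:  *.r filter=macroman
#   git config filter.macroman.clean  "python3 macroman_filter.py clean"
#   git config filter.macroman.smudge "python3 macroman_filter.py smudge"
import sys


def clean(worktree_bytes: bytes) -> bytes:
    """Working tree -> repository: MacRoman in, UTF-8 out."""
    return worktree_bytes.decode("mac_roman").encode("utf-8")


def smudge(repo_bytes: bytes) -> bytes:
    """Repository -> working tree: UTF-8 in, MacRoman out."""
    return repo_bytes.decode("utf-8").encode("mac_roman")


def run_filter(mode: str, stream_in, stream_out) -> None:
    """Read the whole file from stream_in, convert it, write it to stream_out."""
    convert = {"clean": clean, "smudge": smudge}[mode]
    stream_out.write(convert(stream_in.read()))


# Only act as a filter when invoked as "macroman_filter.py clean|smudge".
if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1] in ("clean", "smudge"):
    run_filter(sys.argv[1], sys.stdin.buffer, sys.stdout.buffer)
```

In this sketch, smudge raises UnicodeDecodeError when the stored blob is not valid UTF-8, and UnicodeEncodeError when a character has no MacRoman equivalent.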

If speed is the issue, maybe a built-in clean/smudge filter would
address that?

While the performance cost could be overlooked, a worse problem occurred when I checked out a branch in which the files had not yet been converted from MacRoman to UTF-8. The checkout immediately dirtied my working tree, requiring me to temporarily disable the filter attribute and run git reset --hard. I also resorted to git checkout -f a number of times -- a bad habit, I'm sure.

In the end I concluded that (a) these files are definitely text, and (b) they are natively MacRoman and should be stored that way. There is no advantage to using UTF-8 since the tools can't handle it, and even were one to write a UTF-8-capable Rez compiler, the resources it outputs are still MacRoman-encoded, so no Unicode support is possible.

Finally, (c) the end-to-end principle applies. Don't convert data en route, but wait until it's necessary. Premature conversion was the curse of FTP; let's not repeat it. But Git should definitely convert data to match the encoding of the display device; writing anything but valid UTF-8 to a UTF-8 terminal is an error. The same applies in gitk.
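As a rough sketch of what "convert only at the display device" could look like, the Python function below leaves stored bytes untouched and transcodes them only when they are about to be written to a terminal. The function name and parameters are hypothetical, and it assumes the stored encoding is already known by some means; nothing here reflects how Git is actually implemented.

```python
def to_terminal(data: bytes, stored_encoding: str,
                terminal_encoding: str = "utf-8") -> bytes:
    """Transcode stored bytes for display (hypothetical helper).

    Undecodable input becomes U+FFFD and unencodable characters become
    the target codec's replacement, so the result is always valid in
    the terminal's encoding.
    """
    text = data.decode(stored_encoding, errors="replace")
    return text.encode(terminal_encoding, errors="replace")
```

For example, to_terminal(b"caf\x8e", "mac_roman") yields the UTF-8 bytes for "café", while the stored blob itself is never rewritten.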

Another issue (on which my thoughts are less clear) is the use of CR newlines. CR is also native to classic Mac OS, but in contrast to their handling of UTF-8, Mac developer tools are generally newline-agnostic, whereas typical Unix programs are less forgiving and expect LF, so I've been using linefeeds in my source code. However, it might be useful to retain platform-customary newlines for the purpose of guessing non-UTF character encodings: carriage returns would almost certainly indicate MacRoman rather than ISO-8859-1. But a more complete and robust solution would be to store the encoding somewhere, possibly in the blob itself, or in the tree that stores the filename.
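The newline heuristic can be sketched in Python as follows. The function name is hypothetical, and the order of the checks (valid UTF-8 first, then bare carriage returns) is an assumption made for the example. Only the final rule, that CR newlines suggest MacRoman rather than ISO-8859-1, comes from the paragraph above.

```python
import re

# A CR that is not part of a CRLF pair, i.e. a classic Mac OS newline.
BARE_CR = re.compile(rb"\r(?!\n)")


def guess_encoding(data: bytes) -> str:
    """Guess a text encoding from content alone (heuristic, can be wrong)."""
    try:
        data.decode("utf-8")
        return "utf-8"  # also covers pure ASCII
    except UnicodeDecodeError:
        pass
    # Not UTF-8: use the newline convention as a hint about the platform.
    if BARE_CR.search(data):
        return "mac_roman"
    return "iso-8859-1"
```

A file that mixes conventions, or a MacRoman file already converted to LF newlines, defeats the guess, which is why storing the encoding explicitly would be more robust.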

Josh

[1] MacRoman is an extended ASCII character set native to classic Mac OS. <http://en.wikipedia.org/wiki/Mac_OS_Roman>



