Re: [RFC] Print diffs of UTF-16 to console / patches to email as UTF-8...?

Joshua Juran <jjuran@xxxxxxxxx> · Fri, 22 Oct 2010 11:28:44 -0700

On Oct 22, 2010, at 9:18 AM, Jonathan Nieder wrote:

Drew Northup wrote:

This is part of my ongoing work to treat UTF-16 as text (in other
words, the crlf options will work and .gitattributes hacks won't be
required to display diffs, etc).

I would like to see the same thing for MacRoman-encoded text.[1]  This  
is the encoding used by classic Mac development tools such as  
Metrowerks C/C++ (packaged as CodeWarrior) and Apple's Rez resource  
compiler (even the version in OS X).  Clearly, UTF-8 checkouts are not  
an option here.

What's wrong with .gitattributes for this use case?  I would think a
clean/smudge filter would produce very good behavior from most git
commands.

I wrote a Mac<->UTF-8 converter in C++ and set it as the clean/smudge  
filter for .r (Rez) files.  Checkouts were noticeably slower (on a  
real machine, not one of my antiques).  This would be much worse if I  
also applied it to C and C++ source files (most, but not all, of which  
are ASCII anyway).
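
For illustration only, the conversion step of such a filter might look
roughly like the following.  This is a Python sketch, not the C++
converter mentioned above; the function names are hypothetical, and it
assumes Python's standard "mac_roman" codec is an acceptable stand-in
for the MacRoman tables.

```python
def clean(data: bytes) -> bytes:
    """Working tree (MacRoman) -> repository (UTF-8)."""
    return data.decode("mac_roman").encode("utf-8")


def smudge(data: bytes) -> bytes:
    """Repository (UTF-8) -> working tree (MacRoman)."""
    return data.decode("utf-8").encode("mac_roman")


def run_filter(mode: str, data: bytes) -> bytes:
    """Dispatch on the mode name a filter driver would be invoked with."""
    if mode == "clean":
        return clean(data)
    if mode == "smudge":
        return smudge(data)
    raise ValueError("mode must be 'clean' or 'smudge'")
```

A wrapper script would read the blob from sys.stdin.buffer, pass it
through run_filter, and write the result to sys.stdout.buffer.  It
would be hooked up with a line such as "*.r filter=macroman" in
.gitattributes plus the filter.macroman.clean and filter.macroman.smudge
config keys (the driver name "macroman" is made up here).  Every
checkout then spawns the filter once per matching file, which is one
plausible source of the slowdown.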

If speed is the issue, maybe a built-in clean/smudge filter would
address that?

While the performance cost could be overlooked, a worse problem
occurred when I checked out a branch in which the files had not yet
been converted from MacRoman to UTF-8.  The checkout immediately
dirtied my working tree, requiring me to temporarily disable the
filter attribute and run reset --hard.  I also resorted to checkout -f
a number of times -- a bad habit, I'm sure.

In the end I concluded that (a) these files are definitely text, and  
(b) they are natively MacRoman and should be stored that way.  There  
is no advantage to using UTF-8 since the tools can't handle it, and  
even were one to write a UTF-8-capable Rez compiler, the resources it  
outputs are still MacRoman-encoded, so no Unicode support is possible.

Finally, (c) the end-to-end principle applies.  Don't convert data en
route; wait until conversion is necessary.  Premature conversion was
the curse of FTP; let's not repeat it.  But Git should definitely
convert data to match the encoding of the display device; writing
anything but valid UTF-8 to a UTF-8 terminal is an error.  The same
applies in gitk.
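
To make the idea concrete, here is a small Python sketch of converting
only at the point of display.  It is not how Git does or would do it;
the function name and its parameters are hypothetical, and the blob's
encoding is assumed to be known by some other means.

```python
def for_display(blob: bytes,
                blob_encoding: str = "mac_roman",
                terminal_encoding: str = "utf-8") -> bytes:
    """Re-encode stored bytes for output; the stored blob is untouched.

    Undecodable input and characters the terminal cannot represent are
    replaced rather than passed through as invalid bytes.
    """
    text = blob.decode(blob_encoding, errors="replace")
    return text.encode(terminal_encoding, errors="replace")
```

The stored bytes stay MacRoman from end to end; only the copy written
to the terminal is converted, so the output is always valid in the
terminal's encoding.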

Another issue (on which my thoughts are less clear) is the use of CR
newlines.  CR is also native to classic Mac OS, but in contrast to
their handling of UTF-8, Mac developer tools are generally
newline-agnostic, whereas typical Unix programs are less forgiving in
expecting LF -- so I've been using linefeeds in my source code.
However, it might be useful to retain platform-customary newlines for
the purpose of guessing non-UTF character encodings:  carriage returns
would almost certainly indicate MacRoman rather than ISO-8859-1.  But a
more complete and robust solution would be to store the encoding
somewhere, possibly in the blob itself, or in the tree storing the
filename.
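
A rough Python sketch of the newline-based guess described above.  The
function name, the order of the checks, and the "more bare CRs than
bare LFs" threshold are all assumptions made for illustration; it is a
heuristic and can guess wrong.

```python
def guess_encoding(data: bytes) -> str:
    """Guess a text encoding, using bare CR newlines as a MacRoman hint."""
    # Pure ASCII and valid UTF-8 need no guessing.
    for candidate in ("ascii", "utf-8"):
        try:
            data.decode(candidate)
            return candidate
        except UnicodeDecodeError:
            pass

    # Set CRLF pairs aside, then compare bare CR against bare LF.
    without_crlf = data.replace(b"\r\n", b"")
    bare_cr = without_crlf.count(b"\r")
    bare_lf = without_crlf.count(b"\n")

    # CR-only line endings suggest a classic Mac OS origin.
    if bare_cr > bare_lf:
        return "mac_roman"
    return "iso-8859-1"
```

For example, guess_encoding(b"caf\x8e\rline2\r") yields "mac_roman",
while the same word in ISO-8859-1 with LF newlines,
b"caf\xe9\nline2\n", yields "iso-8859-1".  A MacRoman file saved with
LF newlines would be misclassified, which is why storing the encoding
explicitly would be more robust.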

Josh

[1] MacRoman is an extended ASCII character set native to classic Mac  
OS.  <http://en.wikipedia.org/wiki/Mac_OS_Roman>
