Re: [RFC] Print diffs of UTF-16 to console / patches to email as UTF-8...?

Jonathan Nieder <jrnieder@xxxxxxxxx> · Fri, 22 Oct 2010 14:53:31 -0500

Joshua Juran wrote:

> I would like to see the same thing for MacRoman-encoded text.[1]
> This is the encoding used by classic Mac development tools such as
> Metrowerks C/C++ (packaged as CodeWarrior) and Apple's Rez resource
> compiler (even the version in OS X).  Clearly, UTF-8 checkouts are
> not an option here.

Yes, makes sense.

There are (at least) two approaches you could use here: treat the
content as precious (store the MacRoman bytes unchanged) and use
e.g. textconv for readable diffs, or store the content as UTF-8
text and use clean/smudge filters to ensure the checkout has the
right encoding.
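
For concreteness, here is a minimal sketch of the conversion both
approaches rely on.  It is not your C++ converter, just an
illustration: the function names and run_filter() are hypothetical,
and it assumes Python's built-in "mac_roman" codec is an acceptable
stand-in for the classic Mac OS encoding.

```python
def clean(data: bytes) -> bytes:
    """Working tree (MacRoman) -> repository (UTF-8)."""
    return data.decode("mac_roman").encode("utf-8")


def smudge(data: bytes) -> bytes:
    """Repository (UTF-8) -> working tree (MacRoman).

    Raises UnicodeEncodeError if the text contains a character that
    MacRoman cannot represent; a real filter would have to decide
    what to do in that case.
    """
    return data.decode("utf-8").encode("mac_roman")


def run_filter(mode: str, stdin, stdout) -> None:
    """Hypothetical driver: read all of stdin, convert, write stdout.

    Filters are fed the file content on standard input and must
    write the converted content to standard output.
    """
    if mode not in ("clean", "smudge"):
        raise ValueError("mode must be 'clean' or 'smudge'")
    convert = clean if mode == "clean" else smudge
    stdout.write(convert(stdin.read()))
```

With the first approach, the blob stays MacRoman and only clean()
would be used, as the conversion behind a diff driver's textconv
setting (which is handed a file name rather than stdin).  With the
second, both directions get wired up as filter.<name>.clean and
filter.<name>.smudge, called with sys.stdin.buffer and
sys.stdout.buffer.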

So let's see what happens with the latter:

> I wrote a Mac<->UTF-8 converter in C++ and set it as the
> clean/smudge filter for .r (Rez) files.  Checkouts were noticeably
> slower (on a real machine, not one of my antiques).

Vague ideas to mitigate that:

 a) allow a single clean/smudge filter invocation for a batch of
    files
 b) cache, as Jeff hinted
 c) allow custom "native" clean/smudge filters, executed using dlopen()
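
To make (a) and (b) a little less vague, here is a rough sketch of
how they could combine: one process converts a whole batch of
contents and remembers results keyed by a hash of the input.
Nothing like this is an existing git interface; ConversionCache and
convert_batch are made-up names, and the cache lives only in memory,
so it helps only if a single invocation handles many files.

```python
import hashlib


class ConversionCache:
    """Remember converted output, keyed by the SHA-1 of the input."""

    def __init__(self, convert):
        self._convert = convert   # e.g. a MacRoman -> UTF-8 function
        self._store = {}
        self.misses = 0           # number of real conversions done

    def get(self, data: bytes) -> bytes:
        key = hashlib.sha1(data).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._convert(data)
        return self._store[key]


def convert_batch(contents, cache):
    """Convert many file contents in one process, sharing the cache."""
    return [cache.get(data) for data in contents]
```

Identical contents are then converted once per batch.  A cache that
survives across separate filter invocations would need to be stored
on disk, which this sketch does not attempt.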

> While the performance cost could be overlooked, a worse problem
> occurred when I checked out a branch into which the conversion of
> files from MacRoman to UTF-8 hadn't occurred.  It automatically
> dirtied my working tree, requiring me to temporarily disable the
> filter attribute and reset --hard.  I also resorted to checkout -f a
> number of times -- a bad habit, I'm sure.

The jn/merge-renormalize topic from pu might help somewhat (or might
not).  In any event, if you have a test case, I would be happy to look
at it.

> In the end I concluded that (a) these files are definitely text, and
> (b) they are natively MacRoman and should be stored that way.  There
> is no advantage to using UTF-8 since the tools can't handle it, and
> even were one to write a UTF-8-capable Rez compiler, the resources
> it outputs are still MacRoman-encoded, so no Unicode support is
> possible.
> 
> Finally, (c) the end-to-end principle applies.

Yep.

Although "definitely text" seems somewhat abstract to me.  Is the
problem that "git diff" fails to default to --text in some situation?

>                                         But Git should definitely
> convert data to match the encoding of the display device; writing
> anything but valid UTF-8 to a UTF-8 terminal is in error.

Oh, this is what you mean.  Except for the log output encoding
(i18n.logOutputEncoding), git pays no attention to the display
encoding at all.

[...]
>                                                      But a more
> complete and robust solution would be to store the encoding
> somewhere, possibly in the blob itself, or in the tree storing the
> filename.

How about Jakub's idea of keeping it in .gitattributes (or some
similarly visible key/value store)?  Two reasons:

 1. When asked to declare encoding, half the time people will be
    wrong.  So it seems worthwhile to make the declared encoding
    visible enough to fix.

 2. Two ASCII files identical except that one is declared as
    latin1 and the other utf8 should be considered identical.
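
A small sketch of why reason 2 argues against putting the
declaration in the blob.  blob_id() follows the usual blob naming
(SHA-1 over a "blob <size>" header, a NUL, and the content); the
embedded "encoding ..." header is purely hypothetical and only
there to show the effect.

```python
import hashlib


def blob_id(data: bytes) -> str:
    """Object name of a blob with this content."""
    return hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()


def with_embedded_encoding(data: bytes, encoding: str) -> bytes:
    """HYPOTHETICAL in-blob declaration; git has no such format."""
    return b"encoding " + encoding.encode("ascii") + b"\n" + data


ascii_text = b"hello, world\n"

# Pure ASCII bytes mean the same text under either declaration.
same_text = ascii_text.decode("latin-1") == ascii_text.decode("utf-8")

# Declaration kept outside the blob: both files share one object.
ids_external = {blob_id(ascii_text) for _ in ("latin1", "utf8")}

# Declaration stored inside the blob: two distinct objects.
ids_embedded = {
    blob_id(with_embedded_encoding(ascii_text, enc))
    for enc in ("latin1", "utf8")
}
```

With the declaration outside the content, ids_external holds a
single name, so the two files compare as identical for free; with
it embedded, ids_embedded holds two.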

Thanks for some food for thought.