Re: [PATCH v7 0/7] convert: add support for different encodings

Junio C Hamano <gitster@xxxxxxxxx> · Fri, 23 Feb 2018 12:11:05 -0800

Junio C Hamano <gitster@xxxxxxxxx> writes:

> Lars Schneider <larsxschneider@xxxxxxxxx> writes:
>
>> I still think it would be nice to see diffs for arbitrary encodings.
>> Would it be an option to read the `encoding` attribute and use it in
>> `git diff`?
>
> Reusing that gitk-only thing and suddenly start doing so would break
> gitk users, no?  The tool expects the diff to come out encoded in
> the encoding that is specified by that attribute (which is learned
> from get_path_encoding helper) and does its thing.
>
> I guess that gitk uses diff-tree plumbing and you won't be applying
> this change to the plumbing, perhaps?  If so, it might not be too
> bad, but those who decided to postprocess "git diff" output (instead
> of "git diff-tree" output) mimicking how gitk does it by thinking
> that is the safe and sane thing to do will be broken by such a
> change.  You could do "use the encoding only when a command line
> option says so", but then people will add a configuration variable
> to turn it always on and these existing scripts will be broken.
>
> I do not personally have much sympathy for the last case (i.e. those
> who scripted around 'git diff' instead of 'git diff-tree' to get
> broken), so making the new feature only work with the Porcelain "git
> diff" might be an option.  I'll need a bit more time to formulate
> the rest of my thought ;-)

So we are introducing in this series a way to say in what encoding
the things should be placed in the working tree files (i.e. the
w-t-e attribute attached to the paths).  Currently there is no
mechanism to say what encoding the in-repo contents are and UTF-8 is
assumed when conversion from/to w-t-e is required, but there is no
fundamental reason why it shouldn't be customizable (if anything, as
a piece of fact on the in-repo data, in-repo-encoding is *more*
appropriate to be an attribute than w-t-e that can merely be project
preference at best, as I mentioned earlier in this thread).

We always use the in-repo contents when generating 'diff'.  I think
by "attribute to be used in diff", what you are really after is
to convert the in-repo contents to that encoding _BEFORE_ running
'diff' on it.  E.g. in-repo UTF-16 that can have NUL bytes all over
the place will not diff well with the xdiff machinery, but if you
first convert it to UTF-8 and have xdiff work on it, you can get
a reasonable result out of it.  It is unclear what encoding you want
your final diff output in (it is equally valid in such a set-up to
desire your patch output in UTF-16 or UTF-8), but assuming that you
want UTF-8 in your patch output, perhaps we do not have to break
gitk users by hijacking the 'encoding' attribute.  Instead what you
want is a single bit that says between in-repo or working tree which
representation should be given to the xdiff machinery.  When that
bit is set, then we

 - First ensure that both sides of the diff input are in the working
   tree encoding by running them through convert_to_working_tree();

 - Run xdiff on it;

 - Take the xdiff result, and run it through convert_to_git(),
   before emitting (this is optional, making this a one-and-a-half
   bit option).

That would allow you to say "I have in-repo data in UTF-16 which I
check out as UTF-8.  xdiff machinery is unhappy.  Please do
something." perhaps?
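
For illustration only, the three steps above could be sketched in
Python roughly as follows.  This is not Git code: to_working_tree()
and to_git() are hypothetical stand-ins for
convert_to_working_tree() and convert_to_git(), difflib stands in
for the xdiff machinery, and the "a/file"/"b/file" labels and the
reencode_output flag (the optional "half bit") are made up for the
example.

```python
import difflib


def to_working_tree(blob: bytes, repo_enc: str, wt_enc: str) -> bytes:
    """Hypothetical stand-in for convert_to_working_tree()."""
    return blob.decode(repo_enc).encode(wt_enc)


def to_git(data: bytes, repo_enc: str, wt_enc: str) -> bytes:
    """Hypothetical stand-in for convert_to_git()."""
    return data.decode(wt_enc).encode(repo_enc)


def diff_in_working_tree_encoding(old: bytes, new: bytes,
                                  repo_enc: str = "utf-16",
                                  wt_enc: str = "utf-8",
                                  reencode_output: bool = False) -> bytes:
    # Step 1: bring both sides into the working tree encoding.
    old_wt = to_working_tree(old, repo_enc, wt_enc)
    new_wt = to_working_tree(new, repo_enc, wt_enc)

    # Step 2: run the line-based diff (difflib here, xdiff in Git).
    old_lines = old_wt.decode(wt_enc).splitlines(keepends=True)
    new_lines = new_wt.decode(wt_enc).splitlines(keepends=True)
    patch = "".join(difflib.unified_diff(old_lines, new_lines,
                                         "a/file", "b/file"))
    out = patch.encode(wt_enc)

    # Step 3 (optional): convert the result back to the in-repo encoding.
    if reencode_output:
        out = to_git(out, repo_enc, wt_enc)
    return out


old = "hello\nworld\n".encode("utf-16")   # in-repo UTF-16, full of NULs
new = "hello\nthere\n".encode("utf-16")
print(diff_in_working_tree_encoding(old, new).decode("utf-8"))
```

With reencode_output left off, the patch comes out in UTF-8; turning
it on gives the same patch text re-encoded as UTF-16.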

The other way around (i.e. in-repo is UTF-8, but working tree
encoding is UTF-16) won't have xdiff issues, I would imagine.
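
A small Python check of why the direction matters, again only as a
sketch.  looks_binary() is a hypothetical helper that flags a buffer
containing a NUL byte near its start; it is an assumption made for
this example and not a copy of what Git does.  The diff is generated
from the in-repo contents, so only the first buffer below would be
fed to the diff machinery in this set-up.

```python
def looks_binary(buf: bytes, first_few_bytes: int = 8000) -> bool:
    """Hypothetical heuristic: a NUL byte near the start means 'binary'."""
    return b"\x00" in buf[:first_few_bytes]


text = "hello\nworld\n"
in_repo_utf8 = text.encode("utf-8")         # what the diff is run on
working_tree_utf16 = text.encode("utf-16")  # what ends up on disk

print(looks_binary(in_repo_utf8))
print(looks_binary(working_tree_utf16))
```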