Junio C Hamano <gitster@xxxxxxxxx> writes:

> Lars Schneider <larsxschneider@xxxxxxxxx> writes:
>
>> I still think it would be nice to see diffs for arbitrary encodings.
>> Would it be an option to read the `encoding` attribute and use it in
>> `git diff`?
>
> Reusing that gitk-only thing and suddenly start doing so would break
> gitk users, no?  The tool expects the diff to come out encoded in
> the encoding that is specified by that attribute (which is learned
> from get_path_encoding helper) and does its thing.
>
> I guess that gitk uses diff-tree plumbing and you won't be applying
> this change to the plumbing, perhaps?  If so, it might not be too
> bad, but those who decided to postprocess "git diff" output (instead
> of "git diff-tree" output) mimicking how gitk does it by thinking
> that is the safe and sane thing to do will be broken by such a
> change.  You could do "use the encoding only when a command line
> option says so", but then people will add a configuration variable
> to turn it always on and these existing scripts will be broken.
>
> I do not personally have much sympathy for the last case (i.e. those
> who scripted around 'git diff' instead of 'git diff-tree' to get
> broken), so making the new feature only work with the Porcelain "git
> diff" might be an option.  I'll need a bit more time to formulate
> the rest of my thought ;-)

So we are introducing in this series a way to say in what encoding
the things should be placed in the working tree files (i.e. the
w-t-e attribute attached to the paths).  Currently there is no
mechanism to say what encoding the in-repo contents are and UTF-8 is
assumed when conversion from/to w-t-e is required, but there is no
fundamental reason why it shouldn't be customizable (if anything, as
a piece of fact on the in-repo data, in-repo-encoding is *more*
appropriate to be an attribute than w-t-e that can merely be project
preference at best, as I mentioned earlier in this thread).

We always use the in-repo contents when generating 'diff'.  I think
by "attribute to be used in diff", what you are really after is to
convert the in-repo contents to that encoding _BEFORE_ running 'diff'
on it.  E.g. in-repo UTF-16 that can have NUL bytes all over the
place will not diff well with the xdiff machinery, but if you first
convert it to UTF-8 and have xdiff work on it, you can get a
reasonable result out of it.

It is unclear what encoding you want your final diff output in (it
is equally valid in such a set-up to desire your patch output in
UTF-16 or UTF-8), but assuming that you want UTF-8 in your patch
output, perhaps we do not have to break gitk users by hijacking the
'encoding' attribute.  Instead what you want is a single bit that
says which representation, in-repo or working tree, should be given
to the xdiff machinery.

When that bit is set, then we

 - First ensure that both sides of the diff input are in the working
   tree encoding by running them through convert_to_working_tree();

 - Run xdiff on it;

 - Take the xdiff result, and run it through convert_to_git(),
   before emitting (this is optional, making this a one-and-half bit
   option).

That would allow you to say "I have in-repo data in UTF-16 which I
check out as UTF-8.  xdiff machinery is unhappy.  Please do
something." perhaps?

The other way around (i.e. in-repo is UTF-8, but working tree
encoding is UTF-16) won't have xdiff issues, I would imagine.
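
To make the three steps above concrete, here is a rough Python sketch
of the idea.  It is only an illustration, not how git would implement
it: to_working_tree() and to_git() are hypothetical stand-ins for
convert_to_working_tree() and convert_to_git(), Python's difflib
stands in for the xdiff machinery, and the "a/file" and "b/file"
labels and the emit_in_repo_encoding flag are made up for the example.

```python
import difflib


def to_working_tree(blob: bytes, repo_enc: str, wt_enc: str) -> bytes:
    """Hypothetical stand-in for convert_to_working_tree()."""
    return blob.decode(repo_enc).encode(wt_enc)


def to_git(data: bytes, repo_enc: str, wt_enc: str) -> bytes:
    """Hypothetical stand-in for convert_to_git()."""
    return data.decode(wt_enc).encode(repo_enc)


def diff_via_working_tree(old: bytes, new: bytes,
                          repo_enc: str = "utf-16",
                          wt_enc: str = "utf-8",
                          emit_in_repo_encoding: bool = False) -> bytes:
    # Step 1: bring both sides into the working tree encoding.
    a = to_working_tree(old, repo_enc, wt_enc).splitlines(keepends=True)
    b = to_working_tree(new, repo_enc, wt_enc).splitlines(keepends=True)

    # Step 2: run the line-based diff (difflib here, xdiff in git).
    patch = b"".join(difflib.diff_bytes(
        difflib.unified_diff, a, b, b"a/file", b"b/file"))

    # Step 3 (the optional "half bit"): convert the result back to
    # the in-repo encoding before emitting it.
    if emit_in_repo_encoding:
        patch = to_git(patch, repo_enc, wt_enc)
    return patch


# In-repo contents are UTF-16, so the raw blobs are full of NUL bytes.
old_blob = "hello\nworld\n".encode("utf-16")
new_blob = "hello\nwörld\n".encode("utf-16")

utf8_patch = diff_via_working_tree(old_blob, new_blob)
utf16_patch = diff_via_working_tree(old_blob, new_blob,
                                    emit_in_repo_encoding=True)
```

With the flag off the patch comes out in UTF-8 and contains no NUL
bytes; with it on, the same patch text is re-encoded to UTF-16.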