> On 23 Feb 2018, at 21:11, Junio C Hamano <gitster@xxxxxxxxx> wrote:
>
> Junio C Hamano <gitster@xxxxxxxxx> writes:
>
>> Lars Schneider <larsxschneider@xxxxxxxxx> writes:
>>
>>> I still think it would be nice to see diffs for arbitrary encodings.
>>> Would it be an option to read the `encoding` attribute and use it in
>>> `git diff`?
>>
>> Reusing that gitk-only thing and suddenly start doing so would break
>> gitk users, no?  The tool expects the diff to come out encoded in
>> the encoding that is specified by that attribute (which is learned
>> from get_path_encoding helper) and does its thing.
>>
>> I guess that gitk uses diff-tree plumbing and you won't be applying
>> this change to the plumbing, perhaps?  If so, it might not be too
>> bad, but those who decided to postprocess "git diff" output (instead
>> of "git diff-tree" output) mimicking how gitk does it by thinking
>> that is the safe and sane thing to do will be broken by such a
>> change.  You could do "use the encoding only when a command line
>> option says so", but then people will add a configuration variable
>> to turn it always on and these existing scripts will be broken.
>>
>> I do not personally have much sympathy for the last case (i.e. those
>> who scripted around 'git diff' instead of 'git diff-tree' to get
>> broken), so making the new feature only work with the Porcelain "git
>> diff" might be an option.  I'll need a bit more time to formulate
>> the rest of my thought ;-)
>
> So we are introducing in this series a way to say in what encoding
> the things should be placed in the working tree files (i.e. the
> w-t-e attribute attached to the paths).
> Currently there is no mechanism to say what encoding the in-repo
> contents are and UTF-8 is assumed when conversion from/to w-t-e is
> required, but there is no fundamental reason why it shouldn't be
> customizable (if anything, as a piece of fact on the in-repo data,
> in-repo-encoding is *more* appropriate to be an attribute than w-t-e
> that can merely be project preference at best, as I mentioned
> earlier in this thread).

Correct.

> We always use the in-repo contents when generating 'diff'.  I think
> by "attribute to be used in diff", what you are really after is
> to convert the in-repo contents to that encoding _BEFORE_ running
> 'diff' on it.  E.g. in-repo UTF-16 that can have NUL bytes all over
> the place will not diff well with the xdiff machinery, but if you
> first convert it to UTF-8 and have xdiff work on it, you can get
> reasonable result out of it.  It is unclear what encoding you want
> your final diff output in (it is equally valid in such a set-up to
> desire your patch output in UTF-16 or UTF-8), but assuming that you
> want UTF-8 in your patch output, perhaps we do not have to break
> gitk users by hijacking the 'encoding' attribute.  Instead what you
> want is a single bit that says between in-repo or working tree which
> representation should be given to the xdiff machinery.

I fear that we could confuse users with an additional knob/bit that
defines what we diff against. Git has always diffed against in-repo
content and I feel it should stay that way.

However, I agree with your earlier emails that "working-tree-encoding"
is just one half of the feature. I also think it would be nice to be
able to define the "in-repo-encoding" as well. Then we could define
something like this:

    *.foo    text in-repo-encoding=UTF-16LE

This tells Git that the file is stored as UTF-16LE. This would help Git
generate a diff via UTF-8 conversion. I feel that the final patch
should be in UTF-16LE again.
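
To make the "diff via UTF-8 conversion" idea concrete, here is a rough
sketch in Python. It is only an illustration of the data flow, not how
Git or xdiff would implement it: the function name diff_via_utf8, the
file names, and the sample content are all made up for this example,
and Python's difflib stands in for the real diff machinery.

```python
import difflib


def diff_via_utf8(old_blob: bytes, new_blob: bytes,
                  in_repo_encoding: str = "utf-16-le") -> str:
    """Hypothetical helper: decode both blobs from the in-repo encoding
    and diff them as text instead of as raw bytes."""
    old_lines = old_blob.decode(in_repo_encoding).splitlines(keepends=True)
    new_lines = new_blob.decode(in_repo_encoding).splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(old_lines, new_lines, "a/file.foo", "b/file.foo")
    )


# Sample in-repo content stored as UTF-16LE (made-up data).
old_blob = "hello\nworld\n".encode("utf-16-le")
new_blob = "hello\nthere\n".encode("utf-16-le")

# The raw UTF-16LE bytes contain NULs, which is why a byte-oriented
# diff does not handle them well.
has_nul = b"\x00" in old_blob

# Diff on the decoded text, then pick the encoding of the final patch.
patch_text = diff_via_utf8(old_blob, new_blob)
patch_utf8 = patch_text.encode("utf-8")       # patch output in UTF-8
patch_utf16 = patch_text.encode("utf-16-le")  # patch output in UTF-16LE again
```

The last two lines show the two possible answers to the "what encoding
should the final patch be in" question. The diff itself is computed the
same way in both cases.
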

Maybe over time we could then deprecate the "encoding" attribute as the
"in-repo-encoding" attribute serves a similar purpose (maybe gitk can
switch to it).

In that case we could also do things like this:

    *.bar    text working-tree-encoding=SHIFT-JIS in-repo-encoding=UTF-16LE

SHIFT-JIS encoded files would be reencoded to UTF-16LE on checkin. On
checkout the opposite would happen. This way we would lift the "UTF-8 is
the only in-repo encoding" limitation of the current w-t-e implementation.

Does this sound sensible to you?

That being said, I think "in-repo-encoding" would deserve its own series.

- Lars