しらいしななこ <nanako3@xxxxxxxxxxxxxx> writes:

> Quoting Junio C Hamano <gitster@xxxxxxxxx>:
>
>> This teaches "diff hunk header" function about custom character
>> encoding per path via gitattributes(5) so that it can sensibly
>> chomp a line without truncating a character in the middle.
>>
>> Signed-off-by: Junio C Hamano <gitster@xxxxxxxxx>
>> ---
>>
>>  * This is not intended for serious inclusion, but was done more
>>    as a demonstration of the concept, hence [3/2].
>
> Why not?  It looks a useful addition for us Japanese people.

(offtopic) I was once told that "us Japanese people" is a bad thing
to say in public because it sets an unfriendly tone by creating a
psychological divide between "us" and "others".  After all I am one
of you ;-)

The reason I do not like the patch as-is is because I have doubts
about the way "coding" acts in the patch.

There already is clean/smudge filter mechanism.  Even if your work
tree has files in euc-jp or Shift_JIS, you could choose to internally
use UTF-8 at git object level.  Then the part that deals with diff
hunk headers (the topic of the patch we are discussing) would have to
work only on UTF-8 data.

    Side note: when getting the data from a file in the work tree,
    we convert into internal representation before running diff (see
    diff.c::diff_populate_filespec()), but we do not convert it back
    to external representation by running the smudge filter on the
    diff output.  We might optionally want to but if somebody is
    going to do this, the patch accepting side also needs to be
    modified to reverse the conversion.

The solution with clean/smudge is not applicable to everybody.  It
needs to be agreed upon project-wide what encoding is used as the
canonical encoding for the project, and when the project chooses to
use UTF-8, the above would become a cleaner and workable approach.

If the project, on the other hand, chooses to use a non UTF-8
encoding (e.g. euc-jp) as the canonical representation, something
like my patch may be necessary.
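
To make "chomp a line without truncating a character in the middle"
concrete for euc-jp, here is a minimal sketch.  It is not the code
from the patch: the names euc_jp_char_len() and
truncate_euc_jp_sketch(), and the (line, len, max) signature, are
made up for this illustration.  It only relies on the EUC-JP byte
layout (0x8f starts a 3-byte character, 0x8e or 0xa1..0xfe starts a
2-byte character, everything else is a single byte).

```c
#include <stddef.h>

/*
 * Hypothetical helper: number of bytes occupied by the EUC-JP
 * character whose first byte is c.
 */
static size_t euc_jp_char_len(unsigned char c)
{
	if (c == 0x8f)
		return 3;	/* SS3 prefix + two bytes (JIS X 0212) */
	if (c == 0x8e || (0xa1 <= c && c <= 0xfe))
		return 2;	/* SS2 + kana, or a JIS X 0208 pair */
	return 1;		/* ASCII and anything unrecognised */
}

/*
 * Hypothetical sketch: return the largest length <= max at which
 * "line" (len bytes of EUC-JP) can be cut without splitting a
 * multi-byte character.
 */
static size_t truncate_euc_jp_sketch(const char *line, size_t len, size_t max)
{
	size_t pos = 0;

	if (len <= max)
		return len;	/* nothing to cut */
	while (pos < max) {
		size_t n = euc_jp_char_len((unsigned char)line[pos]);
		if (pos + n > max)
			break;	/* next character would straddle the limit */
		pos += n;
	}
	return pos;
}
```

With the four-byte line 'a', 0xa4, 0xa2, 'b' (the middle two bytes
being one character), a limit of 2 yields 1 and a limit of 3 yields
3, so the cut never falls between 0xa4 and 0xa2.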
Between these two ways to skin the cat, I do not want to close the
door for either one of them too early, although I am somewhat partial
to "internally everything is UTF-8" approach.

Maybe we would want to use "coding" (short, sweet and nice name for
an attribute) to mean a canned smudge/clean filter pair that runs
to/from UTF-8 iconv, making the "internally, everything is UTF-8"
approach a more officially supported option.  If we choose to go that
route, the way "coding" attribute was used in the patch directly
conflicts with that design, as the world view my "coding" patch takes
is "whatever coding project chooses is used internally, and the
attribute is used to teach coding specific actions to the underlying
logic".

>> +static struct {
>> +	const char *coding;
>> +	sane_truncate_fn fn;
>> +} builtin_truncate_fn[] = {
>> +	{ "euc-jp", truncate_euc_jp },
>> +	{ "utf-8", NULL },
>> +};
>
>Can you also do JIS and Shift JIS? I ask because many of my
>old notes are in Shift JIS and I think it is the same for many
>other people.

I guess somebody else could (hint, hint,...).

Shift_JIS should be more or less straightforward to add.

With the current code structure, however, ISO-2022 (you said "JIS" --
Japanese often use that word to mean 7-bit ISO-2022 and so did you in
this context) is a bit cumbersome to handle, as you cannot just
truncate but also have to add a few escape bytes to go back to ASCII
at the end of line.
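
For whoever takes the hint, a Shift_JIS counterpart could look
roughly like the sketch below.  It is only an illustration: the
names sjis_is_lead() and truncate_sjis_sketch() and the
(line, len, max) signature are hypothetical and need not match the
sane_truncate_fn type in the patch.  It assumes the usual Shift_JIS
layout, where 0x81..0x9f and 0xe0..0xfc start a two-byte character.
Because the second byte can fall in the ASCII range, the line has to
be scanned from its beginning; looking only at the byte at the cut
point is not enough.

```c
#include <stddef.h>

/* Hypothetical helper: does c start a two-byte Shift_JIS character? */
static int sjis_is_lead(unsigned char c)
{
	return (0x81 <= c && c <= 0x9f) || (0xe0 <= c && c <= 0xfc);
}

/*
 * Hypothetical sketch: return the largest length <= max at which
 * "line" (len bytes of Shift_JIS) can be cut without splitting a
 * two-byte character.  Scans forward from the start of the line
 * because trail bytes are ambiguous on their own.
 */
static size_t truncate_sjis_sketch(const char *line, size_t len, size_t max)
{
	size_t pos = 0;

	if (len <= max)
		return len;	/* nothing to cut */
	while (pos < max) {
		size_t n = sjis_is_lead((unsigned char)line[pos]) ? 2 : 1;
		if (pos + n > max)
			break;	/* next character would straddle the limit */
		pos += n;
	}
	return pos;
}
```

ISO-2022 would not fit this shape, since the function would also
have to append the escape bytes that return to ASCII, not just pick
a shorter length.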